Project Report
on
A CHURN PREDICTION MODEL: ANALYSIS OF MACHINE LEARNING
TECHNIQUES FOR CHURN PREDICTION AND FACTOR IDENTIFICATION
Submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF ENGINEERING
BY
NAFEESA IBRAHIM 160317733013
SARAH SIDDIQUI 160317733014
MARZIA RAZA 160317733016
CERTIFICATE
This is to certify that the project report on “A Churn Prediction Model: Analysis of
Machine Learning Techniques for Churn Prediction and Factor
Identification” being submitted by Ms. Nafeesa Ibrahim (160317733013), Ms.
Sarah Siddiqui (160317733014), and Ms. Marzia Raza (160317733016) in partial
fulfilment for the award of the Degree of Bachelor of Engineering in Computer Science
by Osmania University is a record of bona fide work carried out by them under my
guidance and supervision.
The results embodied in this project report have not been submitted to any other
University or Institute for the award of any Degree or Diploma.
HEAD OF DEPARTMENT
DR. SYED RAZIUDDIN
Professor and Head Department of CSE
DCET, Hyderabad
DECLARATION
This is to certify that the work reported in the present project entitled “A Churn Prediction
Model: Analysis of Machine Learning Techniques for Churn Prediction and Factor
Identification” is a record of work done by us in the Department of Computer Science &
Engineering, Deccan College of Engineering and Technology, Osmania University. The reports
are based on the project work done entirely by us and not copied from any other source.
The results presented in this dissertation have been verified and are found to be satisfactory. The
results embodied in the dissertation have not been submitted to any other university for the
award of any degree or diploma.
ACKNOWLEDGEMENT
ABSTRACT
In the telecom sector, a huge volume of data is generated daily due to a vast client base. Decision makers
and business analysts emphasize that acquiring new customers is costlier than retaining existing ones.
Business analysts and customer relationship management (CRM) analysts need to know the reasons why
customers churn, as well as the behavior patterns in existing churn customers’ data. This report reviews the
relevant studies on customer churn analysis in the telecommunication industry. Churn analysis is one of the
most widely used analyses in subscription-oriented industries: it examines customer behavior to predict
which customers are about to leave their service agreement with a company. The proposed model first
classifies churn customer data using classification algorithms, among which the Random Forest (RF) and
Decision Tree (DT) algorithms achieved 66.44% correctly classified instances. We work with various
algorithms to determine which one is most accurate.
LIST OF TABLES
LIST OF FIGURES
Seq. no Name of The Figure Page no
LIST OF ABBREVIATIONS
DL - deep learning
LR - logistic regression
ML - machine learning
RF - random forest
TABLE OF CONTENTS
Declaration i
Acknowledgement ii
Abstract iii
List of Tables iv
List of Figures v
List of Abbreviations vi
1.2 Adaptability 17
1.3 Problem Statement 17
1.4 Objective 17
1.5 Application 18
1.6 Scope 18
2.4 Customer churn prediction for telecommunication 21
2.5 Data mining as a tool to predict the churn behaviour among Indian bank customers 22
Chapter 3 SYSTEM ANALYSIS 23-27
3.1 Software Development Life Cycle 22
7.5 Results 65
Chapter 8 CONCLUSION 68
8.1 Conclusion 68
Chapter 1
INTRODUCTION
This project reviews the relevant studies on customer churn analysis in the telecommunication
industry. Churn analysis is one of the most widely used analyses in subscription-oriented
industries: it examines customer behavior to predict which customers are about to leave their
service agreement with a company. The proposed model first classifies churn customer data using
classification algorithms, among which the Random Forest (RF) and Decision Tree (DT)
algorithms achieved 66.44% correctly classified instances. We work with various algorithms to
determine which one is most accurate.
Machine Learning is a system that can learn from examples through self-improvement, without
being explicitly coded by a programmer. The breakthrough is the idea that a machine can
learn from data (i.e., examples) on its own to produce accurate results. Machine learning
combines data with statistical tools to predict an output. This output is then used by corporations
to make actionable insights. Machine learning is closely related to data mining and Bayesian
predictive modeling. The machine receives data as input and uses an algorithm to formulate answers.
Machine learning is used for a variety of tasks such as fraud detection, predictive maintenance,
portfolio optimization, task automation, and so on.
[Figure (1.1): Machine Learning - data and rules are fed to a compute stage that produces an output]
Machine learning is the brain where all the learning takes place. The core objectives of machine
learning are learning and inference. First, the machine learns through the discovery of patterns.
This discovery is made thanks to the data. One crucial part of the data scientist's job is to choose
carefully which data to provide to the machine. The list of attributes used to solve a problem is
called a feature vector.
The machine uses algorithms to simplify reality and transform this discovery into
a model. The learning stage is therefore used to describe the data and summarize it into a model.
The life of Machine Learning programs is straightforward and can be summarized in the following
points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to new
sets of data.
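As a sketch, the loop above might look like the following in Python with scikit-learn. The dataset here is synthetic and the model choice, split ratio, and parameters are illustrative assumptions, not the project's actual pipeline:

```python
# Minimal sketch of the collect/train/test/predict loop with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: collect data (a synthetic stand-in for customer records)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Steps 4-5: train the algorithm, then test it on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Steps 8-9: once the results are satisfying, use the model on new data
prediction = model.predict(X_test[:1])
```

In practice steps 6-8 (collect feedback, refine, loop) would repeat the fit/score cycle with different features or parameters until the accuracy is acceptable.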
Machine learning can be grouped into two broad learning tasks: Supervised and Unsupervised.
Supervised learning
An algorithm uses training data and feedback from humans to learn the relationship of given inputs
to a given output. For instance, a practitioner can use marketing expense and weather forecasts as
input data to predict the sales of cans. There are two categories of supervised learning:
o Classification task
o Regression task
Classification
Classification predicts the class of objects whose class label is unknown. Its objective is to find a derived model
that describes and distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e., data objects whose class labels are well known. An
example of a case where the data analysis task is classification: a bank loan officer wants to
analyze the data to know which customers (loan applicants) are risky and which are safe.
In such examples, a model or classifier is constructed to predict categorical labels. These
labels are risky or safe for loan application data, and yes or no for marketing data.
How does classification work? With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The data classification process includes
two steps:
● Building the Classifier (learning step)
In this step, a classification algorithm analyzes the training data to build the classifier.
● Using the Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate
the accuracy of the classification rules. The classification rules can be applied to new data
tuples if the accuracy is considered acceptable.
Regression
Regression methods are used to predict the value of a response variable from one or
more predictor variables, where the variables are numeric. Listed below are the forms of
regression:
o Linear
o Multiple
o Weighted
o Polynomial
o Nonparametric
o Robust
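As an illustration of the linear form, scikit-learn's LinearRegression can fit a numeric response from a numeric predictor. The usage/bill figures below are invented for the sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictor: monthly usage (numeric); response: monthly bill (numeric).
# The data is exactly linear by construction: bill = usage + 5.
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([15.0, 25.0, 35.0, 45.0, 55.0])

reg = LinearRegression().fit(X, y)
predicted = reg.predict([[60]])[0]   # should recover bill = 60 + 5
```

Because the toy data is perfectly linear, least squares recovers the slope (1.0) and intercept (5.0) exactly.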
Unsupervised learning
Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is
that through mimicry, the machine is forced to build a compact internal representation of its world.
Some of the types of unsupervised learning methods are: K-means clustering, KNN (k-nearest
neighbors), hierarchical clustering, anomaly detection, neural networks, Principal Component
Analysis, Independent Component Analysis, and the Apriori algorithm.
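A minimal K-means sketch, assuming scikit-learn; the two-dimensional points are invented, and note that the algorithm receives no labels at all:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points; K-means must discover them unaided.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_   # cluster assignment for each point
```

With well-separated groups like these, the first three points land in one cluster and the last three in the other, regardless of which cluster gets which numeric label.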
Augmentation: Machine learning that assists humans with their day-to-day tasks, personally
or commercially, without having complete control of the output. Such machine learning is used in
different ways, such as virtual assistants, data analysis, and software solutions. The primary use is to
reduce errors due to human bias.
Automation: Machine learning, which works entirely autonomously in any field without the need
for any human intervention. For example, robots performing the essential process steps in
manufacturing plants.
Finance Industry: Machine learning is growing in popularity in the finance industry. Banks
mainly use ML to find patterns in their data, but also to prevent fraud.
Deep learning is computer software that mimics the network of neurons in a brain. It is a subset
of machine learning and is called deep learning because it makes use of deep neural networks. The
machine uses different layers to learn from the data. The depth of the model is represented by the
number of layers in the model. Deep learning is the new state of the art in terms of AI. In deep
learning, the learning phase is done through a neural network.
Reinforcement learning is a subfield of machine learning in which systems are trained by receiving
virtual "rewards" or "punishments," essentially learning by trial and error. Google's DeepMind has
used reinforcement learning to beat a human champion in the game of Go. Reinforcement learning
is also used in video games to improve the gaming experience by providing smarter bots. Some of
the most famous algorithms are:
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
AI in Finance: The financial technology sector has already started using AI to save time, reduce
costs, and add value. Deep learning is changing the lending industry by using more robust credit
scoring. Credit decision-makers can use AI for robust credit lending applications to achieve faster,
more accurate risk assessment, using machine intelligence to factor in the character and capacity
of applicants.
AI in Marketing: AI is a valuable tool for customer service management and personalization
challenges. Improved speech recognition in call-center management and call routing as a result of
the application of AI techniques allows a more seamless experience for customers.
[Figure (1.2): Deep Learning is a subset of Machine Learning, which is itself a subset of Artificial Intelligence]
TensorFlow: The most famous deep learning library in the world is Google's TensorFlow. Google
uses machine learning in all of its products to improve the search engine, translation, image
captioning, and recommendations. To give a concrete example, Google users can experience a faster
and more refined search with AI: if the user types a keyword in the search bar, Google provides
a recommendation about what the next word could be.
Google wants to use machine learning to take advantage of their massive datasets to give users the
best experience. Three different groups use machine learning:
● Researchers
● Data scientists
● Programmers.
They can all use the same toolset to collaborate with each other and improve their efficiency.
Google does not just have any data; they have the world's most massive computer, so TensorFlow was
built to scale. TensorFlow is a library developed by the Google Brain Team to accelerate machine
learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has several
wrappers in several languages like Python, C++ or Java.
1.2 Adaptability
There are several reasons why retaining existing customers is important. The first reason is that
markets have become saturated to the point where new customers are scarce. The second
reason is cost. New customer acquisition can be costly to a business for numerous reasons. It has
been reported that the acquisition of new customers can be over ten times more costly to a business
than retaining existing customers. This is largely because, in saturated markets, the acquisition of new
customers often involves enticing customers away from competitors through offers of expensive
special deals.
1.3 Problem Statement
Customer Churn Prediction (CCP) is a challenging activity for decision makers and the
machine learning community because, most of the time, churn and non-churn
customers have similar features. From different experiments on customer churn and
related data, it can be seen that a classifier shows different accuracy levels for different
zones of a dataset. In such situations, a correlation can easily be observed between the level of
a classifier's accuracy and the certainty of its prediction. If a mechanism can be defined to
estimate the classifier's certainty for different zones within the data, then the expected
classifier's accuracy can be estimated even before the classification.
1.4 Objective
The goal is to present details about the availability of public datasets and what kinds
of customer details are available in each dataset for predicting customer churn.
Secondly, we compare the various predictive modeling methods that have been
used in the literature for predicting churners using different categories of customer
records, and then quantitatively compare their performances. Finally, we summarize
what kinds of performance metrics have been used to evaluate the existing churn
prediction methods. Analyzing all three perspectives is crucial for developing
a more efficient churn prediction system for the telecom industry.
1.5 Application
The Customer Churn Model application is a Python application that gives us the
advantage of predicting whether a customer will stay with or leave a company, based on a
number of attributes provided and collected from existing databases. The tendency of a
subscriber to move on to a competitor is calculated using different algorithms, and the one
providing the best predictability is selected.
1.6 Scope
The scope of our project can be summarized as follows:
The telecommunications sector has become one of the main industries in developed
countries. Technical progress and the increasing number of operators have raised the level
of competition. Companies are working hard to survive in this competitive market,
depending on multiple strategies. Three main strategies have been proposed to generate
more revenue: acquire new customers, upsell to existing customers, and increase the
retention period of customers. However, comparing these strategies by the value of the
return on investment (RoI) of each has shown that the third strategy is the most
profitable, proving that retaining an existing customer costs much less than
acquiring a new one, in addition to being considered much easier than the upselling
strategy. To apply the third strategy, companies have to reduce the potential for
customer churn. The application helps in recognizing this pattern.
1.7 Organization of Thesis
The proposed dissertation consists of nine chapters, including the Introduction and Conclusion.
Chapter 1 presents the problem statement, objective, applications, and scope, and Chapter 2
emphasizes the detailed literature survey. Chapter 3 describes the system analysis with the
existing system, its advantages and disadvantages, the proposed system, and
its advantages. Chapter 4 describes the system specifications with the feasibility study, technical
approach, and architecture. Chapter 5 describes the system design with UML diagrams.
Chapter 6 describes the implementation, the algorithms used, and sample code. Chapter 7
specifies testing strategies and methodologies along with test cases and sample
screenshots. Chapter 8 describes the conclusion and Chapter 9 the future enhancements.
Chapter 2
LITERATURE SURVEY
2.1. Title: Customer churn management: Retaining high-margin customers with
customer relationship management techniques
Author: C. Geppert
2.3. Title: A rule-based method for customer churn prediction in
telecommunication services
Author: Y. Huang, B. Huang, and M.-T. Kechadi
ABSTRACT: … methods are not able to provide the prediction probability, which is helpful for
… The paper proposes an algorithm, called CRL. CRL applies several heuristic methods to learn a set of rules,
and then uses them to predict customers' potential behaviors. The experiments …
2.4. Title: Customer churn prediction for telecommunication
ABSTRACT: Among the evaluated methods, the random forest feature selector has not yet
been widely compared to the other methods. In our evaluation, we test how
the implemented feature selection can affect (i.e., improve) the accuracy of
… that ReliefF and random forest enabled the classifiers to achieve the highest …
2.5. Title: Data mining as a tool to predict the churn behaviour among Indian
bank customers
Author: M. Kaur, K. Singh, and N. Sharma
ABSTRACT: The customer is the heart and soul of any organization. The era of globalization and
cutthroat competition has changed the basic concept of marketing: marketing is no longer confined
to selling products to customers; the objective is to reach the hearts of customers
so that they feel a sense of belonging towards the organization and hence remain loyal.
In dynamic market scenarios, where companies come up with varied options
every now and then, customer retention is a critical area to ponder, as customers often
churn from one company to another; this is happening at an alarming rate and
is becoming the most important issue in customer relationship management. So, predicting
customer behavior and taking remedial actions beforehand is the need of the hour. But
ever-growing databases make it difficult to analyze the data and to forecast future
trends. The solution lies in the use of data mining tools for predicting the churn behavior of
customers. This paper throws light on the underlying technology and the prospective applications
of data mining in predicting the churn behavior of customers, hence paving the path for better
customer relationship management.
Chapter 3
SYSTEM ANALYSIS
3.1 Software Development Life Cycle (SDLC)
There are various software development approaches defined and designed that are
used during the development process of software; these approaches are also
referred to as "Software Development Process Models" (e.g., the Waterfall model, Incremental
model, V-model, Iterative model, etc.). Each process model follows a particular life cycle
in order to ensure success in the process of software development. Software life cycle models
describe the phases of the software cycle and the order in which those phases are executed.
Each phase produces deliverables required by the next phase in the life cycle.
Requirements are translated into design. Code is produced according to the design, which
is called the development phase. After coding and development, testing verifies the
deliverable of the implementation phase against the requirements.
Requirement Gathering and Analysis: In this phase, all possible requirements of the
system to be developed are captured and documented in a requirement specification
document.
Design: In this phase the system and software design is prepared from the requirement
specifications which were studied in the first phase. System design helps in
specifying hardware and system requirements and also helps in defining the overall
system architecture. The system design specifications serve as input for the next phase
of the model.
Implementation / Coding: On receiving the system design documents, the work is divided
into modules/units and actual coding starts. Since the code is produced in this phase,
it is the main focus for the developer. This is the longest phase of the software
development life cycle.
Testing: After the code is developed, it is tested against the requirements to make sure
that the product is actually solving the needs addressed and gathered during the
requirements phase. During this phase, unit testing, integration testing, system testing,
and acceptance testing are done.
Deployment: After successful testing, the product is delivered / deployed to the
customer for their use.
Maintenance: Once the customer starts using the developed system, the
actual problems come up and need to be solved from time to time. This process,
in which care is taken of the developed product, is known as maintenance.
Figure (3.1): Waterfall Model
3.3 Proposed System
The proposed churn prediction model is evaluated using metrics such as accuracy, precision,
recall, F-measure, and receiver operating characteristic (ROC) area. The results will reveal
that our proposed churn prediction model produces better churn classification. For
classification purposes we use multiple machine learning algorithms, such as Decision Tree,
Random Forest, Logistic Regression (LR), Support Vector Machine (SVM), XGBoost, and
AdaBoost classifiers, and then compare the results to select the classifier with the highest
accuracy. Hence, we improve the accuracy and performance.
Chapter 4
SYSTEM SPECIFICATION
4.1 Feasibility Study
A feasibility analysis tells how compatible the proposed system is with the resources
available to the development team. The objective is to determine quickly, at minimum
expense, how to solve the problem. The following feasibility studies are conducted to
assess the feasibility of the proposed system. A feasibility study is conducted once the
problem is clearly understood. The feasibility study is a high-level capsule version of the
entire system analysis and design process. The objective is to determine whether the
proposed system is feasible or not, to help us find the minimum-expense way to solve the
problem, and to determine whether the problem is worth solving. The feasibility study is
termed in three ways, as follows:
currently possesses. In others, a workforce may simply need to improve their skills.
Behavioral feasibility is as much about “can they use it” as it is about “will they use
it.”
4.4.2 Hardware Requirements
Processor - any quad-core mobile processor
Speed - 2.40 GHz
RAM - 1 GB (min)
Storage - 100 MB (min)
A. DATA PREPROCESSING
1) NOISE REMOVAL: Noise removal is very important for making the data useful, because noisy
data can lead to poor results. In the telecom dataset, there are a lot of missing values, incorrect
values like "Null", and imbalanced attributes. In our dataset, the
number of features is 21. We analyzed the dataset for filtering and reduced the number of
features so that it contains only useful features. A number of features are filtered using
the delimiter function in Python. Table 1 shows the 21 features which are available in
the dataset.
Table 1
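Handling missing or "Null" values could look like the following pandas sketch. The column names and values are hypothetical stand-ins, not the dataset's actual 21 features:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of a telecom dataset with missing entries.
df = pd.DataFrame({
    "tenure": [12, 3, np.nan, 45],
    "monthly_charges": [70.5, np.nan, 50.0, 99.9],
    "churn": ["Yes", "No", "No", "Yes"],
})

# Replace missing numeric values with the column median, a common simple fix;
# rows could instead be dropped with df.dropna() if few values are missing.
df["tenure"] = df["tenure"].fillna(df["tenure"].median())
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())
```

After this step the frame contains no missing values, so classifiers that cannot handle NaN entries can be trained on it directly.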
There are two types of customers in the telecom dataset. First are the non-churn
customers; they remain loyal to the company and are rarely affected by competitor
companies. The second type is churn customers. The proposed model targets churn
customers and identifies the reasons behind their migration.
Furthermore, it devises retention strategies to overcome the problem of customers switching to other
companies. In this study, a range of machine learning techniques is used to classify
customers' data using labeled datasets, in order to assess which algorithm best
classifies the customers into the churn and non-churn categories.
Chapter 5
SYSTEM DESIGN
5.1 UML Diagrams
UML stands for Unified Modelling Language. UML is a standardized general-purpose
modelling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software.
In its current form, UML comprises two major components: a meta-model and a
notation. In the future, some form of method or process may also be added to, or
associated with, UML. The Unified Modelling Language is a standard language for
specifying, visualizing, constructing, and documenting the artifacts of a software system,
as well as for business modelling and other non-software systems. The UML represents a
collection of best engineering practices that have proven successful in the modelling of
large and complex systems. The UML is a very important part of developing object-
oriented software and the software development process. The UML mostly uses graphical
notations to express the design of software projects.
GOALS - The primary goals in the design of the UML are as follows:
Provide users a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
Be independent of particular programming languages and development processes.
Provide a formal basis for understanding the modeling language.
Encourage the growth of the OO tools market.
Support higher-level development concepts such as collaborations, frameworks,
patterns, and components.
Integrate best practices.
5.1.1 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a
type of static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among the
classes. It shows which class contains which information.
Fig (5.1): class diagram
The class diagram displays the major classes that help in the functioning
of the application. The central class of the application is the User class.
This class has important functions that initiate the processing, feature extraction, and selection of
the important features or attributes. It then chooses and applies the different algorithms
used in the model to compute the accuracy, precision, and recall, and to create a visualization.
5.1.2 Use Case Diagram
A use case diagram in the UML is a type of behavioural diagram defined by and created
from a Use-case analysis. Its purpose is to present a graphical overview of the
functionality provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the
actors in the system can be depicted.
The user provides the dataset containing the customer information and features whose
values are both binary and textual. The system takes the values, checks for
null values, and uses the most important ones, converting them to binary values for easier
prediction by the algorithms. The algorithms use the values of the attributes to build a
model; the user then compares the models to check which algorithm has the better
predictive power.
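Converting word-valued features to binary/numeric values, as described above, can be sketched with pandas. The column names and values here are illustrative assumptions:

```python
import pandas as pd

# Hypothetical text-valued columns from a churn dataset.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "churn":  ["Yes", "No", "No", "Yes"],
})

# Map the Yes/No target to 1/0, and one-hot encode multi-valued text columns.
df["churn"] = df["churn"].map({"Yes": 1, "No": 0})
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
```

With drop_first=True one redundant indicator column is dropped, leaving a single binary gender_Male column; most scikit-learn classifiers can then consume the frame directly.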
5.1.4 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modelling
Language, activity diagrams can be used to describe the business and operational step-
by-step workflows of components in a system. An activity diagram shows the overall
flow of control.
5.2 Flow Chart Diagram
A flowchart is a diagram that depicts a process, system or computer algorithm. They are
widely used in multiple fields to document, study, plan, improve and communicate
often complex processes in clear, easy-to-understand diagrams. Flowcharts, sometimes
spelled as flow charts, use rectangles, ovals, diamonds and potentially numerous other
shapes to define the type of step, along with connecting arrows to define flow and
sequence. They can range from simple, hand-drawn charts to comprehensive computer-
drawn diagrams depicting multiple steps and routes. Flowcharts are sometimes called by
more specialized names such as Process Flowchart, Process Map, Functional Flowchart,
Business Process Mapping, Business Process Modeling and Notation (BPMN), or Process
Flow Diagram (PFD). They are related to other popular diagrams, such as Data Flow
Diagrams (DFDs) and Unified Modeling Language (UML) Activity Diagrams.
Chapter 6
IMPLEMENTATION
6.2. Implementation
6.2.1. Installation of Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution
that allows you to launch applications and easily manage conda packages, environments, and channels
without using command-line commands. Navigator can search for packages on Anaconda Cloud or in
a local Anaconda Repository. It is available for Windows, macOS, and Linux.
In order to run, many scientific packages depend on specific versions of other packages. Data scientists
often use multiple versions of many packages and use multiple environments to separate these different
versions.
The command line program conda is both a package manager and an environment manager, to help data
scientists ensure that each version of each package has all the dependencies it requires and works
correctly.
Navigator is an easy, point-and-click way to work with packages and environments without needing to
type conda commands in a terminal window.
The following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
PANDAS
Pandas has optional dependencies (like Matplotlib for plotting). Therefore, the easiest way to get Pandas set
up is to install it through a package like the Anaconda distribution, "a cross-platform distribution for
data analysis and scientific computing."
In order to use Pandas in your Python IDE (Integrated Development Environment), like Jupyter
Notebook or Spyder (both of them come with Anaconda by default), you need to import the Pandas
library first. Importing a library means loading it into memory so that it is there for you to work
with. In order to import Pandas, all you have to do is run the following code:
• import pandas as pd
• import numpy as np
Usually you would add the second part ('as pd') so you can access Pandas with 'pd.command' instead
of needing to write 'pandas.command' every time you need to use it. Also, you would import NumPy
as well, because it is a very useful library for scientific computing with Python. Now Pandas is ready
for use! Remember, you need to do this every time you start a new Jupyter Notebook, Spyder file,
etc.
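Once imported, a quick sanity check of a dataset usually looks like the sketch below. The file name and columns in the comment are hypothetical, so the sketch builds the frame inline rather than reading a real CSV:

```python
import pandas as pd

# In practice: df = pd.read_csv("telecom_churn.csv")  # hypothetical file name
df = pd.DataFrame({
    "customer_id": ["A1", "A2", "A3"],
    "tenure": [1, 34, 2],
    "churn": ["Yes", "No", "Yes"],
})

n_rows, n_cols = df.shape                      # dimensions of the dataset
churn_rate = (df["churn"] == "Yes").mean()     # fraction of churners
```

df.head(), df.info(), and df.describe() are the other typical first calls for inspecting a freshly loaded dataset.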
NUMPY
NumPy is one such powerful library for array processing, along with a large collection of high-level
mathematical functions to operate on these arrays. These functions fall into categories like linear
algebra, trigonometry, statistics, matrix manipulation, etc.
NumPy's main object is a homogeneous multidimensional array. Unlike Python's array class, which
only handles one-dimensional arrays, NumPy's ndarray class can handle multidimensional arrays and
provides more functionality. NumPy's dimensions are known as axes. For example, the array below
has 2 dimensions, or 2 axes, namely rows and columns. Sometimes the number of dimensions is also
known as the rank of that particular array or matrix.
NumPy is imported using the following command. Note that here np is the convention followed for the
alias so that we don't need to write numpy every time.
• import numpy as np
NumPy is the basic library for scientific computations in Python, and this section illustrates some of its
most frequently used functions. Understanding NumPy is the first major step in the journey of
machine learning and deep learning.
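For example, a 2-axis ndarray (rows and columns) and a couple of the vectorized operations mentioned above:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 axes: rows and columns

ndim = a.ndim                  # number of axes: 2
shape = a.shape                # (2, 3): 2 rows, 3 columns
col_sums = a.sum(axis=0)       # sum down each column -> [5, 7, 9]
total = a.sum()                # sum of all elements -> 21
```

Operating along an axis (here axis=0) is the NumPy idiom that replaces explicit Python loops over rows or columns.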
Sklearn
In Python, the scikit-learn library has pre-built functionality under sklearn.preprocessing.
The next thing to do is feature extraction. Feature extraction is an attribute reduction process. Unlike
feature selection, which ranks the existing attributes according to their predictive significance, feature
extraction actually transforms the attributes. The transformed attributes, or features, are linear
combinations of the original attributes.
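Both ideas can be sketched with scikit-learn: StandardScaler from sklearn.preprocessing rescales attributes, and PCA performs feature extraction by building linear combinations of the originals. The data below is synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # synthetic data: 100 samples, 4 attributes

# Preprocessing: rescale each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Feature extraction: project onto 2 principal components,
# i.e. 2 linear combinations of the original 4 attributes.
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
```

Scaling before PCA matters: without it, attributes measured on larger scales would dominate the extracted components.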
Matplotlib
Matplotlib is the most popular Python plotting library. It is a low-level library with a MATLAB-like interface, which offers a lot of freedom at the cost of having to write more code.
Matplotlib can be installed with either pip or conda:
• pip install matplotlib
• conda install matplotlib
Matplotlib is especially good for creating basic graphs such as line charts, bar charts, and histograms. It can be imported by typing:
• import matplotlib.pyplot as plt
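A minimal sketch of a line chart and a bar chart (the numbers are illustrative only; the Agg backend is used so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: runs headless
import matplotlib.pyplot as plt

# Made-up monthly churn counts for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
churned = [120, 95, 130, 110]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, churned, marker="o")  # line chart
ax1.set_title("Churned customers")
ax2.bar(months, churned)               # same data as a bar chart
ax2.set_title("Churned customers (bars)")
fig.savefig("churn_plot.png")          # write the figure to disk
```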
Algorithms
The following are the basic steps involved in performing the random forest algorithm:
1. Pick N random records from the dataset (a bootstrap sample).
2. Build a decision tree based on these N records.
3. Choose the number of trees needed in your algorithm and repeat steps 1 and 2.
4. For a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.
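The steps above can be sketched with scikit-learn's RandomForestClassifier; synthetic data stands in for the churn dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset: 500 customers, 8 attributes.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the "number of trees" from step 3; each tree is grown
# on a bootstrap sample (step 1) and the forest classifies new records
# by majority vote (step 4).
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 2))  # test-set accuracy
```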
Logistic regression is a statistical technique used to predict the probability of a binary response based on one or more independent variables. That is, given certain factors, logistic regression predicts an outcome that has two values, such as 0 or 1, pass or fail, yes or no, etc.
F(z) = 1 / (1 + e^(-z))
where z = w0 + w1·x1 + w2·x2 + … + wn·xn.
Here w0, w1, w2, ..., wn are the regression coefficients of the model, calculated by Maximum Likelihood Estimation, and x1, x2, ..., xn are the features or independent variables. F(z) gives the probability of the positive outcome, and using this probability we classify the given data point x into one of the two categories.
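A worked example of the formula, with made-up coefficients w0, w1, w2 and a single data point:

```python
import numpy as np

def sigmoid(z):
    # F(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one data point, for illustration only.
w0 = -1.5                 # intercept
w = np.array([0.8, 2.0])  # w1, w2
x = np.array([1.0, 0.5])  # x1, x2

z = w0 + w.dot(x)         # z = w0 + w1*x1 + w2*x2 = 0.3
p = sigmoid(z)            # probability of the positive class
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)  # 0.574 1
```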
6.5. DECISION TREES:
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. Decision trees learn simple if-then-else decision rules from the data features. The deeper the tree, the more complex the decision rules and the fitter the model.
A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
The result is a tree with decision nodes and leaf nodes. A decision node has two or more branches; a leaf node represents a classification or decision. The topmost decision node, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data.
Splitting: The process of partitioning the data set into subsets. Splits are formed on a particular
variable.
Pruning: The shortening of branches of the tree. Pruning reduces the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch.
Tree Selection: The process of finding the smallest tree that fits the data. Usually this is the tree
that yields the lowest cross-validated error.
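A small illustration with scikit-learn's DecisionTreeClassifier on the Iris dataset, where max_depth keeps the tree small (a simple stand-in for pruning):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree; max_depth limits how far splitting can go,
# keeping the decision rules simple.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned if-then-else rules; the topmost split is the root node.
print(export_text(tree, feature_names=list(iris.feature_names)))
print(tree.get_depth())  # 2
```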
6.6. XGBoost:
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient
boosting framework. In prediction problems involving unstructured data (images, text, etc.)
artificial neural networks tend to outperform all other algorithms or frameworks. However, when
it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered
best-in-class right now.
• A wide range of applications: can be used to solve regression, classification, ranking, and user-defined prediction problems.
• Portability: runs smoothly on Windows, Linux, and OS X.
• Languages: supports all major programming languages, including C++, Python, R, Java, Scala, and Julia.
• Cloud integration: supports AWS, Azure, and YARN clusters and works well with Flink, Spark, and other ecosystems.
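XGBoost exposes a scikit-learn-style estimator (XGBClassifier). As a sketch that runs without the separate xgboost package, scikit-learn's own gradient-boosting classifier illustrates the same gradient-boosting framework on synthetic tabular data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for telecom records.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Trees are added one at a time, each one fitting the errors of the
# current ensemble: the core idea that XGBoost optimises and scales.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=1)
gbm.fit(X_train, y_train)
print(round(gbm.score(X_test, y_test), 2))  # test-set accuracy
```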
6.8. Sample Code
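A condensed, self-contained sketch of the kind of pipeline described in the preceding sections, using a small made-up frame in place of the telecom dataset (column names are illustrative, not the project's real ones):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data (built in code here, instead of pd.read_csv on the real file).
df = pd.DataFrame({
    "tenure":       [1, 34, 2, 45, 8, 22, 10, 62, 13, 16] * 10,
    "monthly_bill": [29.85, 56.95, 53.85, 42.3, 99.65,
                     89.1, 29.75, 56.15, 49.95, 18.95] * 10,
    "contract":     ["month", "year", "month", "year", "month",
                     "month", "year", "year", "month", "month"] * 10,
    "churn":        ["Yes", "No", "Yes", "No", "Yes",
                     "No", "No", "No", "Yes", "No"] * 10,
})

# 2. Preprocess: encode categorical attributes as numbers.
df["contract"] = LabelEncoder().fit_transform(df["contract"])
df["churn"] = (df["churn"] == "Yes").astype(int)

# 3. Split and train.
X = df.drop(columns="churn")
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Evaluate on the held-out test set.
acc = accuracy_score(y_test, model.predict(X_test))
print(round(acc, 2))
```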
Chapter 7
TESTING
7.1 Testing Methodologies
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies, and the finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner.
Debugging through testing is one of the most critical aspects of computer programming; without a program that works, the system would never produce the output for which it was designed. Testing is best performed when users are asked to assist in identifying all errors and bugs. Sample data are used for testing; it is not the quantity but the quality of the data that matters in testing. Testing is aimed at ensuring that the system works accurately and efficiently before live operation commences.
The development process is characterized by phases such as requirement gathering, analysis, design, coding, testing, and deployment.
7.2 Test Cases
7.5. Results:
7.5.1. Evaluation of Algorithms
7.5.3. Logistic Regression
7.5.4. SVM
Fig (7.4): support vector machine: classification report, confusion matrix and
receiver operating characteristic
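The evaluation artefacts named in the caption (classification report, confusion matrix, ROC) can be produced with scikit-learn's metrics module; the labels and probabilities below are hypothetical:

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)

# Hypothetical true labels, predicted labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7]

print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
print(classification_report(y_true, y_pred))
print(round(roc_auc_score(y_true, y_prob), 3))  # area under the ROC curve
```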
7.5.1. Random Forest:
7.5.5. XGBoost
Conclusion
In the present competitive telecom market, churn prediction is a significant issue for CRM (customer relationship management): valuable customers are retained by identifying similar groups of customers and providing competitive offers and services to the respective groups.
Therefore, researchers in this domain have been looking at the key factors of churn in order to retain customers and solve the problems of CRM. In this study, a customer churn model is provided for data analytics and validated through standard evaluation metrics.
The results show that the proposed churn model performs well across several machine learning techniques and indicate which among them is the most accurate.
Chapter 9
Future Enhancement
In the future we will further investigate eager learning and lazy learning approaches for better churn prediction. Lazy learning requires less training time (at the cost of slower prediction), while eager learning allows intensive training on the data.
The study can be further extended to explore the changing behavioural patterns of churned customers by applying artificial intelligence techniques for prediction and trend analysis.
REFERENCES
K. B. Oseman, S. B. M. Shukor, N. A. Haris, F. Bakar, Data Mining in Churn Analysis Model for Telecommunication Industry, Journal of Statistical Modeling and Analytics, Vol. 1, pp. 19-27, 2010.
S. V. Nath, Customer Churn Analysis in the Wireless Industry: A Data Mining Approach, Technical Report, retrieved from http://download.oracle.com/owsf_2003/40332.pdf, April 14, 2014.
https://www.optimove.com/learning-center/customer-churn-prediction-and-prevention
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-
b93975f7a1f1
https://www.medcalc.org/manual/logisticregression.php
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-
code/
https://en.wikipedia.org/wiki/Random_forest
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
https://datascienceplus.com/predict-customer-churn-logistic-regression-decision-tree-and-
random-forest/