Business Analytics
Business Analytics Life Cycle
Learning Objectives
By the end of this lesson, you will be able to:
Explore business analytics life cycle phases to define the
roadmap and achieve business goals
List the challenges faced in each phase
Perform and outline business analytics life cycle phases with
the help of loan default prediction case study
Business Analytics Life Cycle (BALC)
Introduction
BALC is a framework that describes the process of using data and analytics to drive business decisions.
The phases involved are:
• Business understanding
• Data collection
• Data exploration
• Data modeling
• Data deployment
• Monitoring and maintenance
Business Understanding: Overview
This phase involves understanding and addressing the business problem or opportunity.
• Identifying the stakeholders
• Establishing the goals and objectives
• Defining the scope of the problem
Business Understanding: Example
A retail company wants to improve its customer retention. The phase would involve:
• Defining the problem
• Identifying stakeholders
• Establishing the scope
• Defining success metrics
• Identifying data sources
• Determining objectives
Business Understanding: Challenges
• Ambiguous problem definition
• Insufficient domain expertise
• Inadequate stakeholder involvement
• Limited data availability
• Frequent changes in business needs
Data Collection: Overview
Data is collected from various sources, including internal and external sources.
Steps for preparing data for analysis:
• Cleanse
• Integrate
• Transform
Data Collection: Example
After completing the business understanding phase, the retail company will collect data.
Data collected can be related to:
• Customer transactions
• Demographics
• Customer feedback
• Social media
• Competitors
• Website traffic
Data Collection: Challenges
• Data quality
• Data availability
• Data security and privacy
• Data integration
• Data volume
To overcome these challenges, it is essential to have a structured approach to data collection.
Data Exploration: Overview
The goal of data exploration is to gain insights and identify patterns, trends, and outliers that can
inform subsequent analysis.
Data exploration techniques:
• Descriptive statistics
• Data visualization
• Correlation analysis
• Data cleaning
• Outlier detection
Data Exploration: Example
After completing the data collection phase, the retail company will explore the collected data.
Examples of data exploration techniques are:
• Descriptive statistics: Calculate summary statistics
• Data visualization: Create visualizations
• Correlation analysis: Calculate correlation coefficients between variables
• Outlier detection: Identify and investigate outliers
• Data cleaning: Identify and address missing values
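These techniques can be sketched with pandas on a small, hypothetical customer dataset (column names and values below are illustrative, not from the case study):

```python
import numpy as np
import pandas as pd

# Hypothetical customer data standing in for the retail company's dataset
df = pd.DataFrame({
    "spend": [120.0, 95.0, 130.0, np.nan, 5000.0, 110.0],
    "visits": [12, 9, 14, 7, 480, 11],
})

summary = df.describe()                   # descriptive statistics
corr = df["spend"].corr(df["visits"])     # correlation analysis
missing = df.isna().sum()                 # data cleaning: count missing values

# Outlier detection: flag rows whose "visits" z-score exceeds 2
z = (df["visits"] - df["visits"].mean()) / df["visits"].std()
outliers = df[z.abs() > 2]
```

Here the row with 480 visits stands out as the single outlier, and the missing `spend` value is surfaced by the missing-value count.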
Data Exploration: Challenges
• Data quality issues
• Data complexity
• Bias and subjectivity
• Data privacy and security
• Time constraints
Data Modeling: Overview
This phase creates a mathematical representation of the data that captures the relationships between different variables.
Types of data models:
• Descriptive
• Predictive
• Prescriptive
Data Modeling: Example
After completing the data exploration phase, the retail company can use the following
data modeling approaches:
• Define the problem
• Clean and preprocess the data
• Select the modeling technique
• Train the model
• Validate the model
• Apply the model
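The steps above can be sketched with scikit-learn; the synthetic data and the choice of logistic regression here are placeholders for illustration, not the case study's actual dataset or model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cleaned, preprocessed data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # problem definition: binary target

# Select a technique, train, validate, and apply the model
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)            # train
val_acc = accuracy_score(y_val, model.predict(X_val))   # validate
preds = model.predict(X_val)                            # apply
```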
Data Modeling: Challenges
• Data quality
• Data privacy and security
• Overfitting
• Underfitting
• Interpretability
• Model selection
Deployment: Overview
Model deployment is the process of integrating a data model into a production environment to
generate predictions or support decision-making. It involves:
• Preparing the model
• Selecting a deployment environment
• Testing and validation
• Integrating with other systems
• Monitoring and maintenance
Deployment: Challenges
• Model drift
• Scalability
• Integration with existing systems
• Security
• Regulatory compliance
• User adoption
• Data governance
A successful model deployment requires planning, testing, and maintenance to meet business needs.
Monitoring and Maintenance: Overview
It is essential for ensuring the accuracy, reliability, and usefulness of data-driven insights.
Some key considerations are:
• Performance monitoring
• Data quality monitoring
• Model validation
• Continuous improvement
• Data security
It is an ongoing process that requires regular attention and adjustment.
Monitoring and Maintenance: Techniques
• Performance monitoring
• Error analysis
• Feedback loops
• Automated testing
• Versioning
• Regular retraining
• Security monitoring
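One common, library-agnostic performance-monitoring check is the Population Stability Index (PSI), which compares the distribution of a feature or model score at deployment time with its current distribution. This sketch is a generic illustration added here, not part of the original deck:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.
    Values near 0 suggest a stable distribution; > 0.2 often signals drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 5000)   # scores at deployment time
stable = rng.normal(0, 1, 5000)     # later scores, same distribution
shifted = rng.normal(0.5, 1, 5000)  # later scores after drift
```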
Case Study: Loan Default Prediction
Problem Statement
When customers fail to pay their loans on time, banks suffer losses. These losses, which amount to
millions of dollars every year, have a significant impact on a country's economic growth.
In this case study, you will predict whether a person will default on a loan by examining various
factors such as location, loan balance, funded amount, and more.
A training and testing dataset of 67,463 rows by 35 columns and 28,913 rows by 34 columns,
respectively, is provided.
Source: [Link]
Data Description
• ID (Int): Unique ID of a representative
• Loan amount (Int): Loan amount applied for
• Funded amount (Int): Loan amount funded
• Funded amount investor (Float): Loan amount approved by the investors
• Term (Int): Term of the loan (in months)
• Batch enrolled (Object): Batch numbers assigned to representatives
• Interest rate (Float): Interest rate (%) on the loan
• Grade (Object): Grade by the bank
• Subgrade (Object): Subgrade by the bank
• Employment duration (Object): Duration of employment
• Home ownership (Float): Ownership of home
• Verification status (Object): Income verification by the bank
• Payment plan (Object): If any payment plan has been started against the loan
• Loan title (Object): Loan title provided
Data Description
• Revolving balance (Int): Total credit revolving balance
• Revolving utilities (Float): Amount of credit a representative is using relative to the revolving balance
• Total accounts (Int): Total number of credit lines available in a representative's credit line
• Initial list status (Object): Unique listing status of the loan (W for waiting and F for forwarded)
• Open account (Int): Number of open credit lines in the representative's credit line
• Public record (Int): Number of derogatory public records
• Debit to income (Float): Ratio of the representative's total monthly debt repayment divided by self-reported monthly income, excluding mortgage
• Delinquency two years (Int): Number of 30+ days delinquencies in the past two years
• Inquires in six months (Int): Total number of inquiries in the last six months
• Total received interest (Float): Total interest received to date
• Total received late fee (Float): Total late fees received to date
Data Description
• Recoveries (Float): Post charge-off gross recovery
• Collection recovery fee (Float): Post charge-off collection fee
• Collection 12 months medical (Int): Total collections in the last 12 months, excluding medical collections
• Application type (Object): Indicates whether the representative is an individual or joint
• Last week's pay (Int): Indicates how long (in weeks) a representative has paid EMI after the batch enrolled
• Accounts delinquent (Int): Number of accounts on which the representative is delinquent
• Total collection amount (Int): Total collection amount from all accounts
• Total current balance (Int): Total current balance from all accounts
• Total revolving credit limit (Int): Total revolving credit limit
• Loan status (Int): 1 = defaulter, 0 = non-defaulter (target feature)
Data Understanding
There are 67,463 observations and 35 features in the training dataset.
• Out of 35 features, there are:
o 9 features of datatype float
o 17 features of datatype int
o 9 features of datatype object
• Feature ID is the identifier
• Loan Status is the target feature
Data Understanding: Target
The target variable distribution indicates imbalanced data:
• Non-defaulters: 90.75% (61,222)
• Defaulters: 9.25% (6,241)
Problems with Imbalanced Data
Imbalanced data refers to a situation where the distribution of classes in the dataset is unequal. Some of the common problems are:
• Difficult to detect rare events
• Biased model performance
• Inaccurate evaluation metrics
Techniques to Address Imbalanced Data
• Oversampling
• Undersampling
• Cost-sensitive learning
• Changing the performance metric
• Ensemble learning
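Random oversampling, for instance, can be done with plain pandas by resampling the minority class with replacement; the tiny frame below is hypothetical:

```python
import pandas as pd

# Hypothetical imbalanced frame: 0 = non-defaulter, 1 = defaulter
df = pd.DataFrame({"x": range(20), "loan_status": [0] * 18 + [1] * 2})

minority = df[df["loan_status"] == 1]
majority = df[df["loan_status"] == 0]

# Random oversampling: draw minority rows with replacement until balanced
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled], ignore_index=True)
```

Note that oversampling should be applied only to the training split, never to the test data, to avoid leaking duplicated rows into evaluation.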
Data Exploration: Examples
Univariate analysis [charts: distributions of individual features such as Interest rate and Debit to income]
Data Exploration: Examples
Bivariate analysis [charts]
Data Preparation
Missing values: There are no missing values in the data.
Duplicate values: There are no duplicate values in the data.
Low variance features:
1. Constant features (variance = 0)
2. Quasi-constant features (variance below a small threshold, such as 0.02)
• Feature accounts delinquent has variance = 0
• Collection 12 months medical and accounts delinquent are quasi-constant features.
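A sketch of flagging low-variance features with pandas; the sample values and the 0.21 cutoff below are assumptions chosen so the toy frame mirrors the constant and quasi-constant features named above:

```python
import pandas as pd

# Hypothetical frame with a constant and a quasi-constant column
df = pd.DataFrame({
    "accounts_delinquent": [0, 0, 0, 0, 0],          # constant (variance = 0)
    "collection_12m_medical": [0, 0, 0, 0, 1],       # quasi-constant
    "interest_rate": [11.1, 12.4, 9.8, 14.2, 10.3],  # informative feature
})

variances = df.var()
threshold = 0.21  # assumed cutoff for this sketch
low_variance = variances[variances <= threshold].index.tolist()
reduced = df.drop(columns=low_variance)
```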
Data Preparation
Outliers and anomalies can be detected using:
• Percentile method
• IQR method
• Box plot method
Per box plots, the following features have outliers:
• Funded amount investor
• Interest rate
• Home ownership
• Open account
• Revolving balance
• Total accounts
• Total received interest
• Total received late fee
• Recoveries
• Collection recovery fee
• Total collection amount
• Total current balance
• Total revolving credit limit
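The IQR method mentioned above can be sketched as follows; the values are made up, and the 1.5 × IQR fences are the conventional choice:

```python
import pandas as pd

# Hypothetical "interest rate" values with one extreme point
s = pd.Series([10.5, 11.0, 11.2, 10.8, 11.5, 10.9, 35.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # IQR fences
outliers = s[(s < lower) | (s > upper)]
```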
Hypothesis Generation
Check if the Target variable has a significant correlation with the Input features
Hypothesis Generation
Check if there is any kind of pattern between the Initial list status and the Loan status
Hypothesis Generation
Check if Subgrade is associated with the Loan status
Hypothesis Generation
On similar lines, you can check the effect of the following features on the target feature, that is, loan status:
• Application type
• Collection 12 months medical
• Term
• Employment duration
• Public record
• Inquiries - six months
• Grade
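Associations like these between a categorical feature and loan status are commonly checked with a chi-square test of independence; the counts below are hypothetical, not from the dataset:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table of loan status by grade (illustration only)
table = pd.DataFrame(
    {"non_default": [500, 420, 300], "default": [20, 45, 90]},
    index=["A", "B", "C"],
)

chi2, p_value, dof, expected = chi2_contingency(table)
associated = p_value < 0.05  # reject independence at the 5% level
```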
Outlier Treatment
Once outliers are identified, you need to decide on the appropriate treatment.
• Removal
• Transformation
• Imputation
By using these options, outliers can be treated.
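Transformation can take the form of percentile capping (winsorization); a sketch with assumed values and cutoffs:

```python
import pandas as pd

# Hypothetical skewed feature; cap at the 5th/95th percentiles (transformation)
s = pd.Series([100, 120, 110, 115, 105, 9000])

low, high = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=low, upper=high)  # winsorization-style capping
```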
Feature Encoding
It is the process of converting categorical variables into numerical values that can be used for
analysis or modeling. Techniques for feature encoding are:
• One-hot
• Label
• Ordinal
Binary Encoding
This technique creates binary columns for a categorical variable by using binary numbers. Other feature encoding techniques are:
1. Count
2. Target
3. Hashing
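A sketch of ordinal and one-hot encoding with pandas, using two of the case study's categorical features; the A < B < C grade order and the sample values are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "A", "C"],
                   "initial_list_status": ["W", "F", "F", "W"]})

# Ordinal encoding for Grade (assumed order A < B < C)
grade_order = {"A": 0, "B": 1, "C": 2}
df["grade_ord"] = df["grade"].map(grade_order)

# One-hot encoding for a nominal feature
df = pd.get_dummies(df, columns=["initial_list_status"])
```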
Categorical features and their unique value counts:
• Batch enrolled: 41
• Grade: 7
• Subgrade: 35
• Employment duration: 3
• Verification status: 3
• Payment plan: 1
• Loan title: 109
• Initial list status: 2
• Application type: 2
Data Pre-processing
The feature-wise roadmap for data pre-processing is as follows:
• Batch enrolled: Remove the "BAT" prefix and typecast to int
• Grade: Ordinal
• Subgrade: Ordinal, but too many unique values
• Employment duration: Manually typecast
• Verification status: Manually typecast
• Payment plan: Drop
• Loan title: Too many unique values
• Initial list status: Binary nominal
• Application type: Binary nominal
Model Selection
A number of models are tried and tested before deciding which one gives the best result.
Loan Default Prediction
Decision tree performance: [chart]
Bagging classifier performance: [chart]
Loan Default Prediction
Boosting algorithm performance: [chart]
Logistic regression performance: [chart]
Final Model
• You will use the XGBoost model as it gives the best results.
• The next step would be to fine-tune the model for better precision and recall.
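A sketch of the model-comparison step, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same boosting family; xgboost's XGBClassifier exposes a compatible fit/predict interface) and synthetic data in place of the loan dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic target with a feature interaction that a linear model misses
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Try several models and compare cross-validated accuracy
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in [
        ("logistic", LogisticRegression()),
        ("boosting", GradientBoostingClassifier(random_state=0)),
    ]
}
```

Fine-tuning would then proceed with a hyperparameter search (for example, `GridSearchCV` over tree depth and learning rate) scored on precision and recall.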
Model Deployment
Production machine learning flow:
Business inputs → Data science (data engineering) → Packaging → Pipeline hardening → Model hardening → Deploy → Monitoring
Supporting components: Model security, Model governance, Model catalog, Feature catalog, Data catalog
Model Deployment: Approach
Considerations:
• Modularity
• Reproducibility
• Scalability
• Extensibility
• Testing
• Automation
ML architectures:
• Train by the batch; predict on the fly; serve via REST API
• Train by the batch; predict by the batch; serve through a shared database
• Train and predict by streaming
• Train by the batch; predict on the mobile (or by other clients)
Model Deployment: Comparison
Model Deployment: High-Level Architecture
• Evaluation layer
• Scoring layer
• Feature layer
• Data layer
Monitoring and Maintenance
Monitoring
Production machine learning needs:
• A monitoring mechanism that is model agnostic
• Instrumentation of both the data flowing in and the model performance metrics coming out
• Collection of performance metrics
Key Takeaways
The BALC is a framework that describes the process of using
data and analytics to drive business decisions.
The business understanding phase involves understanding the
business problem or opportunity that needs to be addressed.
The data collected from various sources is summarized and
visualized to understand the key characteristics of a dataset.
Successful model deployment requires planning, testing, and maintenance to meet business needs.