TLDR
In this article, we’ll go over a standard supervised classification task: predicting whether a loan should be approved or not.
Outline
Introduction
Before we begin
How to code
Data Cleaning
Data Visualization
Feature Engineering
Model Training
Conclusion
Introduction
The “Dream Housing Finance” company deals in all home loans and has a presence across urban, semi-urban and rural areas. Customers first apply for a home loan, and the company then validates their eligibility. The company wants to automate the loan eligibility process in real time, based on the customer details provided in the online application form. These details include “Gender”, “Married”, “Education”, “Dependents”, “Income”, “Loan_Amount”, “Credit_History” and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.
Source: Vectorstock.com
Before we begin
Let’s get familiarized with the dataset.
- It consists of the attributes of a loan application, described below.
The different variables present in the dataset are:
Numerical features: Applicant_Income, Coapplicant_Income, Loan_Amount, Loan_Amount_Term and Dependents.
Categorical features: Gender, Credit_History, Self_Employed, Married and Loan_Status.
Alphanumeric features: Loan_Id.
Text features: Education and Property_Area.
As mentioned above we need to predict our target variable which is “Loan_Status”. “Loan_Status” can have two values.
Y (Yes): If the loan is approved.
N (No): If the loan is not approved.
So using the training dataset we’ll train our model and predict our target column “Loan_Status”.
How to code
The company will approve loans for applicants who have a good “Credit_History” and are likely to be able to repay. We’ll load the dataset “Loan.csv” into a dataframe, display the first five rows and check its shape to ensure we have enough data to make our model production ready.
import pandas as pd
# load the dataset and preview the first five rows
df = pd.read_csv("Loan.csv")
df.head()
There are “614” rows and “13” columns, which is enough data to make a production ready model. The input attributes are in numerical and categorical form, and we’ll analyze them to predict our target variable “Loan_Status”. Let’s look at the statistical summary of the numerical variables using the “describe()” function.
df.describe()
The “describe()” output shows a count below “614” for “Loan_Amount”, “Loan_Amount_Term” and “Credit_History”, which means these variables have missing values, so we’ll have to pre-process the data to handle them.
Data Cleaning
Data cleaning is the process of identifying and correcting errors in the dataset that may negatively impact our predictive model. As an initial step, we’ll find the “null” values in every column.
# find the null values
df.isnull().sum()
We observe “13” missing values in “Gender”, “3” in “Married”, “15” in “Dependents”, “32” in “Self_Employed”, “22” in “Loan_Amount”, “14” in “Loan_Amount_Term” and “50” in “Credit_History”. We treat these values as “missing at random (MAR)”, i.e. the gaps appear only in some observations rather than across the whole dataset. So we fill the missing values of the numerical features with the “mean” and those of the categorical features with the “mode”, i.e. the most frequently occurring value. We use the Pandas “fillna()” function for this imputation, which leaves each column’s “mean” or “mode” unchanged.
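The article’s own imputation code isn’t shown, so here is a minimal sketch of the step described above. It uses a small made-up dataframe in place of “Loan.csv”: the column names follow the article, while the values and the names “toy”, “numerical_cols” and “categorical_cols” are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (values are invented for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Male"],
    "Self_Employed": ["No", np.nan, "No", "Yes", "No"],
    "Loan_Amount": [120.0, np.nan, 66.0, 141.0, 95.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0, 360.0],
})

numerical_cols = ["Loan_Amount", "Loan_Amount_Term"]
categorical_cols = ["Gender", "Self_Employed"]

# Numerical features: fill missing entries with the column mean.
for col in numerical_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

# Categorical features: fill missing entries with the most frequent value.
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Every count should now be zero.
print(toy.isnull().sum())
```

On the real data the same two loops would run over “df”, with the column lists extended to every feature that has missing values.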
Let’s check the “null” values again to ensure that no missing values remain, as they would lead us to incorrect results.
df.isnull().sum()
From the above output, we see that there are no values missing and now we can perform the data visualization.
Data Visualization
To gain a few insights about the data we visualize the categorical data before training the model.
Categorical Data
- Categorical data groups information with similar characteristics and is represented by discrete labelled groups, e.g. gender, blood type, country affiliation.
Now let’s visualize the numerical features.
Numerical Data
- Numerical data expresses information in the form of numbers, e.g. height, weight, age.
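The plots themselves aren’t reproduced here. As a rough sketch, the numbers behind a count plot or a histogram can be tabulated with pandas alone. The toy values are invented, and the commented-out plotting calls assume matplotlib is installed.

```python
import pandas as pd

# Toy stand-in for two columns of the loan data (values are invented).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Applicant_Income": [5849, 4583, 3000, 2583, 6000, 39999],
})

# Categorical feature: rows per group, which is what a count plot displays.
gender_counts = toy["Gender"].value_counts()
print(gender_counts)

# Numerical feature: summary statistics and skewness, which a histogram hints at.
print(toy["Applicant_Income"].describe())
income_skew = toy["Applicant_Income"].skew()
print("skewness:", income_skew)

# With matplotlib installed, the same data can be drawn directly from pandas:
# gender_counts.plot(kind="bar")
# toy["Applicant_Income"].plot(kind="hist", bins=20)
```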
Feature Engineering
To create a new attribute named “Total_Income”, we’ll add the two columns “Applicant_Income” and “Coapplicant_Income”, as we assume that the co-applicant is a person from the same family, e.g. spouse, father etc. We’ll then display the first five rows to see “Total_Income”.
# total income
df['Total_Income'] = df['Applicant_Income'] + df['Coapplicant_Income']
df.head()
“Total_Income” is now the last column of our dataframe.
We see that most values fall in the range “0-10,000” and the data is right skewed, with a few extreme values; it’s possible that some people applied for high loans due to specific needs. Very few applicants are in the range of “40,000-80,000”. So we’ll apply a log transformation to “Total_Income” to bring its distribution closer to normal.
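A minimal sketch of the log transformation, on invented values. The column name “Total_Income_Log” matches the graph caption in the article; using the natural log via “np.log” is an assumption, since the article’s own code for this step isn’t shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the income columns (values are invented).
toy = pd.DataFrame({
    "Applicant_Income": [5849, 4583, 3000, 2583, 39999],
    "Coapplicant_Income": [0.0, 1508.0, 0.0, 2358.0, 0.0],
})

toy["Total_Income"] = toy["Applicant_Income"] + toy["Coapplicant_Income"]

# The natural log compresses the long tail of large incomes.
# This assumes Total_Income > 0 in every row; np.log1p is a common
# alternative when zeros can occur.
toy["Total_Income_Log"] = np.log(toy["Total_Income"])

print(toy[["Total_Income", "Total_Income_Log"]])
```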
Below is the graph for “Total_Income_Log”.
Data Cleaning
As a part of the data cleaning process, let’s drop the columns not affecting “Loan_Status”, as this helps improve the accuracy of the model, and display the first five rows of the dataframe.
# drop unnecessary columns
cols = ['Applicant_Income', 'Coapplicant_Income', 'Loan_Amount', 'Loan_Amount_Term', 'Total_Income', 'Loan_ID', 'Coapplicant_Income_log']
df = df.drop(columns=cols)
df.head()
Category Value Mapping
By using category value mapping we’ll convert the categorical features to numerical features and display the first five rows of the dataframe.
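The mapping code isn’t shown in the article, so the following is only a sketch of one way to do category value mapping with pandas. The dictionary “mappings” and the integer codes in it are hypothetical choices; the “1 (Yes)” / “0 (No)” coding of “Loan_Status” follows the article.

```python
import pandas as pd

# Toy stand-in for a few categorical columns (values are invented).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Hypothetical mappings chosen for illustration; any consistent integer coding works.
mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "Married": {"No": 0, "Yes": 1},
    "Property_Area": {"Rural": 0, "Semiurban": 1, "Urban": 2},
    "Loan_Status": {"N": 0, "Y": 1},
}

for col, mapping in mappings.items():
    # Series.map replaces each label with its code; labels missing from the
    # mapping would become NaN, so the mapping should cover every category.
    toy[col] = toy[col].map(mapping)

print(toy)
```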
Normalizing Imbalanced Data
Before we start training the model we have to balance the imbalanced data. Imbalanced data are instances where the number of observations is not the same for all the classes in a classification dataset.
In our dataset, the target variable “Loan_Status” is highly imbalanced, which may result in biased output. So we’ll balance the data by “undersampling”, a technique that randomly selects examples from the majority class and deletes them from the training dataset.
The data imbalance for “Loan_Status” is seen in the above graph, with “68%” representing “1 (Yes)” and “31%” representing “0 (No)”. We undersample the majority class, “1 (Yes)”, by randomly deleting samples from it in the training data. The aim is to reduce the number of samples in the majority class so that it matches the number of samples in the minority class.
We selected indices of the majority class at random using the “np.random.choice” function, specifying the number of samples in the “minority_class” as the number required, and stored them in “random_majority_indices”. Then we concatenated the indices of the “minority_class” and “random_majority_indices” and stored the output in “under_sample_indices”. To balance the data we selected the rows listed in “under_sample_indices” and saved them in a new dataframe “under_sample”. The data is now balanced, as shown in the graph below, and is ready for training the model.
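The undersampling steps described above can be sketched as follows, on a small invented dataframe. The names “random_majority_indices”, “under_sample_indices” and “under_sample” come from the article; “minority_indices”, “majority_indices”, the seed and “replace=False” are assumptions.

```python
import numpy as np
import pandas as pd

# Toy imbalanced data: 7 approved (1) and 3 not approved (0); values are invented.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 0, 1, 1, 1, 1],
    "Loan_Status":    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
})

minority_indices = toy[toy["Loan_Status"] == 0].index.to_numpy()
majority_indices = toy[toy["Loan_Status"] == 1].index.to_numpy()

# Randomly keep as many majority rows as there are minority rows.
# replace=False avoids picking the same row twice; the seed only makes
# the sketch repeatable.
np.random.seed(42)
random_majority_indices = np.random.choice(
    majority_indices, size=len(minority_indices), replace=False
)

# Combine both sets of row labels and select those rows.
under_sample_indices = np.concatenate([minority_indices, random_majority_indices])
under_sample = toy.loc[under_sample_indices]

print(under_sample["Loan_Status"].value_counts())
```

Undersampling discards majority-class rows, so on a dataset of only a few hundred rows it trades some training data for balanced classes.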
Model Training
Now it’s time to train the model! For this, we’ll split the data, keeping “33%” for testing and the remainder for training. We’ll also perform “cross-validation” to get a more reliable estimate of the model’s performance and check the accuracy of each model in percent.
We’ll train the model using “Logistic Regression” and check its accuracy. “Logistic Regression” is a popular classification algorithm used to predict a binary outcome, i.e. “Yes/No”.
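A minimal sketch of the split, training and cross-validation steps with scikit-learn. The data here is synthetic, not the loan dataset, and the 5-fold setting and the random seeds are assumptions; only the 33% test share comes from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the balanced data: two numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 33% of the rows for testing and train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: {:.1f}%".format(test_accuracy * 100))

# 5-fold cross-validation gives an accuracy estimate that depends less
# on one particular train/test split.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Cross-validation accuracy: {:.1f}%".format(cv_scores.mean() * 100))
```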
After implementing the machine learning algorithm, the accuracy obtained by “LogisticRegression” is 68%. Let’s plot the confusion matrix for the test data and get a summary of the predicted results. To learn more about the confusion matrix you can refer to our lesson.
From the above confusion matrix, we see that the model correctly predicted “120” cases of “0 (No)” and “128” cases of “1 (Yes)”.
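As a small illustration of how such counts are read off a confusion matrix, here is a sketch using scikit-learn on invented labels. The numbers below are not the article’s results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration: 1 = approved (Yes), 0 = not approved (No).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Correct 'No' predictions:", tn)
print("Correct 'Yes' predictions:", tp)
print("Accuracy:", accuracy_score(y_true, y_pred))
```

The diagonal of the matrix holds the correct predictions for each class, and accuracy is their sum divided by the total number of samples.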
Conclusion
Mage provides us with a magical low code solution that, after data cleaning, takes very little effort and only a few clicks. Let’s train the model on Mage and check its accuracy.
Fill-in missing values using Mage
We fill in the missing values of categorical features with the “mode” and of numerical features with the “mean” of the column on Mage.
Column creation using Mage
We perform feature engineering by adding “Applicant_Income” and “Coapplicant_Income” together and storing the result in a new column “Total_Income”.
Removing Columns Using Mage
Let’s remove the unwanted columns that don’t affect the target variable “Loan_Status” before training the model, to get better accuracy.
Model Training using Mage
After training the model on Mage the accuracy is 85% with average performance. The features which most influence the predictions are “Credit_History” and the “Semiurban” value of “Property_Area”. Other features also carry some weight in the predictions.
Retraining Model Using Mage
We can also retrain the model after removing some features, creating a new version and comparing the two versions to understand the improvement.
Now on retraining the model the accuracy comes to 86% with average performance. The confusion matrix on Mage shows “85” correct predictions for “1 (Yes)” and “20” correct predictions for “0 (No)”.
Predicting Output using Mage
Let’s check how accurately our Mage model predicts the output of loan approval based on “Credit_History”.
We observe that our trained model on Mage correctly predicts the target variable for loan approval based on the “Credit_History” of the applicants.
Model Training by Supreme using Mage
We can train the model using “supreme” training sessions to improve the accuracy which gives us more reliable production-ready predictions.
After training the model with a “supreme” session the accuracy achieved was 81% with average performance, but the precision increased to 89.62% with excellent performance. So now we’ve automated the process of loan approval for the “Dream Housing Finance” company and provided a low code solution with Mage to predict loan approval status using “Credit_History” and “Property_Area”. Go ahead and try your model on Mage.
Want to see the code? Check the “loan prediction analysis” code in the
.
Want to learn more about machine learning (ML)? Visit
! ✨🔮