Kailash BusinessReport
MODELLING
PROBLEM 1: LINEAR REGRESSION
As a budding data scientist, you decided to build a linear model to predict 'usr' (the portion of time (%) that CPUs run in user mode) and to find out how each attribute affects the time the system spends in 'usr' mode, using a list of system attributes.
SUMMARY
The given dataset contains data collected from a Sun SPARCstation 20/712 with 128 MB of memory, running in a multi-user university department.
A model needs to be built to predict 'usr' and to check how each attribute affects the time the system spends in 'usr' mode, using a list of system attributes.
Dataset sample

   lread  lwrite  scall  sread  swrite  fork  exec    rchar    wchar  pgout  ...  pgscan  atch  pgin  ppgin    pflt    vflt         runqsz  freemem  freeswap  usr
0      1       0   2147     79      68   0.2   0.2  40671.0  53995.0    0.0  ...     0.0   0.0   1.6    2.6   16.00   26.40      CPU_Bound     4670   1730946   95
1      0       0    170     18      21   0.2   0.2    448.0   8385.0    0.0  ...     0.0   0.0   0.0    0.0   15.63   16.83  Not_CPU_Bound     7278   1869002   97
2     15       3   2162    159     119   2.0   2.4      NaN  31950.0    0.0  ...     0.0   1.2   6.0    9.4  150.20  220.20  Not_CPU_Bound      702   1021237   87
3      0       0    160     12      16   0.2   0.2      NaN   8670.0    0.0  ...     0.0   0.0   0.2    0.2   15.60   16.80  Not_CPU_Bound     7248   1863704   98
4      5       1    330     39      38   0.4   0.4      NaN  12185.0    0.0  ...     0.0   0.0   1.0    1.2   37.80   47.60  Not_CPU_Bound      633   1760253   90

5 rows × 22 columns
lread 0
lwrite 0
scall 0
sread 0
swrite 0
fork 0
exec 0
rchar 104
wchar 15
pgout 0
ppgout 0
pgfree 0
pgscan 0
atch 0
pgin 0
ppgin 0
pflt 0
vflt 0
runqsz 0
freemem 0
freeswap 0
usr 0
dtype: int64
The dataset has no duplicated columns and no duplicate rows. It also has no null values, except in the 'rchar' and 'wchar' columns.
Let us use a for loop to treat these null values by replacing them with the column medians.
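The for-loop imputation described above can be sketched as follows; the two columns and their values here are illustrative stand-ins for the real dataset:

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing entries, standing in for 'rchar'/'wchar'.
df = pd.DataFrame({
    "rchar": [40671.0, 448.0, np.nan, np.nan, 12185.0],
    "wchar": [53995.0, 8385.0, 31950.0, np.nan, 12185.0],
})

# Loop over every column and replace missing entries with that
# column's median, as described in the text.
for col in df.columns:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum())
```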
lread 0
lwrite 0
scall 0
sread 0
swrite 0
fork 0
exec 0
rchar 0
wchar 0
pgout 0
ppgout 0
pgfree 0
pgscan 0
atch 0
pgin 0
ppgin 0
pflt 0
vflt 0
runqsz 0
freemem 0
freeswap 0
usr 0
After the treatment, the null values in the dataset are cleared without disturbing the rest of the data. This matters because linear regression is sensitive to null values.
ENCODING:
A linear regression model requires only numerical values, but the dataset has one object variable, which we can encode as a numerical variable.
The dataset has one column, 'runqsz', of object dtype. We convert it to numeric using label encoding, replacing 'CPU_Bound' with 1 and 'Not_CPU_Bound' with 2.
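A minimal sketch of this label encoding step, using the mapping stated above (the tiny dataframe is illustrative):

```python
import pandas as pd

# Toy column standing in for the 'runqsz' object variable.
df = pd.DataFrame({"runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound"]})

# Map the two categories to the numeric labels used in the report.
df["runqsz"] = df["runqsz"].replace({"CPU_Bound": 1, "Not_CPU_Bound": 2})

print(df["runqsz"].tolist())
```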
OUTLIERS:
Every column has outliers. Linear regression is sensitive to outliers, but in my opinion outlier treatment is not appropriate here, because each record is a unique, genuine system measurement.
Treating the outliers would alter the original values and could lead to wrong predictions, so we proceed with the outliers in place.
In every column the value '0' plays an important role, as it creates a large spread in the range of the data.
If we treated the zeros, the data itself would change (as with null values), since the real measurements may legitimately be 0, so we keep them as well.
PAIRPLOT
A pairplot shows the relationship between variables as scatterplots and the distribution of each variable as a histogram.
Because the dataset contains a large number of columns, the pairplot looks a little messy.
From the plot we can see that some pairs of columns have a positive correlation, some have no correlation, and some have a negative correlation.
Now let us split the data and build a model.
A sample of the X_test data is as follows:
const lread lwrite scall sread swrite fork exec rchar \
3894 1.0 27 39 1252 53 118 0.2 0.2 26592.0
4276 1.0 1 0 996 85 55 0.4 0.4 16667.0
3414 1.0 9 7 1530 247 135 0.4 0.4 14513.0
4165 1.0 32 4 3243 182 140 5.2 5.6 337517.0
7385 1.0 16 3 5017 259 249 2.8 1.4 73537.0
[5 rows x 22 columns]
With the train and test data split, we can proceed to create the linear model. To build the OLS model we use the OLS function from the statsmodels api package and fit it with x_train and y_train.
The summary of the linear regression is as follows:
                            OLS Regression Results
==============================================================================
Dep. Variable:                    usr   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                     489.6
Date:                Mon, 05 Dec 2022   Prob (F-statistic):               0.00
Time:                        16:53:24   Log-Likelihood:                -21788.
No. Observations:                5734   AIC:                         4.362e+04
Df Residuals:                    5712   BIC:                         4.377e+04
Df Model:                          21
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  44.6380      0.746     59.831      0.000      43.175      46.101
lread                  -0.0199      0.003     -6.214      0.000      -0.026      -0.014
lwrite                  0.0048      0.006      0.795      0.427      -0.007       0.017
scall                   0.0010      0.000      7.451      0.000       0.001       0.001
sread                  -0.0005      0.002     -0.257      0.797      -0.004       0.003
swrite                 -0.0020      0.002     -1.018      0.309      -0.006       0.002
fork                   -1.7222      0.244     -7.052      0.000      -2.201      -1.244
exec                   -0.0896      0.048     -1.879      0.060      -0.183       0.004
rchar               -4.062e-06   8.29e-07     -4.898      0.000   -5.69e-06   -2.44e-06
wchar               -1.164e-05   1.28e-06     -9.118      0.000   -1.41e-05   -9.14e-06
pgout                  -0.1739      0.064     -2.717      0.007      -0.299      -0.048
ppgout                  0.0989      0.037      2.701      0.007       0.027       0.171
pgfree                 -0.0703      0.020     -3.508      0.000      -0.110      -0.031
pgscan                  0.0086      0.006      1.362      0.173      -0.004       0.021
atch                   -0.0786      0.027     -2.949      0.003      -0.131      -0.026
pgin                    0.0913      0.029      3.103      0.002       0.034       0.149
ppgin                  -0.0594      0.019     -3.128      0.002      -0.097      -0.022
pflt                   -0.0415      0.004     -9.697      0.000      -0.050      -0.033
vflt                    0.0223      0.003      6.665      0.000       0.016       0.029
freemem                -0.0016   7.53e-05    -21.489      0.000      -0.002      -0.001
freeswap             3.219e-05   4.54e-07     70.985      0.000    3.13e-05    3.31e-05
runqsz_Not_CPU_Bound    7.7908      0.303     25.693      0.000       7.196       8.385
==============================================================================
Omnibus:                     1507.319   Durbin-Watson:                   2.057
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4768.238
Skew:                          -1.333   Prob(JB):                         0.00
Kurtosis:                       6.585   Cond. No.                     7.48e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.48e+06. This might indicate that there are strong multicollinearity or other numerical problems.
The R-squared value tells us that the model explains 64.3% of the variance in the training set, and the adjusted R-squared is nearly the same, at 64.2%.
Let us build another model.
As we can see, the newly generated model is the same as the previous model: the R-squared and adjusted R-squared are unchanged.
From these data we can see there is a strong correlation between y_test and the y predictions.
CONCLUSION:
When the number of page faults increases, 'usr' also increases, by about 0.02% per unit of 'vflt'; most of the other coefficients are negative.
There are many negative coefficients in the linear equation: except for 'vflt', 'runqsz' and a few other small positive terms, the attributes decrease 'usr' as they increase.
Overall, the model is not good enough to predict future data, as it depends heavily on the outliers.
Even keeping the '0' values as data, the linear regression model remains sensitive to outliers; if we tried to remove the zeros, the information in the data would change.
PROBLEM 2: LOGISTIC REGRESSION, LDA AND CART
You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1473 females collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice, based on their demographic and socio-economic characteristics.
SUMMARY
The given dataset contains data on 1473 females collected from a Contraceptive Prevalence Survey; the respondents were married women who were either not pregnant or did not know whether they were at the time of the survey.
The model needs to predict whether or not they use a contraceptive method of choice based on their demographic and socio-economic characteristics.
We import the required libraries for logistic regression, LDA and CART, and read the Excel file, which contains the survey data described above.
Dataset sample
Using describe(include='all') we see NaN values in the summary, mostly because only 3 variables are numeric.
EDA
In the given dataset there are 80 rows flagged as duplicates. These may still be valid records, e.g., different respondents who happen to share the same qualifications, so similar rows appear in the dataset. The displayed duplicated rows represent information from different respondents, since at least one variable differs in each.
In Python, rows with identical values are treated as duplicates even when they are valid information, so we do not need to treat or drop them; they coincide only in husband education, wife religion, standard of living and media exposure.
Checking the null values, we find that 'Wife_age' and 'No_of_children_born' contain nulls.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Wife_age 1402 non-null float64
1 Wife_ education 1473 non-null object
2 Husband_education 1473 non-null object
3 No_of_children_born 1452 non-null float64
4 Wife_religion 1473 non-null object
5 Wife_Working 1473 non-null object
6 Husband_Occupation 1473 non-null int64
7 Standard_of_living_index 1473 non-null object
8 Media_exposure 1473 non-null object
9 Contraceptive_method_used 1473 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 115.2+ KB
Now let us treat the null values only for the wife's age: at the time of the survey it is not certain whether the married women were pregnant, so the number of children born cannot be reliably imputed. The wife's age, however, can be imputed using the mean.
We change the null values of 'Wife_age' to the column mean using a for loop, and we drop the rows where 'No_of_children_born' is unknown, as they are few.
We then get a dataset with 0 null values; since some rows were dropped, 1452 rows remain.
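The two treatments above (mean imputation for 'Wife_age', dropping rows with missing 'No_of_children_born') can be sketched as follows on a toy dataframe:

```python
import numpy as np
import pandas as pd

# Toy dataframe with the two columns that contain nulls in the survey.
df = pd.DataFrame({
    "Wife_age": [24.0, np.nan, 43.0, 32.0],
    "No_of_children_born": [3.0, 1.0, np.nan, 2.0],
})

# Fill 'Wife_age' with its mean, then drop rows where the number of
# children born is unknown.
df["Wife_age"] = df["Wife_age"].fillna(df["Wife_age"].mean())
df = df.dropna(subset=["No_of_children_born"])

print(df.shape)
```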
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1452 entries, 0 to 1472
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Wife_age 1452 non-null float64
1 Wife_ education 1452 non-null object
2 Husband_education 1452 non-null object
3 No_of_children_born 1452 non-null float64
4 Wife_religion 1452 non-null object
5 Wife_Working 1452 non-null object
6 Husband_Occupation 1452 non-null int64
7 Standard_of_living_index 1452 non-null object
8 Media_exposure 1452 non-null object
9 Contraceptive_method_used 1452 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 124.8+ KB
Before we move on to the plots, we convert the object variables to numeric labels using their unique category codes, because the models built with logistic regression, LDA and CART will not work on object variables.
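One way to do this conversion, assuming pandas category codes are used for the "unique codes" mentioned above (the single column shown is illustrative):

```python
import pandas as pd

# Toy column reusing one of the report's column names (note the space
# in 'Wife_ education' is as it appears in the dataset).
df = pd.DataFrame({"Wife_ education": ["Uneducated", "Primary", "Secondary", "Tertiary"]})

# Replace every object column with its integer category codes.
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

print(df["Wife_ education"].tolist())
```

Categories are coded in alphabetical order here, so the mapping is an artefact of the method rather than an ordinal ranking.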
As the plots show, some variables have outliers, but we are not going to treat them, as this information may help the model. For example:
Wife_ education
Tertiary 570
Secondary 405
Primary 327
Uneducated 150
Name: Wife_ education, dtype: int64
Husband_education
Tertiary 889
Secondary 346
Primary 173
Uneducated 44
Name: Husband_education, dtype: int64
Wife_religion
Scientology 1235
Non-Scientology 217
Name: Wife_religion, dtype: int64
Wife_Working
No 1089
Yes 363
Name: Wife_Working, dtype: int64
The pairplot shows the relationships between all pairs of variables. From these variables we need to predict whether the women use the contraceptive method or not.
Logistic Regression
Train and Test Split:
Let us create the x and y data with 'Contraceptive_method_used' as the target variable: x holds every column except the target, and y holds only the target.
Before we proceed, we import (or verify) the required libraries. In this encoding, 'Contraceptive_method_used' is 1 for Yes and 0 for No.
As we have already label-encoded the object variables, there is no need to use LabelEncoder from the sklearn library; the encoding serves to create the dummy variables.
Now the train set and test set have been split using sklearn, and we fit the data with the logistic regression method to create a logistic model.
The proportion of 1s and 0s (i.e., Contraceptive_method_used Yes/No) is as follows:
1 0.566804
0 0.433196
We fit the logistic regression model using 'newton-cg' as the solver and 1000 as max_iter (the maximum number of iterations), and obtain the predicted probability dataframe of the model:
          0         1
0  0.363066  0.636934
1  0.342266  0.657734
2  0.471333  0.528667
3  0.338485  0.661515
4  0.309265  0.690735
In the above dataframe we can see that class '1' gets the higher probability (up to 69%), and the model accuracy is 66.73% (0.6673).
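The fit described above can be sketched as follows; the data is synthetic, so the accuracy printed will not be the report's 66.73%:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the survey features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Solver and iteration limit as stated in the report.
model = LogisticRegression(solver="newton-cg", max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns one column per class, as in the dataframe above.
probs = model.predict_proba(X_test)
print(model.score(X_test, y_test))
```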
In the ROC curve, a model whose plot falls below the dotted diagonal is considered worse than random. Although the curve here is not perfect, it is acceptable; the AUC (area under the curve) for the training data is 67.10%.
Comparing the training AUC with the test AUC, the two curves are similar with only small variations, and the AUC is the same, 67.10%. Let us move to the confusion matrix.
Checking the confusion matrix of the training data, we get 496 true positives and 182 true negatives (sklearn lays the matrix out as [[TN, FP], [FN, TP]]).
array([[182, 247],
[ 91, 496]], dtype=int64)
This plot shows the relationship between the true labels and the predicted labels as 0s and 1s.
The classification report (precision, recall, f1-score, support) is as follows:
Precision (67%) – 67% of the married women predicted as not using a contraceptive method are actually not using one.
Recall (42%) – Of all the married women not using a contraceptive method, 42% have been predicted correctly.
Precision (67%) – 67% of the married women predicted as using a contraceptive method are actually using one.
Recall (84%) – Of all the married women actually using a contraceptive method, 84% have been predicted correctly.
The accuracy is 67%, which is above the 50% chance level, so the model is good.
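The quoted precision, recall and accuracy figures are the kind produced by sklearn's confusion matrix and classification report; a minimal sketch with made-up label vectors:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Tiny illustrative labels, not the survey predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, f1-score and support in one table.
print(classification_report(y_true, y_pred))
```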
Confusion matrix for test data:
Checking the confusion matrix of the test data, we get 75 true positives and 208 true negatives.
Precision (73%) – 73% of the married women predicted as not using a contraceptive method are actually not using one.
Recall (38%) – Of all the married women not using a contraceptive method, 38% have been predicted correctly.
Precision (62%) – 62% of the married women predicted as using a contraceptive method are actually using one.
Recall (88%) – Of all the married women actually using a contraceptive method, 88% have been predicted correctly.
The accuracy is 65%, which is more than 50%, so the model performs on the test data about as well as on the training data.
Grid search:
We use GridSearchCV from sklearn to search for the best model. The process is otherwise the same as above.
For the test data, as with the previous method, we get similar values: 76 true positives and 209 true negatives, and the accuracy is again 65%, not much different from the previous method.
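A sketch of the grid search step; the parameter grid below is an assumption, since the report does not list the exact grid used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a clear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Cross-validated search over a hypothetical grid of C and solver values.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "solver": ["newton-cg", "lbfgs"]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)
```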
Conclusion:
The overall accuracy of the model is 67%: 67% of all predictions are correct.
Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This indicates that no overfitting or underfitting has happened; overall, the model is a good classifier.
LDA
Train and Test Split:
The procedure for splitting the train and test data is the same as for logistic regression above.
We import LDA (LinearDiscriminantAnalysis) from the sklearn library, and the results are as follows.
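The LDA fit can be sketched as follows on synthetic two-class data standing in for the survey:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two Gaussian clusters with different means, one per class.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Fit the linear discriminant and score it on the same data.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print(lda.score(X, y))
```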
Classification report of the training data:
There is a slight difference between the training and test reports, but that is acceptable: the training accuracy is 67% and the test accuracy is 65%.
The model accuracy on both the training and the test set is about 67%, which is roughly the proportion of the majority class in the dataset, so the model is affected by a class imbalance problem. Since we only have 1473 observations, rebuilding the same LDA model with more data points could yield an even better model.
Choosing a cut-off, we see that 0.4 and 0.5 give better accuracy than the rest of the custom cut-off values, but 0.4 gives the best f1-score. Here we take the cut-off as 0.4 to get the optimum f1 score.
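Scanning custom cut-offs for the best f1-score can be sketched like this; the probabilities are made up for illustration, chosen so that 0.4 wins as in the report:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
probs = np.array([0.2, 0.35, 0.45, 0.6, 0.55, 0.3, 0.8, 0.5, 0.65, 0.41])

# Threshold the probabilities at each candidate cut-off and keep the
# (cut-off, f1) pair with the highest f1-score.
best = max(
    ((cut, f1_score(y_true, (probs >= cut).astype(int))) for cut in [0.2, 0.3, 0.4, 0.5, 0.6]),
    key=lambda t: t[1],
)

print(best)
```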
CART
In CART we can use the dataset with its outliers, as CART is not sensitive to outliers.
Train and Test Split:
As with logistic regression and LDA above, the train and test data need to be split, after importing the necessary libraries.
In CART, the decision tree is the most important element.
Decision tree:
We fit the training data into a decision tree and export its Graphviz code to a new Word document saved in the project folder.
We can then paste that code into http://webgraphviz.com/ (deleting the sample code already there) to view the decision tree.
The tree is a little messy because the data contains a great deal of information, so we reduce the maximum number of leaf nodes, the maximum depth of the tree and the minimum sample size.
Here 'gini', the impurity criterion of the decision tree classifier, plays the important role. We create a new export with the branches reduced to 30, the leaf size to 10 and the depth to 7, and save the document in the project folder.
Now the decision tree looks better than before.
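A sketch of the pruned tree; mapping "branches 30, leaf 10, depth 7" onto sklearn's parameters as below is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic two-class data standing in for the survey features.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(
    criterion="gini",       # the 'gini' impurity measure named above
    max_depth=7,            # "depth is 7"
    min_samples_leaf=10,    # "leaf is 10"
    min_samples_split=30,   # assumed reading of "branches as 30"
    random_state=0,
)
tree.fit(X, y)

# export_graphviz returns the DOT text that can be pasted into webgraphviz.com.
dot = export_graphviz(tree, out_file=None)
print(tree.get_depth())
```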
Now let us check the feature importance. Feature importance refers to techniques that assign a score to each input feature based on how useful it is at predicting the target variable.
Imp
Wife_age 0.408296
No_of_children_born 0.366101
Media_exposure 0.075275
Wife_ education 0.073313
Husband_education 0.053250
Husband_Occupation 0.010100
Standard_of_living_index 0.008617
Wife_Working 0.005049
Wife_religion 0.000000
As we can see, 'Wife_age' carries the most importance, so we can infer that use of a contraceptive method depends strongly on the woman's age.
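The importance table above is obtained from the fitted tree's feature_importances_ attribute; a sketch on synthetic data reusing a few of the column names:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends on age and children, while
# 'Wife_religion' is pure noise.
rng = np.random.default_rng(5)
X = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["Wife_age", "No_of_children_born", "Wife_religion"],
)
y = (X["Wife_age"] + 0.5 * X["No_of_children_born"] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Collect the importances into a sorted 'Imp' table like the one above.
imp = pd.DataFrame(tree.feature_importances_, index=X.columns, columns=["Imp"])
imp = imp.sort_values("Imp", ascending=False)
print(imp)
```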
AUC PLOT
As the AUC curve bends high, the model is good; the AUC value for the training data is 83.9%.
The test curve is not quite as smooth, but it keeps the same overall shape, and its AUC value for the test data is 72.9%.
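The AUC values come from the ROC computation; a minimal sketch with illustrative scores (not the report's 83.9%/72.9%):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted scores for class 1.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

# roc_curve gives the points of the plot; roc_auc_score gives the area.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(auc)
```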
Let us move to the confusion matrix.
FOR TRAIN DATA,
array([[282, 159],
[ 71, 504]], dtype=int64)
Checking the confusion matrix of the training data, we get 504 true positives and 282 true negatives.
Precision (80%) – 80% of the married women predicted as not using a contraceptive method are actually not using one.
Recall (64%) – Of all the married women not using a contraceptive method, 64% have been predicted correctly.
Precision (76%) – 76% of the married women predicted as using a contraceptive method are actually using one.
Recall (88%) – Of all the married women actually using a contraceptive method, 88% have been predicted correctly.
The accuracy is 77.3%, which is well above 50%, so this model is also good (better than the logistic regression and the LDA).
Checking the confusion matrix of the test data, we get 106 true positives and 198 true negatives.
Precision (71%) – 71% of the married women predicted as using a contraceptive method are actually using one.
Recall (80%) – Of all the married women actually using a contraceptive method, 80% have been predicted correctly.
The accuracy is 69.7%, which is well above 50%, so on the test data this model is also good (better than the logistic regression and the LDA).
CONCLUSION
From the models above, every model predicts the encoded label '1' (contraceptive method used) more often, and the accuracy and f1-score of the models also favour label '1'.
We cannot conclude with certainty whether any individual uses a contraceptive method, but the models predict that most of the married women in this data do use one, and the final predictions show the same.