Introduction To Data Science Lab Manual
Introduction To Data Science Lab Manual
Introduction To Data Science Lab Manual
Lab Manual
Name
SRN
Branch
Semester
Section
Academic Year
Introduction to Data Science Lab REVA University
Index
Sl.No. Particulars Page no.
1 Continuous Assessment Form 3
2 Semester End Examination Practical evaluation 5
procedure 2020-21
3 Course Description 6
4 Course Objectives 6
5 Course Outcomes 6
6 List of experiments 7
7 Part-A- Experiments 10
10 Appendix-Installation Guide 70
2
Introduction to Data Science Lab REVA University
CONTINUOUS ASSESMENT
PART-A
3 a.
5.
10
11
12
Total Marks 10
3
Introduction to Data Science Lab REVA University
PART-B
Total Marks 10
4
Introduction to Data Science Lab REVA University
5
Introduction to Data Science Lab REVA University
COURSE Description
Data Science is an interdisciplinary, problem-solving oriented subject that is used to apply scientific
techniques to practical problems. The course orients on preparation of datasets and programming of data
analysis tasks. This course covers the topics: ML algorithms, SQL and demonstration of experiments by
using MS-Excel and MySQL
COURSEOUTCOMES(COs)
After the completion of the course, the student will be able to:
CO1 Make use of the concepts of Data Science in developing the real 1, 2, 4,10 1,2,3
world applications.
CO2 Apply the SQL commands in developing the real-world applications. 1,2, 3,9,10 2, 3
CO3 Build the solutions for real world problems, perform analysis, 2,3, 4, 8,9, 10 1, 2, 3
interpretation and reporting of data using regression alogorithms.
6
Introduction to Data Science Lab REVA University
TABLE OF CONTENTS
25
Managers model costs in order to make predictions. The cost data in the
data file INDIRECT COSTS AND MACHINE HOURS show the
indirect manufacturing costs of an ice-skate manufacturer. Indirect
manufacturing costs include maintenance costs and setup costs. Indirect
manufacturing costs depend on the number of hours the machines are
used, called machine hours. Based on the data for January to December, Perform
perform the following operations. predictio
n and
3 MS Excel
i) Plot a scatter diagram. visualiza
ii) Determine the regression equation. tion of
iii) Plot the regression line (hint: use MS Excel's Add Trendline data
feature).
iv) Compute the predicted indirect manufacturing costs for 300
machine hours and for 430 machine hours.
v) Compute the coefficient of determination and the coefficient of
correlation
7
Introduction to Data Science Lab REVA University
32
Apply multiple linear regression to predict the stock index price which is
a dependent variable of a fictitious economy based on two independent /
input variablesinterest rateand unemployment rate.
Perform
predictio
n and
4 MS Excel
visualiza
interest unemployment stock index tion of
year month
rate rate price data
36
Calculate the total interest paid on a car loan which has been availed from
HDFC bank. For example, Rs.10,00,000 has been borrowed from a bank
with annual interest rate of 5.2% and the customer needs to pay every
month as shown in table below. Calculate the total interest rate paid for a
Sl No. A B
Create
1 Principal Rs.10,00,000 Excel
2 Annual interest rate 5.20% data and
5. MS Excel
3 Year of the loan 3 perform
EMI
4 Starting payment number 1 estimator
8
Introduction to Data Science Lab REVA University
Use the linear regression technique to compare the age of humans with 50
Apply
the amount of sleep in hours.
Linear
9 Name Age in Years Sleep in hours MS Excel
regressio
Create your own database with above details.
n
Apply the linear regression, compare the average salaries of batsman Apply 54
10 depending on the run rate scored/ recorded in the matches.Assume your MS Excel Linear
own database. regressio
n
Design the ER diagram and create schema of the REVA library Entity Entity 58
11 Relations
management system. Relationship
hip
Design the ER diagram and create schema for Hospital Management Entity diagrams
Schema 60
12
system. Relationship design
PART_B:Projects
Tools Expected Page No.
No Titleof the Experiment andTechniq Skill/Ability
ues
Apply Linear 63
1 Big Mart sales forecasting MS Excel regression
Apply Linear 66
2 Bangalore crime analysis MS Excel regression
9
Introduction to Data Science Lab REVA University
PART-A
Experiments
10
Introduction to Data Science Lab REVA University
1. The height(in cm) of a group of fathers and sons are given below,Find the lines of regression and
estimate the height of son when the height of father is 164 cm.
Hgt of
158 166 163 165 167 170 167 172 177 181
Fathers
Hgt of
163 158 167 170 160 180 170 175 172 175
Sons
Solution:
Step 2: Enter the data for Independent Variable x and dependent variable y as shown above 11 values
11
Introduction to Data Science Lab REVA University
12
Introduction to Data Science Lab REVA University
13
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
14
Introduction to Data Science Lab REVA University
Y=A+Bx
15
Introduction to Data Science Lab REVA University
=66.11417+164*0.610236
=166.1929
Manual calculation
((sum(x-
mean(x) *y-
mean(y)) ) /
coefficient of (x-
regression=byx mean(x)^2) 0.61023622 0.61023622
A=mean(y)-
calculate y intercept byx*mean(x) 66.11417323
dependent
variable=y=A+byx*x 166.192913
16
Introduction to Data Science Lab REVA University
2. Using the data file DISPOSABLE INCOME AND VEHICLE SALES, perform the following:
vi) Plot a scatter diagram.
vii) Determine the regression equation.
viii) Plot the regression line (hint: use MS Excel's Add Trendline feature).
ix) Compute the predicted vehicle sales for disposable income of $16,500 and of $17,900.
Compute the coefficient of determination and the coefficient of correlation
disposable vehicle
income sales
15000 200
28000 300
18000 180
19000 190
22000 220
25000 250
10000 100
29000 285
a)
0
0 10000 20000 30000 40000
-10
-20
15000
y=A+Bx
17
Introduction to Data Science Lab REVA University
c)Regression line:
5
0
0 5000 10000 15000 20000 25000 30000 35000
-5
-10
-15
15000
d)
Step 1: Enter the data for Independent Variable x and dependent variable y as shown above 8 values
18
Introduction to Data Science Lab REVA University
19
Introduction to Data Science Lab REVA University
20
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
21
Introduction to Data Science Lab REVA University
22
Introduction to Data Science Lab REVA University
r=∑(xi−¯x)(yi−¯y ) / √ ∑(xi−¯x)2∑(yi−¯y)2
SSR=∑(y^−y¯)2
SSE=∑(yi−y^)2
SSTO=∑(y−y¯)2
r2=SSR/SSTO=1−SSE/SSTO
244.140
15000 200 -5750 -15.625 33062500 6 89843.75
7119.14
28000 300 7250 84.375 52562500 1 611718.8
1269.14
18000 180 -2750 -35.625 7562500 1 97968.75
656.640
19000 190 -1750 -25.625 3062500 6 44843.75
19.1406
22000 220 1250 4.375 1562500 3 5468.75
1181.64
25000 250 4250 34.375 18062500 1 146093.8
13369.1
10000 100 -10750 -115.625 115562500 4 1242969
4812.89
29000 285 8250 69.375 68062500 1 572343.8
299500000. 28671.8
sum(x) sum(y) 0.00 0 00 8 2811250
166000 1725
mean(x) mean(y)
20750 215.625
((sum(x-
mean(x) *y-
coefficient of mean(y)) ) /
regressionbyx= (x- 0.01
23
Introduction to Data Science Lab REVA University
mean(x)^2)
A=mean(y)-
calculate y byx*mean(x 20.855592
intercept ) 65
dependent
variable=y=A+by 175.73247 when When Y=188.87
x*x 08 x=16500 x=17900 35
correlation 0.959341
determination=SS -20601.0189
R
SSTO= 28671.88
-0.71851
So the answer is
24
Introduction to Data Science Lab REVA University
3.Managers model costs in order to make predictions. The cost data in the data file INDIRECT
COSTS AND MACHINE HOURS show the indirect manufacturing costs of an ice-skate
manufacturer. Indirect manufacturing costs include maintenance costs and setup costs. Indirect
manufacturing costs depend on the number of hours the machines are used, called machine hours.
Based on the data for January to December, perform the following operations.
Solution:
Step 1:
no. of
hours
machine machine
month used cost
Jan 50 100
Feb 350 700
Mar 100 200
Apr 400 800
May 150 300
Jun 450 900
Jul 200 400
Aug 500 1000
Sept 250 500
Oct 550 1100
Nov 300 600
Dec 600 1200
25
Introduction to Data Science Lab REVA University
0.5
0
0 200 400 600 800
X Variable 1
y=A+Bx
c)Regression line:
0.5
0
0 200 400 600 800
X Variable 1
d)
26
Introduction to Data Science Lab REVA University
Step 1: Enter the data for Independent Variable x and dependent variable y as shown above 12 values
27
Introduction to Data Science Lab REVA University
28
Introduction to Data Science Lab REVA University
Step 6: We will get the output as shown with residual plot graph
29
Introduction to Data Science Lab REVA University
The value of cost when the machine hours are 300 and 430 are:
Y300=0+300*2
=600
Y430=0+430*2
=860
r=∑(xi−¯x)(yi−¯y ) / √ ∑(xi−¯x)2∑(yi−¯y)2
SSR=∑(y^−y¯)2
SSE=∑(yi−y^)2
SSTO=∑(y−y¯)2
r2=SSR/SSTO=1−SSE/SSTO
Manual Calculation:
30
Introduction to Data Science Lab REVA University
325 650
((sum(x-
mean(x) *y-
mean(y)) ) /
coefficient of (x-
regressionbyx= mean(x)^2) 2.00
A=mean(y)-
byx*mean(x
calculate y intercept ) 0
dependent
variable=y=A+byx* when When
x 860 x=430 X=300 Y=600
correlation 1
determination=SSR 160
1430000
SSTO=
31
Introduction to Data Science Lab REVA University
4. Apply multiple linear regression to predict the stock index price which is a dependent variable
of a fictitious economy based on two independent / input variablesinterest rateand unemployment
rate.
stock
mont interes unemploymen
year index
h t rate t rate
price
2020 10 2.75 5.3 1464
Solution:
stock
interest unemployment
year month index
rate rate
price
2020 10 2.75 5.3 1464
2020 1 2.5 5.1 1300
2020 2 2.4 5 1200
2020 3 2 4.5 1000
2020 4 3 5.5 1500
2020 6 3.5 5.7 1650
2020 8 2.9 5 1600
2020 9 2 5 1400
Step 2: Enter the data for Independent Variable x and dependent variable y as shown above 8 values
32
Introduction to Data Science Lab REVA University
33
Introduction to Data Science Lab REVA University
34
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
35
Introduction to Data Science Lab REVA University
5. Calculate the total interest paid on a car loan which has been availed from HDFC bank. For
example, Rs.10,00,000 has been borrowed from a bank with annual interest rate of 5.2% and
the customer needs to pay every month as shown in table below. Calculate the total interest
rate paid for a loan availed of Rs.10,00,000during 3 years.
Sl
A B
No.
1 Principal Rs.10,00,000
2 Annual interest rate 5.20%
3 Year of the loan 3
Starting payment
4 1
number
Ending payment
5 36
number
total interest paid during
6 ?
period
One of the easiest ways to calculate the EMI on your loan is by using Microsoft Excel. Excel provides a
simple formula for this purpose: PMT (rate, nper, pv, [fv], [type]).
PMT stands for Payment – and gives periodic loan payment or EMI value as an output. The arguments
required for this function are:
36
Introduction to Data Science Lab REVA University
Sl
No. A B
1 Principal Rs.10,00,000
Annual interest
2 rate 5.20%
3 Year of the loan 3
Starting payment
4 number 1
Ending payment
5 number 36
total interest paid
6 during period ($90,541.68)
37
Introduction to Data Science Lab REVA University
Solution:
mysql> insert into supplier values("333" ,"anthnony" ,"pencil" ,"20" ,"peenya" ,"bangalore" ,12345688
,560057);
mysql> insert into supplier values("666" ,"manyata" ,"sharpner" ,"25" ,"hebbal", "bangalore" ,99699988
,650078);
mysql> insert into supplier values("888" ,"jennifer" ,"sharpner" ,"25" ,"hebbal" ,"bangalore" ,89699988
,789098);
38
Introduction to Data Science Lab REVA University
+-------------+---------------+------------+----------+----------+-----------+----------+---------+
+-------------+---------------+------------+----------+----------+-----------+----------+---------+
+-------------+---------------+------------+----------+----------+-----------+----------+---------+
39
Introduction to Data Science Lab REVA University
40
Introduction to Data Science Lab REVA University
41
Introduction to Data Science Lab REVA University
42
Introduction to Data Science Lab REVA University
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
43
Introduction to Data Science Lab REVA University
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
44
Introduction to Data Science Lab REVA University
+-------------+---------------+-----------+----------------------+---------+-----------+---------+
45
Introduction to Data Science Lab REVA University
8. Apply linear regression to find the weather (temperature) of a city with the amount of rain
in centimeters. Create your own database with following details.
CITY Temperature Rain in
in Centigrade Centimeters
Solution:
Step 2: Enter the data for Independent Variable x and dependent variable y as shown above 11 values
46
Introduction to Data Science Lab REVA University
47
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
48
Introduction to Data Science Lab REVA University
49
Introduction to Data Science Lab REVA University
9. Use the linear regression technique to compare the age of humans with the amount of sleep in
hours.
Name Age in Years Sleep in hours
Create your own database with above details.
Solution:
Step 2: Enter the data for Independent Variable x and dependent variable y as shown above 11 values
50
Introduction to Data Science Lab REVA University
51
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
52
Introduction to Data Science Lab REVA University
53
Introduction to Data Science Lab REVA University
10. Apply the linear regression, compare the average salaries of batsman depending on the run rate
scored/ recorded in the matches. Assume your own database.
Solution:
Step 2: Enter the data for Independent Variable x and dependent variable y as shown above 11 values
54
Introduction to Data Science Lab REVA University
55
Introduction to Data Science Lab REVA University
Step 7: We will get the output as shown with residual plot graph
56
Introduction to Data Science Lab REVA University
57
Introduction to Data Science Lab REVA University
11. Design the ER diagram and create schema of the REVA library management system.
Staff
Name Staff_id
Readers
User_id Name Phone no Email Address
Books
ISBN Category AuthNo Title Edition Price
Authentication_System
Login_id Password
Publisher
58
Introduction to Data Science Lab REVA University
Reports
User_id Register_no Book_no Issue/Return
59
Introduction to Data Science Lab REVA University
12. Design the ER diagram and create schema for Hospital Management system.
Patient
Doctor
Doc_id Doc_name
Record
Assistant
Batch_no Ward_no
Test
Test_no Test_type
60
Introduction to Data Science Lab REVA University
Account
Acc_id Description
61
Introduction to Data Science Lab REVA University
PART-B
Mini-Project
62
Introduction to Data Science Lab REVA University
Data Science plays a huge role in forecasting sales and risks in the retail sector. Majority of the leading
retail stores implement Data Science to keep a track of their customer needs and make better business
decisions. Big Mart is one such retailer.
Problem Statement: To analyze the Big Mart Sales Data set in order to predict department-wise sales for
each of their stores.
Data Set Description: The data set used for this project contains historical training data, which covers
sales details from 2010-02-05 to 2012-11-01. For the analysis of this problem, the following predictor
variables are used:
By studying the dependency of these predictor variables on the response variable, you can predict or
forecast sales for the upcoming months.
Logic:
1. Import the Data Set: The data set needed for this project can be downloaded from Kaggle.
2. Data Cleaning: In this stage, you must make sure to get rid of all inconsistencies, such as missing
values and any redundant variables.
3. Data Exploration: At this stage, you can plot boxplots and qplots to understand the significance of
each predictor variables.
4. Data Modelling: For this particular problem statement, since the outcome is a continuous variable
(Number of sales), it is reasonable to build a Regression model. The Linear Regression algorithm
can be used to solve such problems since it is specifically used to predict continuous dependent
variables.
5. Validate the model: At this stage, you should evaluate the efficiency of the data model by using
the testing data set
Solution:
The Big Mart sales obtained from Kaggle website is as shown below:
Test .csv
63
Introduction to Data Science Lab REVA University
2. Data Cleaning:
project1.xlsx
3. Data Exploration:
100
50
Residuals
0
0 5 10 15 20 25
-50
-100
20.75
Using the Regression tool of excel , the dataset obtained is analysed as shown below . The y intercept and
Coefficient of regression values are as shown in the below snapshot.
64
Introduction to Data Science Lab REVA University
65
Introduction to Data Science Lab REVA University
With the increase in the number of crimes taking place in Bangalore, law enforcement agencies are trying
their best to understand the reason behind such actions. Analyses like these can not only help understand
the reasons behind these crimes, but they can also prevent further crimes.
Problem Statement: To analyze and explore the Bangalore Crime data set to understand trends and
patterns that will help predict any future occurrences of such felonies.
Data Set Description: The dataset used for this project consists of every reported instance of a crime in the
city of Bangalore from 01/01/2001
For this analysis, the data set contains many predictor variables such as:
Logic:
Like any other Data Science project, the below-described series of steps are followed:
1. Import the Data Set: The data set needed for this project can be downloaded from Kaggle.
2. Data Cleaning: In this stage, you must make sure to get rid of all inconsistencies, such as missing
values and any redundant variables.
3. Data Exploration: You can begin this stage by translating the occurrence of crimes into plots on a
geographical map of the city. Graphically studying each predictor variable will help you understand
which variables are essential for building the model.
4. Data Modelling: For this particular problem statement, since the nature of crimes varies, it is
reasonable to build a clustering model. K-means is the most suitable algorithm for this analysis
since it is easy to build clusters using k-means.
5. Analyzing patterns: Since this problem statement requires you to draw patterns and insights about
the crimes, this step mainly involves creating reports and drawing conclusions from the data model.
6. Validate the model: At this stage, you should evaluate the efficiency of the data model by using
the testing data set and finally calculate the accuracy of the model by using a confusion matrix.
Solution:
The Bangalore Crime dataset obtained from Kaggle website is as shown below:
66
Introduction to Data Science Lab REVA University
p2.xlsx
2. Data Cleaning:
p2.xlsx
3. Data Exploration:
0
0 1000 2000 3000 4000
-20000000
-40000000
X Variable 1
Using the Regression tool of excel , the dataset obtained is analysed as shown below . The y intercept and
Coefficient of regression values are as shown in the below snapshot.
67
Introduction to Data Science Lab REVA University
68
Introduction to Data Science Lab REVA University
Additional Projects
1. Walmarts Sales forcast
2. Movie Recommendation Engine
3. Text Mining
4. Chicago crime Analysis
69
Introduction to Data Science Lab REVA University
Appendix
Installation Guide
70
Introduction to Data Science Lab REVA University
Microsoft Excel is a spreadsheet program that is used to record and analyse numerical data. Think of a
spreadsheet as a collection of columns and rows that form a table. Alphabetical letters are usually assigned
to columns and numbers are usually assigned to rows. The point where a column and a row meet is called a
cell. The address of a cell is given by the letter representing the column and the number representing a row.
Running Excel is not different from running any other Windows program. If you are running Windows
with a GUI like (Windows XP, Vista, and 7) follow the following steps.
71
Introduction to Data Science Lab REVA University
Statistical analysis such as descriptive statistics and regression requires the Excel Data Analysis add-in.
The default configuration of Excel does not automatically support descriptive statistics and regression
analysis.
You may need to add these to your computer (a once-only operation).
Excel 2007: The Data Analysis add-in should appear at right-end of Data menu as Data Analysis.
If not then
72
Introduction to Data Science Lab REVA University
9. Click the Microsoft Office Button, and then click Excel Options.
2. Click Add-Ins, and then in the Manage box, select Excel Add-ins.
73
Introduction to Data Science Lab REVA University
74
Introduction to Data Science Lab REVA University
If you get prompted that the Analysis ToolPak is not currently installed on your computer,
75
Introduction to Data Science Lab REVA University
13. After you load the Analysis ToolPak, the Data Analysis command is available in the Analysis
group on the Data tab.
This is one time procedure to be followed to add on the Data Analysis button to the Excel. After this
procedure is done.
76