Data Science - Notes
• Importing data into R: This typically means taking data stored in a file, database, or
web API, and loading it into a data frame in R.
• After importing data, tidy it: Tidying your data means storing it in a consistent form
that matches the semantics of the dataset with the way it is stored. In brief, when your
data is tidy, each column is a variable, and each row is an observation.
• Transformation: Transformation includes narrowing in on observations of interest,
creating new variables that are functions of existing variables, and calculating a set of
summary statistics. Together, tidying and transforming are called wrangling.
• Engines of knowledge generation: visualization and modelling: Visualization is a
fundamentally human activity. A good visualization will show you things that you did
not expect, or raise new questions about the data. A good visualization might also hint
that you’re asking the wrong question, or you need to collect different data. Models
are complementary tools to visualization. Once you have made your questions
sufficiently precise, you can use a model to answer them. Models are a fundamentally
mathematical or computational tool, so they generally scale well.
• Communication: It doesn’t matter how well your models and visualization have led
you to understand the data unless you can also communicate your results to others.
2. Introduction to R Programming
The R Foundation describes R as “a language and environment for statistical computing and
graphics.”
➢ Real-Time Applications of R
R for data science is used in industries such as banking, telecommunications, and media. A few
examples of data visualization in R through real-life projects are as follows:
3. Installation of R on Windows
Step – 5: Run the .exe file and follow the installation instructions.
5.a. Select the desired language and then click Next.
5.d. Enter/browse the folder/path you wish to install R into and then confirm by
clicking Next.
5.e. Select additional tasks like creating desktop shortcuts etc. then click Next.
Step – 2: Click on the link for the Windows version of RStudio and save the .exe file.
Step – 3: Run the .exe and follow the installation instructions.
3.a. Click Next on the welcome window.
3.b. Enter/browse the path to the installation folder and click Next to proceed.
3.c. Select the folder for the start menu shortcut or click on do not create shortcuts and then
click Next.
3.d. Wait for the installation process to complete.
3.e. Click Finish to end the installation.
➢ “R” Objects
An object refers to anything that can be assigned to a variable. Each object has two
attributes:
Data structures are the objects that are manipulated regularly in R. They are used to store data
in an organized fashion to make data manipulation and other data operations more efficient,
reducing the space and time complexity of various tasks.
➢ Vectors
A vector is one of the basic data structures in R. It is homogeneous, which means that it only
contains elements of the same data type. Data types can be numeric, integer, character,
complex, or logical.
Vectors are created by using the c() function. If the elements passed are of different data
types, they are coerced to the most general type present, in the order logical, integer,
double, character.
The typeof() function is used to check the data type of the vector, and the class() function is
used to check the class of the vector.
typeof(Vec1)
typeof(Vec2)
Output:
[1] "double"
[1] "character"
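The coercion rule described above can be sketched as follows. Vec1, Vec2, and Vec3 hold assumed, illustrative values; they are not necessarily the vectors that produced the output shown above.

```r
# Illustrative values, chosen only to demonstrate coercion.
Vec1 <- c(44, 25, 64, 96, 30)   # all numeric             -> "double"
Vec2 <- c(1, "a", 2, "b")       # numeric and character   -> "character"
Vec3 <- c(TRUE, 2L)             # logical and integer     -> "integer"

print(typeof(Vec1))
print(typeof(Vec2))
print(typeof(Vec3))
print(class(Vec1))              # the class of a double vector is "numeric"
```

Each vector ends up with the most general type among its elements, which is why mixing a number with a string yields a character vector.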
• Elements of a vector can be accessed by using their respective indexes. [ ] brackets are
used to specify indexes of the elements to be accessed.
For example:
x <- c("Jan","Feb","March","Apr","May","June","July")
y <- x[c(3,2,7)]
print(y)
Output:
[1] "March" "Feb" "July"
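Logical indexes, negative indexes, and 0/1 indexes can also be used.
For example:
x <- c("Jan","Feb","March","Apr","May","June","July")
y <- x[c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE)]
z <- x[c(-3,-7)]
c <- x[c(0,0,0,1,0,0,1)]
print(y)
print(z)
print(c)
Output:
[1] "Jan" "March" "June" "July"
[1] "Jan" "Feb" "Apr" "May" "June" (All corresponding values for negative indexes are dropped)
[1] "Jan" "Jan" (Zero indexes are ignored, and index 1 is selected twice)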
➢ Vector Arithmetic
You can perform addition, subtraction, multiplication, and division on vectors that have the
same number of elements in the following ways:
v1 <- c(4,6,7,31,45,9)
v2 <- c(54,1,10,86,14,57)
add.v <- v1+v2
sub.v <- v1-v2
multi.v <- v1*v2
divi.v <- v1/v2
print(add.v)
print(sub.v)
print(multi.v)
print(divi.v)
Output:
[1] 58 7 17 117 59 66
If arithmetic operations are performed on vectors of unequal length, the elements of the
shorter vector are recycled until they match the length of the longer vector. For example:
v1 <- c(8,7,6,5,0,1)
v2 <- c(7,15)
add.v <- v1+v2
sub.v <- v1-v2
print(add.v)
print(sub.v)
Output:
[1] 15 22 13 20 7 16
[1] 1 -8 -1 -10 -7 -14
➢ Sorting a Vector
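You can sort the elements of a vector by using the sort() function in the following way:
v <- c(4,78,-45,6,89,678)
sort.v <- sort(v)
print(sort.v)
revsort.v <- sort(v, decreasing = TRUE)
print(revsort.v)
v <- c("Jan","Feb","March","April")
sort.v <- sort(v)
print(sort.v)
revsort.v <- sort(v, decreasing = TRUE)
print(revsort.v)
Output:
[1] -45 4 6 78 89 678
[1] 678 89 78 6 4 -45
[1] "April" "Feb" "Jan" "March"
[1] "March" "Jan" "Feb" "April"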
➢ Lists
A list is a non-homogeneous data structure, which implies that it can contain elements of
different data types. It accepts numbers, characters, lists, and even matrices and functions
inside it. It is created by using the list() function.
For example:
list1 <- list("Sam", "Green", c(8,2,67), TRUE, 51.99, 11.78, FALSE)
print(list1)
Output:
[[1]]
[1] "Sam"
[[2]]
[1] "Green"
[[3]]
[1] 8 2 67
[[4]]
[1] TRUE
[[5]]
[1] 51.99
[[6]]
[1] 11.78
[[7]]
[1] FALSE
The elements of a list can be accessed by using the indices of those elements.
For example:
print(list2[1])
print(list2[2])
print(list2[3])
Output:
[[1]]
[1,] 3 5 -2
[2,] 9 1 8
[[1]]
[1,] 3 5 -2
[[1]]
[[1]][[1]]
[1] 3
[[1]][[2]] (Third element of the list)
[1] 4
[[1]][[3]]
[1] 5
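The difference between single and double brackets when accessing list elements can be sketched as below. The list demo is a hypothetical stand-in with assumed contents, used only to contrast the two operators.

```r
# Hypothetical list used only to contrast the two indexing operators.
demo <- list(c(3, 5, -2), c("Jan", "Feb", "Mar"), list(3, 4, 5))

sub_list <- demo[2]      # single brackets return a list of length 1
element  <- demo[[2]]    # double brackets return the element itself

print(class(sub_list))   # "list"
print(class(element))    # "character"
print(demo[[3]][[2]])    # second item of the nested list
```

Single brackets keep the list wrapper, which is why the printed output of list2[1] starts with [[1]].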
You can add an element at the end of a list by assigning to the next index, and delete an
element by assigning NULL to it.
For example:
list2[4] <- "Hello"
print(list2[4])
Output:
[[1]]
[1] "Hello"
Similarly,
list2[4] <- NULL
print(list2[4])
Output:
[[1]]
NULL
print(list2[3])
Output:
[[1]]
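➢ Matrices
A matrix is a two-dimensional data structure that is homogeneous, meaning that it only accepts
elements of the same data type. Coercion takes place if elements of different data types are
passed. It is created by using the matrix() function.
Syntax: matrix(data, nrow, ncol, byrow, dimnames)
where,
data is the input vector whose elements fill the matrix
nrow is the number of rows
ncol is the number of columns
byrow is a logical value; if TRUE, the matrix is filled row by row
dimnames is an optional list of row and column names
For example:
M1 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = TRUE)
print(M1)
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
M2 <- matrix(c(1:9), nrow = 3, ncol = 3, byrow = FALSE)
print(M2)
Output:
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9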
➢ Accessing the Elements of a Matrix
To access the elements of a matrix, row and column indices are used in the form M[row, column].
For accessing the elements of the matrix M2 created above, use the following syntax:
print(M2[1,1])
print(M2[3,3])
print(M2[2,3])
Output:
[1] 1
[1] 9
[1] 8
➢ Factor
Factors are used in data analysis for statistical modeling. They are used to categorize unique
values in columns, such as “Male”, “Female”, “TRUE”, “FALSE”, etc., and store them as
levels. They can store both strings and integers. They are useful in columns that have a
limited number of unique values.
Factors can be created using the factor() function and they take vectors as inputs.
For example:
print(data)
print(factor.data)
Output:
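A minimal sketch of creating a factor is given below. The contents of data are assumed for illustration; only the variable names data and factor.data come from the example above.

```r
# Illustrative input vector (assumed values).
data <- c("Male", "Female", "Male", "Child", "Child", "Male", "Female", "Female")
factor.data <- factor(data)

print(factor.data)
print(levels(factor.data))   # the unique values, sorted alphabetically
print(table(factor.data))    # number of observations per level
```

The factor stores each value as an integer code that points into the levels, which is what makes it compact for columns with few unique values.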
➢ Data Frame
Data frame is a two-dimensional array-like structure that also resembles a table, in which
each column contains values of one variable and each row contains one set of values from
each column.
A data frame has the following characteristics:
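You can use the following syntax for creating a data frame in R programming:
emp.data <- data.frame(
empid = c(1:4),
empname = c("Sam","Rob","Max","John"),
empdept = c("Sales","Marketing","HR","R&D")
)
print(emp.data)
Output:
  empid empname   empdept
1     1     Sam     Sales
2     2     Rob Marketing
3     3     Max        HR
4     4    John       R&D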
To extract specific columns from a data frame, use the following syntax:
result <- data.frame(emp.data$empname, emp.data$empdept)
print(result)
Output:
  emp.data.empname emp.data.empdept
1              Sam            Sales
2              Rob        Marketing
3              Max               HR
4             John              R&D
To extract specific rows from a data frame, use the following syntax:
result <- emp.data[1:2,]
print(result)
Output:
  empid empname   empdept
1     1     Sam     Sales
2     2     Rob Marketing
The following code extracts the first and third rows with the second and third columns.
result <- emp.data[c(1,3), c(2,3)]
print(result)
Output:
  empname empdept
1     Sam   Sales
3     Max      HR
➢ Adding a Column to a Data Frame
To add a salary column to the above data frame, you can use the following syntax:
n <- emp.data
print(n)
3 3 Max HR 40000
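Adding a column with the $ operator can be sketched as follows. The data frame is rebuilt here so the example is self-contained, and the salary figures other than the 40000 shown for Max are assumed values.

```r
emp.data <- data.frame(
  empid   = 1:4,
  empname = c("Sam", "Rob", "Max", "John"),
  empdept = c("Sales", "Marketing", "HR", "R&D")
)

# Assumed salary values, one per existing row.
emp.data$salary <- c(20000, 30000, 40000, 27000)

n <- emp.data
print(n)
```

The vector assigned to the new column must have one value per row of the data frame (or a length that recycles evenly).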
To add new rows to an existing data frame, create a new data frame that contains the new
rows, and then merge it with the existing data frame using the rbind() function.
# emp.newdata is an illustrative name for the data frame holding the new rows
emp.newdata <- data.frame(
empid = c(5:7),
empname = c("Frank","Tony","Eric"),
empdept = c("IT","Operations","Finance"),
salary = c(32000,51000,45000)
)
➢ Merging the New Data Frame with the Existing Data Frame
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
Output:
Sr. No. Empid Empname empdept Salary
3 3 Max HR 40000
5 5 Frank IT 32000
➢ Arrays
An array is a multi-dimensional data structure that stores homogeneous elements.
Conceptually, the elements are stored in contiguous memory locations referred to by the
array name, so the position of an element can be calculated by adding an offset to the base
location.
Syntax: array(c(elements), dim = c(rows, cols, number_of_matrices))
For example:
> vec1<-c(1,2,3,4,5,6)
> vec2<-c(7,8,9,10,11,12)
> a1<-array(c(vec1,vec2),dim=c(2,3,2))
> print(a1)
Output:
, , 1 #First matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
, , 2 #Second matrix
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
6. CONTROL STRUCTURES
➢ if Condition in R
The code block inside an if statement is carried out only if the condition evaluates to TRUE.
Unlike some languages, R has no then keyword: the block follows the condition directly.
Syntax:
if (test_expression) {
statement
}
Example:
values <- 1:10
Output:
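A minimal sketch of an if statement built on the values vector above. The condition and the message are assumptions chosen for illustration.

```r
values <- 1:10

# Hypothetical condition: report when the vector holds more than five elements.
if (length(values) > 5) {
  msg <- "values has more than 5 elements"
  print(msg)
}
```

Because the condition is TRUE here, the block runs and the message is printed; with a shorter vector nothing would happen.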
➢ if-else Condition in R
An if…else statement contains the same elements as an if statement with some extra
elements:
• The keyword else, placed after the first code block.
• The second block of code, contained within braces, that has to be carried out, only if the
result of the condition in the if() statement is FALSE.
Syntax:
if (test_expression) {
statement
} else {
statement
}
Example:
val1 = 10 #Creating our first variable val1
Output:
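The if...else form can be sketched as follows. val1 comes from the example above; val2 and the messages are assumed for illustration.

```r
val1 <- 10   # first variable, as in the example above
val2 <- 5    # assumed second variable for the comparison

if (val1 > val2) {
  result <- "val1 is greater than val2"
} else {
  result <- "val1 is not greater than val2"
}
print(result)
```

Note that else must appear on the same line as the closing brace of the if block when the code is run at the top level.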
➢ For loop in R
A loop is a sequence of instructions that is repeated until a certain condition is reached. The
keywords for, while, and repeat, with the additional clauses break and next, are used to
construct loops.
A for loop is executed a known number of times. Its body is a block contained within curly
braces.
Example:
values <- c(1,2,3,4,5)
for(id in 1:5){
print(values[id])
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
➢ while Loop in R
The format is while(cond) expr, where cond is the condition to test and expr is an
expression.
Example:
val = 2.987
print(c(val,val-2,val-1))
}
Output:
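A self-contained sketch of the while(cond) expr form. The counter, the stopping condition, and the running total are assumptions chosen for illustration and are not the values of the example above.

```r
# Assumed starting value and stopping condition, chosen for illustration.
count <- 1
total <- 0
while (count <= 5) {
  total <- total + count   # accumulate 1 + 2 + ... + 5
  count <- count + 1       # update so the condition eventually becomes FALSE
}
print(total)
```

The loop body must change something that the condition depends on; otherwise the loop never terminates.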
7. Functions
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
➢ Function Definition
An R function is created by using the keyword function. The basic syntax of an R function
definition is as follows −
function_name <- function(arg_1, arg_2, ...) {
Function body
}
➢ Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc.
They are directly called by user written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
➢ User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once
created they can be used like the built-in functions. Below is an example of how a function is
created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
Calling a Function
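A short sketch of calling the user-defined function from the previous example. The function is repeated so the block runs on its own, and the argument 4 is an arbitrary choice.

```r
# Function from the example above: prints the squares of 1..a.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}

# Call the function, supplying 4 as the argument.
new.function(4)
```

The call prints the squares of 1 through 4, one per line.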
install.packages("tidyverse")
Once you have installed a package, you can load it with the library() function:
library(tidyverse)
Example:
> laptops <- read.csv("C:/Users/jaideep/Downloads/DataSets-master/DataSets-master/laptops.csv")
>View(laptops)
Output:
>library(dplyr)
> View(laptops1_2)
> View(laptops3_6)
Output:
> View(lap)
filter()
laptops %>% filter(Manufacturer=="Dell")->dell_laptop
> read.csv("abc.csv")
Output:
Input Data
Create an XML file by copying the data below into a text editor such as Notepad. Save the file
with a .xml extension, choosing the file type as All Files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
The xml file is read by R using the function xmlParse(). It is stored as a list in R.
# Load the package required to read XML files.
library("XML")
Printing the parsed result displays each EMPLOYEE record (ID, NAME, SALARY, STARTDATE, DEPT)
from the input file above.
JSON stands for JavaScript Object Notation. JSON files contain data in a human-readable
format, i.e. as text. Like any other file, one can read from as well as write into
JSON files. In order to work with JSON files in R, one needs to install
the "rjson" package. The most common tasks done using JSON files with the rjson
package are as follows:
• Install and load the rjson package in R console
• Create a JSON file
• Reading data from JSON file
• Write into JSON file
• Converting the JSON data into Dataframes
• Working with URLs
Install and load the rjson package
One can install the rjson from the R console using the install.packages() command in the
following way:
install.packages("rjson")
After installing rjson package one has to load the package using the library() function
as follows:
library("rjson")
Creating a JSON file
To create a JSON file, one can do the following steps:
Copy the data given below into Notepad or any other text editor. One can also
create their own data in the given format.
{
"ID":["1","2","3","4","5"],
"Name":["Mithuna","Tanushree","Parnasha","Arjun","Pankaj"],
"Salary":["722.5","815.2","1611","2829","843.25"],
"StartDate":["6/17/2014","1/1/2012","11/15/2014","9/23/2013","5/21/2013"],
"Dept":["IT","IT","HR","Operations","Finance"]
}
Choose “all types” as the file type and save the file with a .json extension (Example:
example.json).
One must make sure that the information or data is contained within a pair of curly
braces { }.
library("rjson")
# Read the JSON file saved above (example.json)
result <- fromJSON(file = "example.json")
print(result)
Output:
$ID
[1] "1" "2" "3" "4" "5"
$Name
[1] "Mithuna" "Tanushree" "Parnasha" "Arjun" "Pankaj"
$Salary
[1] "722.5" "815.2" "1611" "2829" "843.25"
$StartDate
[1] "6/17/2014" "1/1/2012" "11/15/2014" "9/23/2013" "5/21/2013"
$Dept
[1] "IT" "IT" "HR" "Operations" "Finance"
library("rjson")
write(jsonData, "result.json")
print(result)
Output:
[[1]]
[1] "sunflower" "guava" "hibiscus"
[[2]]
[1] "flower" "fruit" "flower"
library(RJSONIO)
head(Names)
Output:
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
RMySQL Package
RMySQL is an R package (not part of base R) that provides connectivity between R and MySQL
databases. It can be installed with the following command:
install.packages("RMySQL")
Connecting MySQL with R Programming Language
R requires the RMySQL package to create a connection object, which takes the username,
password, hostname, and database name as arguments. The dbConnect() function is used to
create the connection object in R.
Syntax: dbConnect(drv, user, password, dbname, host)
Parameter values:
• drv represents Database Driver
• user represents username
• password represents password value assigned to Database server
• dbname represents name of the database
• host represents host name
Example:
# Install package
install.packages("RMySQL")
# Loading library
library("RMySQL")
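# Create connection (all argument values below are placeholders; use your own)
mysqlconn <- dbConnect(MySQL(), user = "your_user", password = "your_password",
dbname = "your_database", host = "localhost")
# List the tables in the database
dbListTables(mysqlconn)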
Output:
<MySQLResult:1702063201,0,2>
Database content:
Selecting Data from MySql table using R
> library(readxl)
> head(Data1)
OUTPUT:
To build a ggplot, we will use the following basic template that can be used for different
types of plots:
• add ‘geoms’ – graphical representations of the data in the plot (points, lines,
bars). ggplot2 offers many different geoms; we will use some common ones today,
including:
o geom_point() for scatter plots, dot plots, etc.
o geom_boxplot() for, well, boxplots!
o geom_line() for trend lines, time series, etc.
Example:
Consider the diamonds dataset for this example. Use the command View(diamonds) to view the
dataset and ?diamonds to get complete information about it.
>library(tidyverse)
> library(ggplot2)
> ggplot(data=diamonds)
> ggplot(data=diamonds,aes(x=price))+geom_histogram()
Output: Histogram
> ggplot(data=diamonds,aes(x=price))+geom_bar()
Output: Bar chart
Arithmetic Mean
Formula:
$$X = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where,
X indicates the arithmetic mean
x_i indicates the ith value in the data vector
n indicates the total number of observations
In R language, the arithmetic mean can be calculated by the mean() function.
Syntax: mean(x, trim, na.rm = FALSE)
Parameters:
x: Represents the object (for example, a numeric vector)
trim: Specifies the fraction (between 0 and 0.5) of observations to be trimmed from each
end of x before the mean is calculated
na.rm: If TRUE, removes NA values from x before the calculation
Example:
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print mean
print(mean(x))
Output:
[1] 21.5
Geometric Mean
The geometric mean is a type of mean that is computed by multiplying all n data
values and taking the nth root of the product, and thus shows the central tendency
of a given data distribution.
Formula:
$$X = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$
where,
X indicates the geometric mean
x_i indicates the ith value in the data vector
n indicates the total number of observations
There is no direct function for the geometric mean, so the prod() and length() functions
are used to find it for a given set of numbers.
Syntax:
prod(x)^(1/length(x))
where,
prod() function returns the product of all values present in vector x
length() function returns the length of vector x
Example:
# Defining vector
x <- c(1, 5, 9, 19, 25)
# Print geometric mean
print(prod(x)^(1 / length(x)))
Output:
[1] 7.344821
Harmonic Mean
Harmonic mean is another measure of central tendency. It is computed as the
reciprocal of the arithmetic mean of the reciprocals of the given set of values.
Formula:
$$X = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
where,
X indicates the harmonic mean
x_i indicates the ith value in the data vector
n indicates the total number of observations
Example:
Modifying the code to find the harmonic mean of a given set of values.
# Defining vector
x <- c(1, 5, 8, 10)
# Print harmonic mean
print(1 / mean(1 / x))
Output:
[1] 2.807018
Median
Median in statistics is another measure of central tendency which represents the
middlemost value of a given set of values.
In R language, median can be calculated by median() function.
Syntax: median(x, na.rm = FALSE)
Parameters:
x: It is the data vector
na.rm: If TRUE then removes the NA value from x
Example:
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print Median
median(x)
Output:
[1] 21.5
Mode
The mode of a given set of values is the value that is repeated most often in the set.
There can be multiple modes if two or more values share the maximum frequency.
Example 1: Single-mode value
In R language, there is no built-in function to calculate the statistical mode (the
mode() function returns the storage mode of an object instead). So, the code below uses
table() to find the mode for a given set of values.
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8)
y <- table(x)
print(y)
# Mode of x
m <- names(y)[which(y == max(y))]
# Print mode
print(m)
Output:
x
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 2
[1] "23"
Example 2: Multiple Mode values
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37,45, 1, 25, 8, 56, 56)
y <- table(x)
print(y)
# Mode of x
m <- names(y)[which(y == max(y))]
# Print mode
print(m)
Output:
x
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 4
[1] "23" "56"
➢ Measures of Variability
Following are some of the measures of variability that R offers to differentiate
between data sets:
• Variance
• Standard Deviation
• Range
• Mean Deviation
• Interquartile Range
Variance
Variance is a measure that shows how far each value is from a particular point,
usually the mean value. Mathematically, it is defined as the average of the squared
differences from the mean value.
Formula:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$
where,
σ² specifies the variance of the data set
x_i specifies the ith value in the data set
µ specifies the mean of the data set
n specifies the total number of observations
In the R language, there is a standard built-in function to calculate the variance of a
data set. Note that var() computes the sample variance, which divides by n - 1 rather
than n.
Syntax: var(x)
Parameter:
x: It is the data vector
Example:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Print variance of x
print(var(x))
Output:
[1] 23.76667
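The worked sketch below contrasts the two divisors. It assumes the same vector that the Range example uses, c(5, 5, 8, 12, 15, 16); the names sample_var and population_var are illustrative.

```r
x <- c(5, 5, 8, 12, 15, 16)
n <- length(x)

sample_var     <- sum((x - mean(x))^2) / (n - 1)  # divisor used by var()
population_var <- sum((x - mean(x))^2) / n        # divisor n, as in the formula

print(sample_var)
print(var(x))
print(population_var)
```

The first two printed values agree, which shows that var() uses the n - 1 divisor; the third is smaller by the factor (n - 1) / n.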
Standard Deviation
Standard deviation in statistics measures the spread of data values with respect
to the mean and, mathematically, is calculated as the square root of the variance.
Formula:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$$
where,
σ specifies the standard deviation of the data set
x_i specifies the ith value in the data set
µ specifies the mean of the data set
n specifies the total number of observations
In R language, the built-in sd() function calculates the standard deviation of a data
set. Equivalently, it can be computed as the square root of var(x), as the example
below does.
Example:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Standard deviation
d <- sqrt(var(x))
print(d)
Output:
[1] 4.875107
Range
Range is the difference between the maximum and minimum values of a data set. In R
language, max() and min() are used to compute this difference, whereas the range()
function returns the minimum and maximum values themselves.
Example:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
print(range(x))
print(max(x)-min(x))
Output:
[1] 5 16
[1] 11
Mean Deviation
Mean deviation is the arithmetic mean of the absolute differences of each value
from a central value. The central value can be the mean, median, or mode.
Formula:
$$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|$$
where,
x_i specifies the ith value in the data set
µ specifies the mean of the data set
n specifies the total number of observations
In R language, there is no standard built-in function to calculate the mean deviation
about the mean. So, the code below computes it directly.
Example:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Mean deviation
md <- sum(abs(x-mean(x)))/length(x)
print(md)
Output:
[1] 4.166667
Interquartile Range
Interquartile range is based on splitting a data set into parts called quartiles.
There are 3 quartile values (Q1, Q2, Q3) that divide the whole data set into 4 equal
parts. Q2 is the median of the whole data set.
Mathematically, the interquartile range is depicted as:
IQR = Q3 – Q1
where,
Q3 is the third quartile, roughly the median of the upper half of the values
Q1 is the first quartile, roughly the median of the lower half of the values
In R language, there is a built-in function to calculate the interquartile range of a
data set.
Syntax: IQR(x)
Parameter:
x: It specifies the data set
Example:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
print(IQR(x))
Output:
[1] 8.5
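As a sketch of where this value comes from, the quartiles can be computed with quantile() and subtracted by hand. The vector is the one from the Range example, and iqr_manual is an illustrative name.

```r
x <- c(5, 5, 8, 12, 15, 16)

# Q1 and Q3 using R's default quantile method (type 7).
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

print(q)
print(iqr_manual)
print(IQR(x))
```

IQR() uses the same default quantile method, so the manual difference and the built-in result agree.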
13. Hypothesis Testing
A hypothesis is an assumption that researchers make about the data collected for an
experiment or data set; it is not necessarily true. Hypothesis testing in R
programming is the process of testing or validating such a hypothesis. To perform
hypothesis testing, a random sample of data is taken from the population and the test
is performed on it. Based on the results of testing, the hypothesis is either retained
or rejected. This concept is known as statistical inference. The following sections
cover one-sample t-testing, two-sample t-testing, the directional hypothesis, the
one-sample and two-sample Wilcoxon tests, and the correlation test in R programming.
One Sample T-Testing
Example:
x <- rnorm(100)
t.test(x, mu = 5)
Output:
One Sample t-test
data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
Two Sample T-Testing
In two sample T-Testing, the sample vectors are compared. If var.equal = TRUE, the
test assumes that the variances of both the samples are equal.
Syntax: t.test(x, y)
Parameters:
x and y: Numeric vectors
Example:
x <- rnorm(100)
y <- rnorm(100)
t.test(x, y)
Output:
Welch Two Sample t-test
data: x and y
t = -1.0601, df = 197.86, p-value = 0.2904
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4362140 0.1311918
sample estimates:
mean of x mean of y
-0.05075633 0.10175478
Directional Hypothesis
Using the directional hypothesis, the direction of the hypothesis can be specified
like, if the user wants to know the sample mean is lower or greater than another
mean sample of the data.
Syntax: t.test(x, mu, alternative)
Parameters:
x: represents numeric vector data
mu: represents mean against which sample data has to be tested
alternative: sets the alternative hypothesis
Example:
x <- rnorm(100)
t.test(x, mu = 2, alternative = "greater")
Output:
One Sample t-test
data: x
t = -20.708, df = 99, p-value = 1
alternative hypothesis: true mean is greater than 2
95 percent confidence interval:
-0.2307534 Inf
sample estimates:
mean of x
-0.0651628
One Sample Wilcoxon Test
This type of test is used when a comparison has to be computed on one sample and the
data is non-parametric. It is performed using the wilcox.test() function in R programming.
Syntax: wilcox.test(x, y, exact = NULL)
Parameters:
x and y: represent numeric vectors
exact: a logical value which indicates whether an exact p-value should be computed
To know about more optional parameters of wilcox.test(), use below command:
help("wilcox.test")
Example:
# Define vector
x <- rnorm(100)
wilcox.test(x, exact = FALSE)
Output:
Wilcoxon signed rank test with continuity correction
data: x
V = 2555, p-value = 0.9192
alternative hypothesis: true location is not equal to 0
Two Sample Wilcoxon Test
This test is performed to compare two samples of non-parametric data.
Example:
# Define vectors
x <- rnorm(100)
y <- rnorm(100)
wilcox.test(x, y)
Output:
Wilcoxon rank sum test with continuity correction
data: x and y
W = 5300, p-value = 0.4643
alternative hypothesis: true location shift is not equal to 0
Correlation Test
This test is used to compare the correlation of the two vectors provided in the
function call or to test for the association between the paired samples.
Syntax: cor.test(x, y)
Parameters:
x and y: represents numeric data vectors
To know about more optional parameters in cor.test() function, use below
command:
help("cor.test")
Example:
cor.test(mtcars$mpg, mtcars$hp)
Output:
Pearson's product-moment correlation
library(MASS)
print(str(survey))
Output:
'data.frame': 237 obs. of 12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...
NULL
The above result shows that the dataset has many Factor variables, which can be
treated as categorical variables. For our model, we will consider the variables
“Exer” and “Smoke”. The Smoke column records the students' smoking habits while
the Exer column records their exercise level. Our aim is to test the hypothesis
that the students' smoking habit is independent of their exercise level at the .05
significance level.
# Create a data frame from the main data set.
stu_data = data.frame(survey$Smoke,survey$Exer)
# Create a contingency table with the needed variables.
stu_data = table(survey$Smoke,survey$Exer)
print(stu_data)
Output:
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
And finally we apply the chisq.test() function to the contingency table stu_data.
print(chisq.test(stu_data))
Output:
Pearson's Chi-squared test
data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828
As the p-value 0.4828 is greater than .05, we fail to reject the null hypothesis that
the smoking habit is independent of the exercise level of the students; the data show no
evidence of an association between the two variables.
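To make the statistic less of a black box, the sketch below recomputes it by hand from the contingency table printed above. The counts are typed in directly so that the MASS package is not needed; the names tab, expected, x_squared, df, and p_value are illustrative.

```r
# Contingency table copied from the output above (rows: Smoke, columns: Exer).
tab <- matrix(c( 7,  1,  3,
                87, 18, 84,
                12,  3,  4,
                 9,  1,  7),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Heavy", "Never", "Occas", "Regul"),
                              c("Freq", "None", "Some")))

# Expected count of each cell = row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic, degrees of freedom, and upper-tail p-value.
x_squared <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

print(c(x_squared = x_squared, df = df, p_value = p_value))
```

Several expected counts are small here, which is why chisq.test() may warn that the chi-squared approximation could be inaccurate for this table.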
The complete R code is given below.
# R program to illustrate
# Chi-Square Test in R
library(MASS)
print(str(survey))
stu_data = data.frame(survey$Smoke,survey$Exer)
stu_data = table(survey$Smoke,survey$Exer)
print(stu_data)
print(chisq.test(stu_data))
So, in summary, it can be said that it is very easy to perform a Chi-square test using
R. One can perform this task using chisq.test() function in R.
The Dataset
The mtcars (Motor Trend Car Road Tests) dataset is used, which consists of 32 cars
and 11 attributes. The dataset ships with base R in the datasets package.
To get started with ANOVA, we install and load the dplyr package.
install.packages("dplyr")
library(dplyr)
boxplot(mtcars$disp~factor(mtcars$gear))
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)
Output:
The box plot shows the distribution of displacement for each value of gear (the line
inside each box is the median). Here the categorical variable is gear, on which the
factor function is used, and the continuous variable is disp.
The summary shows that the gear attribute is highly significant for displacement (three
stars denote this). Since the p-value is less than 0.05, we reject the null hypothesis
and conclude that the mean displacement differs across the gear groups.
install.packages("dplyr")
library(dplyr)
summary(mtcars_aov2)
Output:
The box plot shows the distribution of displacement for each group (the line inside each
box is the median). Here the categorical variables are gear and am, on which the factor
function is used, and the continuous variable is disp.
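A minimal sketch of a two-way ANOVA with gear and am as factors is given below. The additive model form (gear + am) is an assumption; whether an interaction term was intended is not shown above. Only base R is needed, since mtcars and aov() ship with it.

```r
# Two-way ANOVA sketch on the built-in mtcars data.
# Assumed additive model: disp explained by gear and am, without interaction.
mtcars_aov2 <- aov(disp ~ factor(gear) + factor(am), data = mtcars)

print(summary(mtcars_aov2))
```

Each row of the summary tests whether the mean displacement differs across the levels of that factor, after accounting for the factor listed before it.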
Results