Project Sample Solution
The results of the tests, with distances measured to the nearest yard, are
contained in the data set “Golf”. Prepare a Managerial Report:
1. Formulate and present the rationale for a hypothesis test that Par Inc.
could use to compare the driving distances of the current and new golf balls.
2. Analyze the data to provide the hypothesis testing conclusion. What is the
p-value for your test? What is your recommendation for Par Inc.?
3. Provide descriptive statistical summaries of the data for each model
4. What is the 95% confidence interval for the population mean of each
model, and what is the 95% confidence interval for the difference
between the means of the two populations?
5. Do you see a need for larger sample sizes and more testing with the golf
balls? Discuss.
2 Assumptions
The sample size is 40 (> 30) for each model of golf ball. The Central Limit
Theorem states that, irrespective of the shape of the original population,
the sampling distribution of the mean approaches a normal distribution as
the sample size increases; n > 30 is the usual rule of thumb.
We also assume that the samples are representative of the populations they
are drawn from.
Feature Code   Minimum Value   Maximum Value   Average Value
Current        255 yards       289 yards       270.3 yards
New            250 yards       289 yards       267.5 yards
In this step, the features are explored in detail. The goal is to describe or
summarize data in ways that are meaningful and useful for insights
generation. It provides simple summaries about the sample and the
measures. Together with simple graphics analysis, it forms the basis of
virtually every quantitative analysis of data.
Given that both features, ‘Current’ and ‘New’, are continuous in nature, the
following measures are relevant for understanding the central tendency and
spread of each variable.
3.2.2 Measures of Dispersion:
Measures of Dispersion       Current   New
Range                        34.0      39.0
1st Quartile                 263.0     262.0
3rd Quartile                 275.2     274.5
Inter Quartile Range (IQR)   12.2      12.5
Variance                     76.6      97.9
Standard Deviation           8.8       9.9
The values for ‘Range’, ‘1st Quartile’ and ‘3rd Quartile’ have been derived
from the output of the R function ‘summary’ (the output is provided in the
section 9.1 Measures of Central Tendency); the range is the difference
between the maximum and the minimum.
The ‘Inter Quartile Range (IQR)’ has been computed as the difference between
the 3rd Quartile (75th percentile) and the 1st Quartile (25th percentile).
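The dispersion measures above can be recomputed directly. The following is a minimal sketch, assuming the 40 observations per model are rebuilt from the frequency tables printed by table(Current) and table(New) in the appendix (the original golf.csv is not reproduced here, so the row order is lost, which does not affect these summaries). The name `dispersion` is illustrative and not part of the original code.

```r
# Rebuild the samples from the frequency tables shown in the appendix.
Current <- rep(
  c(255, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 270, 272,
    273, 274, 275, 276, 278, 279, 280, 281, 283, 284, 287, 289),
  times = c(1, 2, 1, 2, 1, 1, 3, 2, 1, 2, 2, 1, 2, 3,
            1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1))
New <- rep(
  c(250, 251, 253, 255, 259, 260, 261, 262, 263, 264, 266, 268, 269, 270,
    271, 272, 274, 276, 277, 278, 279, 280, 281, 283, 286, 289),
  times = c(2, 1, 1, 1, 1, 2, 1, 4, 4, 3, 2, 1, 2, 1,
            1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1))

# Hypothetical helper: collects the dispersion measures for one sample.
dispersion <- function(x) {
  c(Range    = diff(range(x)),          # maximum minus minimum
    Q1       = unname(quantile(x, 0.25)),
    Q3       = unname(quantile(x, 0.75)),
    IQR      = IQR(x),                  # Q3 minus Q1
    Variance = var(x),                  # sample variance (n - 1 denominator)
    SD       = sd(x))
}

# One column per model, rounded to one decimal place.
round(cbind(Current = dispersion(Current), New = dispersion(New)), 1)
```

Note that R's default quantile method gives a 3rd quartile of 275.25 for ‘Current’, which summary displays as 275.2.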
3.3 Data Visualization – Histogram and Boxplot:
• Null Hypothesis
o It is the hypothesis that there is no statistically significant
difference between the two populations being compared. The null
hypothesis is formulated such that rejecting it provides evidence in
favor of the alternative hypothesis
• Alternative Hypothesis
o It is the hypothesis that there is a statistically significant
difference between the two populations. The alternative hypothesis is
the hypothesis, contrary to the null hypothesis, for which the test
seeks evidence
In the Par Inc. case, management would like to produce the new golf balls
provided their driving distance is comparable to that of the current golf balls.
A sample of 40 balls of each model was tested with a mechanical hitting machine
so that any difference between the mean distances for the two models could be
attributed to a difference in the two models. Par Inc. can therefore use a
two-sample hypothesis test to compare the mean driving distances of the current
and new golf balls. The Null Hypothesis & Alternative Hypothesis are formulated
as follows:
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 ≠ 0
Where,
μ1 = population mean driving distance of the current golf balls
μ2 = population mean driving distance of the new golf balls
Under the null hypothesis formulated above, we assume that the mean driving
distances of the current and new golf balls do not differ. The test setup is
as follows:
• One machine
• Two populations
• No other influences considered
• Independently chosen
t.test(Current,New)
##
## Welch Two Sample t-test
##
## data: Current and New
## t = 1.3284, df = 76.852, p-value = 0.188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.384937 6.934937
## sample estimates:
## mean of x mean of y
## 270.275 267.500
R's t.test reports the two-tailed p-value by default, so the p-value for this
test is 0.188; it does not need to be halved.
The p-value (0.188) is greater than the level of significance α (0.05).
Therefore, the Null Hypothesis (H0) is not rejected.
The conclusion is that the data do not provide statistical evidence that the
new golf balls have either a lower or a higher mean driving distance than the
current golf balls. On this evidence, Par Inc. can take the new golf balls into
production, as the test shows no significant difference between the estimated
population means of the current and new golf balls.
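The Welch statistic, its degrees of freedom, and the two-tailed p-value can be reproduced by hand, which also shows how the one-tailed and two-tailed p-values relate. This is a sketch under the assumption that the samples are rebuilt from the frequency tables in the appendix; the variable names are illustrative.

```r
# Rebuild the samples from the frequency tables shown in the appendix.
Current <- rep(
  c(255, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 270, 272,
    273, 274, 275, 276, 278, 279, 280, 281, 283, 284, 287, 289),
  times = c(1, 2, 1, 2, 1, 1, 3, 2, 1, 2, 2, 1, 2, 3,
            1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1))
New <- rep(
  c(250, 251, 253, 255, 259, 260, 261, 262, 263, 264, 266, 268, 269, 270,
    271, 272, 274, 276, 277, 278, 279, 280, 281, 283, 286, 289),
  times = c(2, 1, 1, 1, 1, 2, 1, 4, 4, 3, 2, 1, 2, 1,
            1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1))

n1 <- length(Current)
n2 <- length(New)
v1 <- var(Current) / n1   # squared standard error of the Current mean
v2 <- var(New) / n2       # squared standard error of the New mean

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat <- (mean(Current) - mean(New)) / sqrt(v1 + v2)
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-tailed p-value (what t.test prints) and the one-tailed value (half of it).
p_two_sided <- 2 * pt(-abs(t_stat), df)
p_one_sided <- pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p_two_sided = p_two_sided,
        p_one_sided = p_one_sided), 4)
```

The two-tailed value is the one to compare against α for the hypotheses in this report.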
t.test(Current)
##
## One Sample t-test
##
## data: Current
## t = 195.29, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 267.4757 273.0743
## sample estimates:
## mean of x
## 270.275
Inference: The 95% confidence interval for the population mean of the Current
model is 267.4757 to 273.0743 yards. This implies that, with 95% confidence,
the population mean driving distance of the current balls lies within this
range.
t.test(New)
##
## One Sample t-test
##
## data: New
## t = 170.94, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 264.3348 270.6652
## sample estimates:
## mean of x
## 267.5
Inference: The 95% confidence interval for the population mean of the New model
is 264.3348 to 270.6652 yards. This implies that, with 95% confidence, the
population mean driving distance of the new balls lies within this range.
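Both one-sample intervals follow the usual form mean ± t × (sd / √n) with n − 1 = 39 degrees of freedom. A minimal sketch, assuming the samples are rebuilt from the frequency tables in the appendix; `ci_mean` is a hypothetical helper, not part of the original code.

```r
# Rebuild the samples from the frequency tables shown in the appendix.
Current <- rep(
  c(255, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 270, 272,
    273, 274, 275, 276, 278, 279, 280, 281, 283, 284, 287, 289),
  times = c(1, 2, 1, 2, 1, 1, 3, 2, 1, 2, 2, 1, 2, 3,
            1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1))
New <- rep(
  c(250, 251, 253, 255, 259, 260, 261, 262, 263, 264, 266, 268, 269, 270,
    271, 272, 274, 276, 277, 278, 279, 280, 281, 283, 286, 289),
  times = c(2, 1, 1, 1, 1, 2, 1, 4, 4, 3, 2, 1, 2, 1,
            1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1))

# Hypothetical helper: t-based confidence interval for a population mean.
ci_mean <- function(x, level = 0.95) {
  n <- length(x)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # critical t value
  half_width <- t_crit * sd(x) / sqrt(n)         # margin of error
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

ci_current <- ci_mean(Current)
ci_new     <- ci_mean(New)
round(rbind(Current = ci_current, New = ci_new), 4)
```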
95% confidence interval for the difference between the means of the
two populations:
t.test(Current,New)
##
## Welch Two Sample t-test
##
## data: Current and New
## t = 1.3284, df = 76.852, p-value = 0.188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.384937 6.934937
## sample estimates:
## mean of x mean of y
## 270.275 267.500
• Execute the power T Test with the current parameters, and decide whether a
larger sample size is needed.
• Calculate the required sample size (in case the Power of Test is insufficient)
#
# Calculation to see need of Larger Sample Size:
# Power of the test
delta=mean(Current)-mean(New)
pooledSD <- (((40-1)*(8.75^2)+(40-1)*(9.9^2))/(40+40-2))^0.5
delta
## [1] 2.775
pooledSD
## [1] 9.342711
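The pooled standard deviation above weights each sample variance by its degrees of freedom. With equal group sizes this reduces to the square root of the average of the two variances. A short sketch using the same rounded standard deviations (8.75 and 9.9) as the code above; the function name `pooled_sd` is illustrative.

```r
# Hypothetical helper: pooled standard deviation of two independent samples.
pooled_sd <- function(s1, s2, n1, n2) {
  sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
}

# General formula with the values used in the report.
sp_general <- pooled_sd(s1 = 8.75, s2 = 9.9, n1 = 40, n2 = 40)

# Equal-n shortcut: square root of the mean of the two variances.
sp_equal_n <- sqrt((8.75^2 + 9.9^2) / 2)

c(general = sp_general, equal_n = sp_equal_n)
```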
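Power T Test:
power.t.test(n = 40, delta = 2.775, sd = 9.342, sig.level = 0.05,
type = "two.sample", alternative = "two.sided")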
##
## Two-sample t test power calculation
##
## n = 40
## delta = 2.775
## sd = 9.342
## sig.level = 0.05
## power = 0.258536
## alternative = two.sided
##
## NOTE: n is number in *each* group
Inference: The power of the test is 0.258, or 25.8%, which means there is only
about a 26% chance that the null hypothesis will be rejected when it is false
(for a true difference of 2.775 yards).
Hence, we should revisit the sample size to increase the power of the test.
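Power can be read as a long-run rejection rate: if the true difference really were 2.775 yards, it is the share of repeated experiments in which the test would reject the null hypothesis. The sketch below checks this by simulation, assuming normally distributed distances with a common standard deviation of 9.342 (the same assumptions power.t.test makes); the seed and the number of replications are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n         <- 40      # balls per model
delta     <- 2.775   # assumed true difference in means (yards)
sd_pooled <- 9.342   # assumed common standard deviation (yards)
alpha     <- 0.05    # significance level

# Simulate many experiments in which the null hypothesis is false,
# and record whether each one rejects it.
reject <- replicate(5000, {
  x <- rnorm(n, mean = delta, sd = sd_pooled)
  y <- rnorm(n, mean = 0,     sd = sd_pooled)
  t.test(x, y, var.equal = TRUE)$p.value < alpha
})

simulated_power <- mean(reject)
analytic_power  <- power.t.test(n = n, delta = delta, sd = sd_pooled,
                                sig.level = alpha)$power

c(simulated = simulated_power, analytic = analytic_power)
```

The two figures should agree to within simulation error, and both indicate that roughly three out of four such experiments would miss a real 2.775-yard difference.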
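Consider a power of 95% and a significance level of 0.188 (the p-value
calculated above), and execute the Power T Test once again. Note that using
the observed p-value as the significance level is unconventional; a stricter
level such as α = 0.05 would require a larger sample.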
#
# Sample size required
#
power.t.test(power=0.95, delta = 2.775,
sd=9.342,sig.level = 0.188,type = "two.sample",
alternative = "two.sided" )
##
## Two-sample t test power calculation
##
## n = 199.2145
## delta = 2.775
## sd = 9.342
## sig.level = 0.188
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
Inference: We need a sample size of 200 balls per model (rounded up) to achieve
95% power of test at this significance level.
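For comparison, the same calculation can be run at the conventional significance level. This is a sketch, not part of the original analysis: it keeps delta and sd as in the report and changes only sig.level to 0.05, an assumption about what a stricter design would use.

```r
# Sample size per group at the significance level used in the report.
res_report <- power.t.test(power = 0.95, delta = 2.775, sd = 9.342,
                           sig.level = 0.188, type = "two.sample",
                           alternative = "two.sided")

# Same design at the conventional 5% level (assumption, not in the report).
res_conventional <- power.t.test(power = 0.95, delta = 2.775, sd = 9.342,
                                 sig.level = 0.05, type = "two.sample",
                                 alternative = "two.sided")

# Round up, since a fraction of a golf ball cannot be tested.
n_report       <- ceiling(res_report$n)
n_conventional <- ceiling(res_conventional$n)

c(n_at_0.188 = n_report, n_at_0.05 = n_conventional)
```

The stricter significance level requires more balls per model than the 200 reported above, which strengthens the case for further testing.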
5 Appendix – Source Code
#=======================================================================
#
# New Golf Ball Design - Should be Launched or not
#
#=======================================================================
# Environment Set up and Data Import
#=======================================================================
# Install Packages
#=======================================================================
#
# Setup Working Directory
setwd("D:/00 Great Lakes/BACP.Aug17 FBS")
getwd()
#
# Read Input File
golf_data=read.csv("golf.csv")
attach(golf_data)
#
# Check Dimensions of the Dataset
dim(golf_data)
## [1] 40 2
#
# Check top 6 and bottom 6 Rows of the Dataset
head(golf_data)
tail(golf_data)
#
#Check for Missing Values
colSums(is.na(golf_data))
## Current New
## 0 0
5.2 Descriptive Statistics
#
# Provide Summary of a Dataset.
summary(golf_data)
## Current New
## Min. :255.0 Min. :250.0
## 1st Qu.:263.0 1st Qu.:262.0
## Median :270.0 Median :265.0
## Mean :270.3 Mean :267.5
## 3rd Qu.:275.2 3rd Qu.:274.5
## Max. :289.0 Max. :289.0
#
# Check all values of a Feature with its frequencies.
table(Current)
## Current
## 255 258 259 260 261 262 263 264 265 266 267 268 270 272 273 274 275 276
## 1 2 1 2 1 1 3 2 1 2 2 1 2 3 1 2 3 1
## 278 279 280 281 283 284 287 289
## 1 1 1 1 2 1 1 1
table(New)
## New
## 250 251 253 255 259 260 261 262 263 264 266 268 269 270 271 272 274 276
## 2 1 1 1 1 2 1 4 4 3 2 1 2 1 1 1 2 1
## 277 278 279 280 281 283 286 289
## 1 1 1 1 2 1 1 1
#
#=======================================================================
# Feature Exploration
#=======================================================================
# Data Visualization using Graphs:
# Histogram for Continuous Variables
# Box Plot to see the Outliers in Continuous Variables
#=======================================================================
#
dev.off() # To Reset the earlier partition command.
## null device
## 1
par(mfrow=c(2,2))
hist(Current,main='Current Ball',xlab = "Driving Distance",ylab =
"Frequency",col = "turquoise")
hist(New,main='New Ball',xlab = "Driving Distance",
ylab = "Frequency",col = "turquoise1")
#
# A horizontal boxplot has no frequency axis, so only the distance axis is labelled.
boxplot(Current,main='Current Ball',xlab = "Driving Distance",
col = "turquoise",horizontal = TRUE)
boxplot(New,main='New Ball',xlab = "Driving Distance",
col = "turquoise1",horizontal = TRUE)
#
#=======================================================================
# Mean, Standard Deviation and Variance
#=======================================================================
# Normal mean(),sd(),var() Functions are used.
# round() function is used with parameter digits,
# to display the results up to two decimal places
#=======================================================================
#
Col_Head <- c("Mean","Standard Deviation","Variance","Remark")
Current_Stats <- c(round(mean(Current),digits = 2),
round(sd(Current),digits = 2),
round(var(Current),digits = 2),
"Current Ball ")
New_Stats <- c(round(mean(New),digits = 2),
round(sd(New),digits = 2),
round(var(New),digits = 2),
"New Ball Stats")
#
Combined_Stats <- rbind(Col_Head,Current_Stats,New_Stats)
Combined_Stats
#
# Welch Two Sample t test
# Also provides 95% Confidence Interval for the difference in means.
t.test(Current,New)
##
## Welch Two Sample t-test
##
## data: Current and New
## t = 1.3284, df = 76.852, p-value = 0.188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.384937 6.934937
## sample estimates:
## mean of x mean of y
## 270.275 267.500
#
# One Sample t test
# Provides the 95% Confidence Interval for the population mean.
t.test(Current)
##
## One Sample t-test
##
## data: Current
## t = 195.29, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 267.4757 273.0743
## sample estimates:
## mean of x
## 270.275
t.test(New)
##
## One Sample t-test
##
## data: New
## t = 170.94, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 264.3348 270.6652
## sample estimates:
## mean of x
## 267.5
#
# Calculation to see need of Larger Sample Size:
# Power of the test
delta=mean(Current)-mean(New)
pooledSD <- (((40-1)*(8.75^2)+(40-1)*(9.9^2))/(40+40-2))^0.5
delta
## [1] 2.775
pooledSD
## [1] 9.342711
power.t.test(n = 40, delta = 2.775, sd = 9.342, sig.level = 0.05,
type = "two.sample", alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 40
## delta = 2.775
## sd = 9.342
## sig.level = 0.05
## power = 0.258536
## alternative = two.sided
##
## NOTE: n is number in *each* group
#
# Sample size required
#
power.t.test(power=0.95, delta = 2.775,
sd=9.342,sig.level = 0.188,type = "two.sample",
alternative = "two.sided" )
##
## Two-sample t test power calculation
##
## n = 199.2145
## delta = 2.775
## sd = 9.342
## sig.level = 0.188
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
#
#=======================================================================
#
# T H E - E N D
#
#=======================================================================