Business Statistics 4-6
SEGMENTS 4-6
Segment: Sampling and Estimation
Topic: Random Sampling Methods
Random Sampling Methods
©COPYRIGHT 2019, ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION
Introduction
When a population is large, it may be difficult or impossible to measure every item in the
population and calculate parameters such as the population mean or the population proportion.
In these cases, we must resort to sampling.
This topic discusses some of the concepts in sampling and methods of selecting random samples.
Learning Objectives
At the end of this topic, you will be able to:
• identify the need for sampling methods
• distinguish the different types of random sampling methods
• recognise the significance of each sampling method.
1. Concepts in Sampling
We quite often have to make inferences about some characteristic of a population of interest.
For example, we may be interested in determining the average weekly expenditure on
entertainment per household for a given area. Another example might be if we were interested
in determining the proportion of a given population that watches a particular television program.
In each of these cases, it may be difficult or impossible to contact each member of the
population. In these cases, we need to identify a sample of the population and then obtain
information from that sample. The sample is a subset of the population that is representative of
the characteristics of the population.
Besides the difficulty in determining the population of interest, other important reasons why we
might prefer to take a sample include issues of time and cost.
The population is the set of all members of the group about which we would like to draw
inferences. Before we can take the sample, we first need a list of all members of the population.
This list is called the sampling frame.
Our sample should be representative of our population. One way to ensure this is to take a
probability sample. A properly designed sampling experiment should ensure that there is no
sampling bias and that our sample is representative of the population of interest.
An excellent example of a biased sample is that of the Literary Digest and the 1936 US
presidential election polls.
The Literary Digest held a poll that forecast that Alfred Landon would defeat Franklin
Roosevelt by 57% to 43%. The result of the election was that Roosevelt won by a landslide
victory, getting about 62% of the vote.
The problem with the poll was that they had used lists of telephone and automobile owners
to select their sample. In those days, these were luxuries, so their sample consisted mainly of
middle and upper-class citizens. The majority of this group voted for Landon, but the lower
classes voted for Roosevelt.
Because their sample was biased towards wealthier citizens, their result was incorrect.
Source: Bowerman, B., R. O'Connell and M. Hand. Business Statistics in Practice. Boston: McGraw-Hill/Irwin, 2001.
In the following sections, we will look at a number of different random sampling techniques
that are commonly used.
1. One method is to give each member of the population a number and then choose the
sample, of size n, by random number tables or by using software packages.
This graphic shows a population made of 10 people, out of which persons 2, 5, 7, and 10 are
randomly selected. This represents simple random sampling.
2. The second method is to choose every kth item from the population frame (for example, every
10th item), usually starting from a randomly chosen position. This is also referred to as
systematic sampling.
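The two methods above can be sketched in a few lines of Python. This is only an illustration: the function names, the seed, and the ten-person population are assumptions made for the example, not part of the original material.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the frame, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, k, seed=None):
    """Take every k-th member, starting from a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return frame[start::k]

population = list(range(1, 11))  # ten people, numbered 1 to 10

srs = simple_random_sample(population, 4, seed=1)
every_fifth = systematic_sample(population, 5, seed=1)

print("simple random sample:", sorted(srs))
print("systematic sample:   ", every_fifth)
```

Fixing the seed only makes the illustration repeatable; in a real survey the selection would not be seeded by hand.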
When we are sampling, we would like to extract as much information from the sample as
possible. One method that can achieve this is stratified random sampling.
Stratified random sampling is a method where the population is separated into sub-populations
called strata. Then instead of taking a random sample from the entire population, we take a
simple random sample from each of the strata.
The aim when selecting strata is to try and ensure that there is as much variation as possible
between the strata but as little as possible variation within the strata.
3. Cluster Sampling
Another method of random sampling is cluster sampling. In this method, we separate the
population into clusters and then take simple random samples from a number of these clusters.
The clusters are usually based on geographic dimensions such as cities or suburbs.
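As a rough sketch of the difference between the two designs, the Python below draws a simple random sample from every stratum, but only from a randomly chosen subset of the clusters. The strata, suburbs, and sample sizes are hypothetical values chosen for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly choose some clusters, then sample only within those clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical strata (by income band) and clusters (by suburb); members are ID numbers.
strata = {
    "low income": list(range(0, 50)),
    "middle income": list(range(50, 80)),
    "high income": list(range(80, 100)),
}
clusters = {
    "suburb A": list(range(0, 25)),
    "suburb B": list(range(25, 50)),
    "suburb C": list(range(50, 75)),
    "suburb D": list(range(75, 100)),
}

by_stratum = stratified_sample(strata, n_per_stratum=5, seed=7)
by_cluster = cluster_sample(clusters, n_clusters=2, n_per_cluster=3, seed=7)

print(by_stratum)
print(by_cluster)
```

Every stratum is represented in the stratified sample, whereas the cluster sample reaches only the selected suburbs, which is what makes it cheaper to carry out in practice.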
Below is an exercise to identify the three main types of probability sampling techniques.
5. Summary
• Samples are randomly selected from populations for the purpose of drawing inferences
about population parameters.
• Simple random sampling is a sampling method where each member of the population
has an equal probability of being chosen.
• Stratified sampling is a sampling method where the population is separated into strata
from which simple random samples are taken.
• Cluster sampling is a sampling method where the population is separated into clusters
from which simple random samples are drawn.
6. Glossary
Sampling: The process of selecting a subset of the population with a view to
studying the characteristics of that population.
Population: The complete set in which we are interested.
Random sample: A sample where every item in the population is given an equal chance
of being selected.
7. Answers
Exercise: Sampling Methods
In simple random sampling, members are selected randomly from a population. In cluster
sampling, heterogeneous groups are created and members are selected randomly from a number
of these clusters. Finally, in stratified random sampling, homogeneous groups are created and
then members are randomly selected from each stratum.
Segment: Sampling and Estimation
Topic: Sampling Distributions
Sampling Distributions
Introduction
In many business situations, we would like to estimate some population parameter of interest.
To do this, we can take a sample from the population and try and estimate the population
parameter from the sample.
In order to understand the possible outcomes, when we take our sample, we first need to
understand the concept of the sampling distribution. This concept is crucial in later situations
when we try to make inferences about the population.
In this topic, we shall investigate the concept of the sampling distribution of the sample mean.
Learning Objectives
At the end of this topic, you will be able to:
• recognise the need for the sampling distribution of the sample mean
• comprehend the central limit theorem and its practical usage.
Suppose we have:
1. A population of size N
2. With a mean, µ
3. A standard deviation, σ
We draw samples of size n from this population and calculate sample statistics such as the
sample mean, $\bar{x}$.
But, how accurate is the sample mean compared to the population mean, µ?
The sampling distribution of the sample mean is the key to understand this problem.
Understanding the concept of sampling distribution is the conceptual bridge between the
probability distribution of a population and statistical inference based on sample data. If we have
knowledge of the sampling distribution, we can answer questions about the accuracy of the
result from our sample data.
Assume for a moment that we are sampling from a population with a known µ and σ. We take a
sample of a certain size n from this population and calculate the sample mean, $\bar{x}$. Imagine that
we continue to take samples of size n from this population and calculate the sample mean. If we
then take each of these sample means and construct a histogram, it will show the probability
distribution of all possible values of $\bar{x}$. In other words, it is the sampling distribution of the
sample mean.
1. The mean of the sampling distribution of the sample mean in each case is equal to the
population mean from which we have sampled. That is
$$\mu_{\bar{x}} = \mu$$
2. The standard deviation of the sampling distribution in each case is equal to the population
standard deviation divided by the square root of the sample size. That is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
3. If the population is normal, then the sampling distribution of the sample mean is also
normal.
It is extremely important to note the difference between the population standard deviation and
the standard deviation of the sample mean. Understanding this difference is the key to
understanding the concepts involved in statistical inference.
The standard deviation of the sample mean is also known as the standard error. In order to avoid
confusion, we will refer to the standard deviation of the sample mean as the standard error in
the following topics.
But what if the population is not normal? A remarkable result called the central limit theorem
addresses this case. It is arguably one of the most important theorems in the development of
statistical inference.
The central limit theorem states that even if the population from which we are sampling is not
normal, as long as our sample size is sufficiently large, the sampling distribution of the sample
mean will be approximately normal.
The larger the sample size, the more closely the sampling distribution will resemble a normal
distribution.
In practice, a sample size of 30 or more is usually sufficient to ensure that our sampling
distribution will be approximately normal.
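A small simulation can make the theorem more concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a strongly right-skewed shape chosen purely for illustration), draws many samples of size 30, and compares the spread of the sample means with σ divided by the square root of n.

```python
import random
import statistics

def simulate_sample_means(n, replications, seed=0):
    """Return the means of `replications` samples of size n from a right-skewed population."""
    rng = random.Random(seed)
    # Exponential population with rate 1: mean 1, standard deviation 1.
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(replications)
    ]

means = simulate_sample_means(n=30, replications=5000, seed=42)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_se = 1.0 / 30 ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) is {theoretical_se:.3f})")
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is far from normal.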
Below is an exercise to practice what you have learnt about the central limit theorem.
A researcher is trying to estimate the salaries of MBA graduates two years after graduation.
Suppose that, unknown to the researcher, the distribution of the salaries of these graduates
is right-skewed with a mean of US$100,000 and a standard deviation of US$60,000.
Question 1: If the researcher sampled 30 students, which of the following would best
represent the sampling distribution of the sample mean:
1. Approximately normal with a mean of US$100,000 and a standard deviation of US$60,000
2. A T-distribution with a mean of US$100,000 and a standard deviation of US$10,954
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$10,954
4. A T-distribution with a mean of US$100,000 and a standard deviation of US$60,000
Question 2: Now, if the researcher sampled 100 students, which of the following would best
represent the sampling distribution of the sample mean?
1. Approximately normal with a mean of US$100,000 and a standard deviation of US$10,954
2. Approximately normal with a mean of US$60,000 and a standard deviation of US$6,000
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$6,000
4. Approximately normal with a mean of US$60,000 and a standard deviation of US$10,954
3. Summary
• The sampling distribution of the sample mean describes the probability distribution for
the possible values of our sample mean.
• If our population is normal then our sampling distribution will also be normal, regardless
of the sample size, with a mean equal to the population mean.
• The standard deviation of the sample mean is also known as the standard error.
• The central limit theorem states that regardless of the shape of the population, as long
as our sample size is sufficiently large, the sampling distribution of the sample mean will
be approximately normal.
4. Glossary
Sample: A set of items selected from a population. Most of the time we are
interested in random samples, where the selection of items is
carried out giving an equal chance of selection to every item in the
population.
Standard deviation: The standard deviation of a data set is the positive square root of
the variance.
Population: The complete set in which we are interested.
Probability: The probability of an event is the likelihood of occurrence of that
event.
5. Answers
Exercise: Central Limit Theorem
Question 1:
The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$10,954.
Since the sample size is at least 30, the central limit theorem says that the sampling
distribution of the sample mean will be approximately normal with a mean of US$100,000 and
a standard deviation of US$10,954 (given by sigma divided by the square root of n).
Question 2:
The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$6,000.
Since the sample size is greater than 30, the central limit theorem says that the sampling
distribution of the sample mean will be approximately normal with a mean of US$100,000 and
a standard deviation of US$6,000 (given by sigma divided by the square root of n).
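The two standard errors quoted in these answers can be checked with a short calculation. The function name below is an assumption made for this sketch; the inputs are the figures given in the exercise.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

se_30 = standard_error(60_000, 30)    # Question 1: n = 30
se_100 = standard_error(60_000, 100)  # Question 2: n = 100

print(f"n = 30:  standard error = US${se_30:,.0f}")
print(f"n = 100: standard error = US${se_100:,.0f}")
```

Going from 30 to 100 graduates shrinks the standard error, but only in proportion to the square root of the sample size.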
Segment: Sampling and Estimation
Topic: Confidence Interval for a Mean
Confidence Interval for a Mean
Introduction
In many situations, we need to make inferences about population parameters of interest. One
inference that we may be interested in making is an estimate of the population mean. To do this,
we need to make use of our sampling distribution.
In this topic, we discuss estimation and develop the concept of the interval estimate for a
population mean.
Learning Objectives
At the end of this topic, you will be able to:
• describe the two estimation methods
• use the T-distribution to identify the best estimate when the population standard deviation
is unknown
• illustrate how to determine the sample size.
1. Estimation Concepts
The aim of estimation is to determine the value of a population parameter of interest. For
example, the sample mean is an estimate of the true population mean.
There are two types of estimate:
1. Point estimate
2. Interval estimate
Point estimates
A point estimate is our 'best guess' of the true population parameter. Suppose we are trying
to estimate the average income of students at a particular university, we could do this by
taking a sample of a certain size and calculating the sample mean. This sample mean is a point
estimate and is our best guess of the true value. While this estimate might be suitable, it does
not take into account all the information in the sample we have collected.
For example, one other sample statistic we could calculate is the sample standard deviation.
In addition, we have no indication of how accurate our 'estimate' is compared to the true (and
unknown) population mean.
Interval estimate
An alternative approach is the interval estimate. In this method, we specify an interval over
which we have a degree of confidence that the true parameter lies. To do this, we need to
make use of our sampling distribution.
Recall that the standard error of the sample mean is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Given that the sampling distribution is approximately normal, we can state that there is
approximately a 68% chance that our sample mean will fall in the range
$$\mu \pm 1\,\sigma_{\bar{x}}$$
Likewise, there is approximately a 95% chance that the sample mean will fall between
$$\mu \pm 2\,\sigma_{\bar{x}}$$
In general, for Z standard errors,
$$\bar{x} = \mu \pm Z\,\sigma_{\bar{x}}$$
This equation can then be rearranged in order to obtain the general form of the interval estimate
or confidence interval as shown below:
$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}}$$
Then, to obtain the confidence interval for the population mean, all we need to do is specify the
confidence level and we can derive the confidence interval. Typical confidence levels are 90%,
95%, or 99%, with 95% being the most commonly used.
For example, suppose we are interested in determining the 95% confidence interval for the
average amount that inner-city workers in a particular city spend on coffee each week. Assume
that we know that the population standard deviation is US$15. If we take a
sample of 100 people and find that the average of this sample is US$20, we can then calculate
the confidence interval of interest.
$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}} = 20 \pm 1.96 \times \frac{15}{\sqrt{100}} = 20 \pm 2.94$$
In other words, we are 95% confident that the interval US$20 ± US$2.94 created contains the
true value of the population mean for the average weekly spending on coffee.
It should also be noted that in the example discussed, there is a 5% chance that the interval we
have created will not contain the true population parameter.
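The coffee example can be reproduced with Python's standard library. This is a minimal sketch assuming the population standard deviation is known; the function name is made up for the illustration, and `NormalDist().inv_cdf` supplies the Z-value instead of a printed table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

low, high = z_confidence_interval(xbar=20, sigma=15, n=100, confidence=0.95)
print(f"95% confidence interval: US${low:.2f} to US${high:.2f}")
```

Raising `confidence` to 0.99 widens the interval, since a larger Z-value is needed to capture the true mean more often.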
When the population standard deviation is unknown, the next best option is to use the sample
standard deviation, s, calculated from sample data. However, in doing this we introduce
additional variability and we can no longer guarantee that our sampling distribution will be normal.
Instead, the sampling distribution follows a T-distribution. The T-distribution is similar to the
normal distribution but has the following properties:
1. It has a mean of 0.
2. It has a standard deviation that varies with the sample size. The shape is specified by the
degrees of freedom, which is the sample size minus one.
Figure: Three different T-distributions plotted from -4 to +4 on the x-axis. At x = 0 the green
curve peaks at about 0.3, the red curve at about 0.35 and the blue curve at about 0.4.
In using the sample standard deviation, the general form of the confidence interval becomes
$$\mu = \bar{x} \pm t\,s_{\bar{x}}$$
Where
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
The T-value indicates the number of sample standard errors by which the sample mean differs
from the population mean. If we specify the confidence level, we can then find an appropriate
T-value for our confidence interval equation.
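A minimal sketch of the calculation is given below. The sample figures (a mean of 50, a sample standard deviation of 10 and a sample of 25) are hypothetical, and the T-value of 2.064 is the tabulated two-sided 95% value for 24 degrees of freedom, passed in by hand because Python's standard library has no T-distribution.

```python
import math

def t_confidence_interval(xbar, s, n, t_value):
    """Confidence interval for a mean using the sample standard deviation s.

    t_value must be looked up for n - 1 degrees of freedom at the chosen confidence level.
    """
    standard_error = s / math.sqrt(n)
    margin = t_value * standard_error
    return xbar - margin, xbar + margin

# Hypothetical sample: n = 25, so 24 degrees of freedom; t = 2.064 for 95% confidence.
low, high = t_confidence_interval(xbar=50.0, s=10.0, n=25, t_value=2.064)
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

With the same data, a Z-value of 1.96 would give a slightly narrower interval; the wider T-based interval reflects the extra uncertainty from estimating the standard deviation.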
One question that will arise is how big a sample should we take in the first place?
In most business situations, we would like to limit our uncertainty in our estimate of the
population mean.
If we let the maximum error we are willing to tolerate be represented by B, then it can be shown
that the required sample size is
$$n = \left(\frac{z\,\sigma}{B}\right)^2$$
Where
z is the number of standard errors associated with a given confidence level (1.96 if we
assume a 95% confidence interval)
σ is the population standard deviation.
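The sample size calculation can be sketched as follows, assuming the usual form in which n is the square of z multiplied by σ and divided by B, rounded up to the next whole number. The inputs reuse the coffee example's standard deviation of US$15 with a hypothetical tolerance of US$3.

```python
import math

def sample_size_for_mean(sigma, max_error, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) does not exceed max_error."""
    return math.ceil((z * sigma / max_error) ** 2)

n_required = sample_size_for_mean(sigma=15, max_error=3)
print(f"required sample size: {n_required}")

# Halving the tolerated error roughly quadruples the required sample size.
n_tighter = sample_size_for_mean(sigma=15, max_error=1.5)
print(f"with half the error:  {n_tighter}")
```

The result is always rounded up, since rounding down would leave the error slightly above the tolerance.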
Exercise
Below is an exercise to practice what you have learnt on confidence intervals.
Question 1: Given the above what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within US$20?
1. 158
2. 190
3. 200
4. 217
Question 2: Given the above what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within 5% of the true value?
1. 100
2. 105
3. 110
4. 139
Question 3: Given the above what is the sample size required in Question 1 if the standard
deviation in the price of two-bedroom units was twice as large (i.e., US$300)?
1. 550
2. 800
3. 825
4. 865
5. Summary
Here is a quick recap of what we have learnt so far:
6. Glossary
Estimation: The process of inferring the value of an unknown parameter using
sampling.
7. Answers
Exercise: Confidence Intervals
Segment: Sampling and Estimation
Topic: Confidence Interval for a Proportion
Confidence Interval for a Proportion
Introduction
When dealing with qualitative data, we can determine the proportion of times that a value of
interest occurs. The parameter of interest in these cases is the population proportion.
We might be interested in making inferences about the population proportion. This can be a
point estimate such as the sample proportion or an interval estimate as already discussed.
In this topic, we will discuss the concept of interval estimate for a proportion.
Learning Objectives
At the end of this topic, you will be able to:
• describe the sampling distribution for a population proportion
• determine the confidence interval for a population proportion
• determine the required sample size when estimating proportions.
The relationship between the population proportion and the sample proportion is shown in the
following figure.
This graphic shows the relationship between the population proportion and the sample
proportion. A big square represents the population, with an arrow pointing to a small oval shape
which represents the random sample.
For example, suppose that we were interested in launching a new product and decided to
conduct a survey to try and find out the proportion of consumers who would be interested in
buying our product. From the survey results, we can obtain the proportion of consumers in our
sample who said that they would buy our new product. This proportion is known as the sample
proportion. Our objective is to use this result to make an inference about the true population
proportion.
The sample proportion is a point estimate of the population proportion and is represented as $\hat{p}$.
Overall, the true population proportion is p, from which we take a sample of size n. The number
of successes in a sample of size n is x. Therefore, the proportion of successes in a sample is
$$\hat{p} = \frac{x}{n}$$
The number of successes (in this case the number of consumers who said that they would buy
the product) is a binomial random variable with
a mean of E(x) = np
a standard deviation of $\sqrt{np(1-p)}$
Rather than consider the number of successes, it is far easier to talk about the proportion of
successes. It can be shown that, as long as n is reasonably large with np > 5 and n(1 - p) > 5,
the mean of $\hat{p}$ is given by
$$E(\hat{p}) = p$$
The sampling distribution of the sample proportion is approximately normally distributed, with
a mean of p and a standard deviation given by
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
As for the case of mean, the standard deviation of the sample proportion is also known as the
standard error.
The confidence interval is based on a large sample size. As already discussed, the required
conditions are
$$np > 5 \quad \text{and} \quad n(1-p) > 5$$
If these conditions are satisfied, then the sampling distribution will be approximately normal,
and the confidence interval is given by
$$p = \hat{p} \pm Z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The appropriate Z-multiple will depend on the level of confidence we require. For the 95%
confidence interval, the Z-value will be 1.96.
As an example, suppose that we are conducting a survey of 400 households to find the
proportion of those that have purchased a high definition television. Assuming our survey found
that 60 of these have made such a purchase, estimate with 99% confidence the true proportion
of households in the population of interest that have made this purchase.
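The point estimate, in this case, is 60/400 or 0.15. The interval estimate or confidence interval
is given by
$$p = \hat{p} \pm Z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
We first check to see if the conditions for the normal approximation are satisfied as follows:
$$n\hat{p} = 400 \times 0.15 = 60 > 5 \qquad n(1-\hat{p}) = 400 \times 0.85 = 340 > 5$$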
Since we are after the 99% confidence interval, we need the Z-value such that the total area in
both tails is equal to 0.01. In other words, the area in each tail is 0.005. Using the NORMSINV
function in Microsoft Excel®, the corresponding Z-value can be found as
=NORMSINV(0.005) = -2.576
This value is negative since we have calculated the Z-value corresponding to the left-hand tail.
We are only interested in the magnitude of this number.
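Substituting into the confidence interval formula gives
$$p = 0.15 \pm 2.576\sqrt{\frac{0.15 \times 0.85}{400}} = 0.15 \pm 0.046$$
In other words, we are 99% confident that the true population proportion of households that
have purchased a high definition television lies between 10.4% and 19.6%.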
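The television example can be checked in Python. The sketch below uses `NormalDist().inv_cdf` from the standard library in place of Excel's NORMSINV; the function name and the error raised when the large-sample conditions fail are choices made for this illustration.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    # Conditions for the normal approximation described in the text.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(successes=60, n=400, confidence=0.99)
print(f"99% confidence interval: {low:.1%} to {high:.1%}")
```

Working with the upper-tail probability (0.995) returns the positive Z-value directly, so there is no need to drop the sign afterwards.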
As for the case with means, we need to decide on how big a sample to take when we are
estimating proportions. In order to do this, we again have to specify the maximum amount of
uncertainty that we are willing to tolerate in our estimate of the true population proportion.
Recall the formula for the confidence interval for a proportion. The right-hand side term after
point estimate is the uncertainty term. The uncertainty in our estimate of the population
proportion is given by
$$\pm Z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
If we let the maximum error we are willing to tolerate be denoted by B, then it can be shown
that the required sample size is given by
$$n = \frac{Z^2\,\hat{p}(1-\hat{p})}{B^2}$$
Where Z is the Z-value for the chosen confidence level, $\hat{p}$ is the sample proportion and B is
the maximum error we are willing to tolerate.
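In most cases, we do not have any knowledge of the true population proportion. In these cases,
it is best to use a value of 0.5, since p(1 - p) is largest at p = 0.5 and this therefore gives the
most conservative (largest) sample size.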
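The effect of the assumed proportion on the required sample size can be seen in a short sketch. The tolerance of 0.05 (five percentage points) and the function name are hypothetical values chosen for illustration.

```python
import math

def sample_size_for_proportion(p, max_error, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) does not exceed max_error."""
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)

n_conservative = sample_size_for_proportion(p=0.5, max_error=0.05)  # no prior knowledge of p
n_with_prior = sample_size_for_proportion(p=0.2, max_error=0.05)    # prior belief that p is 20%

print(f"p = 0.5: n = {n_conservative}")
print(f"p = 0.2: n = {n_with_prior}")
```

Using 0.5 never underestimates the sample size needed, at the cost of surveying more people than necessary when the true proportion is far from one half.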
Question: Based on the information above, if we believe that the population proportion is
20%, what sample size would be required to estimate the proportion of people in this city that
own a digital camera?
1. 453
2. 500
3. 633
4. 683
4. Summary
• When dealing with qualitative data, it is generally the population proportion that is of
interest.
• It can be shown that as long as n is large enough such that np and n(1 - p) are greater than or
equal to 5, then the sampling distribution of the sample proportion is approximately normal,
with a mean of p and a standard deviation given by the square root of p(1 - p)/n.
• As for means, confidence intervals can be developed for the population proportion.
• The required sample size can be determined based on certain assumptions.
5. Answers
Exercise: Sample Size for Proportions
The correct answer is option 4, 683.
The answer is calculated as:
$$n = \frac{z^2\,p(1-p)}{e^2}$$
Where z = 1.96 (for a 95% confidence interval), p is the assumed population proportion and e is
the maximum error. The result is rounded up.
Segment: Hypothesis Testing
Topic: Concepts in Hypothesis Testing
Concepts in Hypothesis Testing
Introduction
A hypothesis is a tentative explanation for an observation or phenomenon that has not yet been
verified. Hypothesis testing is the process of determining whether or not a given hypothesis is
consistent with observed facts.
This topic introduces the concepts involved with hypothesis testing and some of the issues
involved in testing the hypothesis.
Learning Objectives
At the end of this topic, you will be able to:
• define null and alternative hypotheses
• distinguish between one and two-tailed tests
• recognise the significance level and calculate the p-value
• examine Type II errors and the power of a test.
Example
For instance, a manufacturer of light bulbs might claim that the average life of their bulbs is
at least 8,000 hours. Based on the results of a sampling experiment conducted on the bulbs,
it is possible to calculate the probability of observing the result obtained (or more extreme)
from our sampling experiment if the claim under test is true. Depending on the results
obtained, the manager can then decide whether or not to reject the claim.
Before we can test the particular theory or hypothesis though, we need to understand the
basic concepts involved in hypothesis testing.
For the case of the claim made by the manufacturer of light bulbs about the average life of
the bulbs, we have:
H0: µ ≥ 8,000
H1: µ < 8,000
It could happen that someone else, such as a consumer advocate, could make a counterclaim
that µ < 8,000. If so, which claim do we take as the null hypothesis?
It is customary to formulate the null hypothesis such that if it were true, then no special action
would be necessary, and if the alternative is true, then some special action would be
necessary.
This approach would favour the H0 and H1 as outlined above. The burden of proof is usually
on the alternative hypothesis H1. It is up to the researcher to provide enough evidence in
support of the alternative; otherwise, we must continue to believe that the null hypothesis is
true.
In the above case, if there was enough evidence in favour of the alternative hypothesis (i.e., µ <
8,000), then some action, such as requiring the manufacturer to withdraw the claim and pay for
any damages caused, might be necessary.
Once the hypotheses are set up, it is easy to detect whether the test is one-tailed or two-tailed.
The real question is whether to set up a hypothesis for a particular problem as one-tailed or
two-tailed. There is no statistical answer to this question. It depends entirely on what we are
trying to prove.
Fig. 1: Outcomes of Decisions Made When the Null Hypothesis is Rejected or Accepted
As can be seen from this figure, if we reject a null hypothesis that was false then we have made
a correct decision. Likewise, if we fail to reject a null hypothesis that was true, then we have also
made the correct decision.
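However, we need to recognise that there are two types of decisions where we are making an
error:
1. Type I error: Rejecting a null hypothesis that is actually true
2. Type II error: Failing to reject a null hypothesis that is actually false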
For example, suppose a bank manager was interested in whether or not the mean waiting time
for customers had increased from its previous value. In this case, the null hypothesis might be
that the mean waiting time has not changed. If the manager sampled a group of customers and
performed a hypothesis test on the results, the possible errors that the manager could make
are:
1. Type I error: Concluding that the average waiting time had increased when it had not
2. Type II error: Concluding that the average waiting time was the same, when in fact it had
increased
Significance levels
The researcher must determine the maximum probability of a Type I error that he or she is
willing to tolerate. This value is denoted by α, the significance level of the test, and is most
commonly equal to 0.05, although α = 0.01 and α = 0.10 are also frequently used. Then, given
the value of α, we can use statistical theory to determine the rejection region. If the sample
evidence falls into this region, we reject the null hypothesis; otherwise, we do not reject it.
Sample evidence that falls into the rejection region is called statistically significant at the α
level.
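As a rough illustration of how α fixes the rejection region, the following Python sketch computes the critical value for a right-tailed Z-test at the three common significance levels. It assumes the test statistic is standard normal under the null hypothesis; the function name `right_tail_critical_value` is illustrative and does not come from the course material.

```python
from statistics import NormalDist


def right_tail_critical_value(alpha):
    """Return z such that P(Z > z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha)


# Reject H0 when the observed Z statistic exceeds the critical value.
critical_values = {
    alpha: right_tail_critical_value(alpha) for alpha in (0.10, 0.05, 0.01)
}
for alpha, z_crit in critical_values.items():
    print(f"alpha = {alpha:.2f}: reject H0 if Z > {z_crit:.3f}")
```

A smaller α pushes the critical value further into the tail, so stronger sample evidence is needed before the null hypothesis is rejected.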
p-value
An alternative approach avoids the use of the significance level and the rejection region and
instead simply reports how significant the sample evidence is. We can do this by calculating
the p-value of the hypothesis test. The p-value of a sample is the probability, calculated
assuming the null hypothesis is true, of seeing a sample with at least as much evidence in
favour of the alternative hypothesis as the sample actually observed. The smaller the p-value,
the more evidence there is in favour of the alternative hypothesis.
Example:
Suppose the manufacturer of the light bulbs in the example above was interested in testing
whether the lifetime of the bulbs was more than 8,000 hours. The manufacturer might then
take a sample of these bulbs and find that the average life of the sample of bulbs was 8,750
hours. The p-value, in this case, is the probability that we could get a sample mean of 8,750
hours or more, assuming that the true average life of the bulbs was 8,000 hours.
In general, smaller p-values indicate more evidence in support of the alternative hypothesis. If a
p-value is sufficiently small, almost any decision-maker will conclude that rejecting the null
hypothesis is the more reasonable decision.
If the p-value is less than 0.01, it provides convincing evidence that the alternative hypothesis
is true.
If the p-value is between 0.01 and 0.05, there is strong evidence in favour of the alternative
hypothesis.
If the p-value is between 0.05 and 0.10, it is in a grey area.
If the p-value is greater than 0.10, there is weak or no evidence in support of the alternative
hypothesis.
Test statistic
To compute the p-value, we use a test statistic computed from the sample data.
The test statistic is a random variable whose distribution is well known so that certain desired
probabilities can be calculated.
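For instance, consider the case where we are testing a statement about µ using a sample size of
at least 30 (n ≥ 30) and σ is known. Then, the test statistic would be:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
where µ0 is the hypothesised value of the population mean.
Now, consider if σ is unknown and the population is normal. Then, the test statistic would be:
$$t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$
where the sample standard deviation, S, has been substituted in place of the population
standard deviation, σ. This statistic follows a t-distribution with n - 1 degrees of freedom.
Fortunately, p-values for a variety of statistical tests are easily calculated using Excel and
other software.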
5. Power of a Test
The probability that we could commit a Type II error is denoted by β. A Type II error occurs
when we fail to reject the null hypothesis in the situation where the null hypothesis
is actually false. Given that, the probability that we would correctly reject a null hypothesis that
was actually false is given by 1 - β. This probability is known as the power of a test.
Example:
Suppose that you were the marketing manager for a pharmaceutical company that has
developed a new drug, and your company was making claims that the drug was safe. In
medical trials to test this claim, the null hypothesis would be that the drug was safe.
If we were to commit a Type II error during these tests, we would be failing to reject a null
hypothesis that was actually false. In other words, we would be saying that the drug was safe
when in fact it was not safe. In this case, we would be committing a serious error.
In this case, we would like to minimise the probability of making a Type II error. To do that,
we would like to increase the power of the test. One popular method of increasing the power
of the test is to increase the sample size.
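The effect of sample size on power can be sketched numerically. The Python example below assumes a right-tailed Z-test with known σ; the means, standard deviation and sample sizes are hypothetical values chosen only for illustration, and the function name `power_right_tailed_z` is not from the course material.

```python
from math import sqrt
from statistics import NormalDist


def power_right_tailed_z(mu_0, mu_true, sigma, n, alpha=0.05):
    """Power of a right-tailed Z-test of H0: mu <= mu_0 when the true mean is mu_true."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                 # boundary of the rejection region
    shift = (mu_true - mu_0) / (sigma / sqrt(n))  # true mean in standard-error units
    beta = z.cdf(z_crit - shift)                  # probability of a Type II error
    return 1 - beta


# Hypothetical values, not taken from any example in the text.
MU_0, MU_TRUE, SIGMA = 100.0, 105.0, 20.0
power_by_n = {n: power_right_tailed_z(MU_0, MU_TRUE, SIGMA, n) for n in (25, 50, 100)}
for n, power in power_by_n.items():
    print(f"n = {n:3d}: power = {power:.3f}")
```

Under these assumptions the power rises steadily as n grows, because a larger sample shrinks the standard error and makes a false null hypothesis easier to detect.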
6. Summary
Here is a quick recap of what we have learnt so far:
7. Glossary
Confidence level: In interval estimation, it is the probability that the value being
estimated lies within the estimated interval. In hypothesis
testing, it is the complement of the level of significance.
Test statistic: In hypothesis testing, a statistic computed from sample data
which follows a well-known distribution so that probabilities
such as the p-value can be calculated.
Segment: Hypothesis Testing
Topic: Hypothesis Tests for a Population Mean and
Population Proportion
Hypothesis Tests for a Population Mean and Population Proportion
Introduction
There are many instances in business where we need to decide about a single population. For
example, a claim might be made that
The average number of complaints made by customers has increased above the usual
level.
A company's product does not weigh the same amount as stated on the packaging.
The average time spent online by internet users has increased. A sample of internet users
could be taken, and a hypothesis test conducted.
In these cases, inferences can be made about a population parameter of interest.
When our data is qualitative in nature, the population parameter of interest would be the
population proportion. The point estimator of this parameter is the sample proportion, which
under certain conditions has an approximately normal sampling distribution.
In this topic, we will explore the concept of testing a population mean and a population
proportion.
Learning Objectives
In hypothesis tests of a mean, we are testing to see if our sample data is consistent with our
hypothesised mean. In order to do this, we need to consider the sampling distribution of the
sample mean. If we know the population standard deviation and our sample size is large
enough, then the sampling distribution of the sample mean will be approximately normal.
In practice though, we rarely know the value of the population standard deviation, and as in
estimation, if σ is unknown, we use s as the replacement. Our sampling distribution will then
follow a T-distribution with n - 1 degrees of freedom.
Example:
Suppose an entrepreneur was considering a location for an electricity-producing windmill
farm. To be successful, the average wind speed at that particular location should be greater
than 32 km/h. In order to test the location, the entrepreneur arranged for 50 wind speed
measurements to be taken at the location on randomly selected days over a period of time.
The average wind speed from the sample of 50 days was 38 km/h with a sample standard
deviation of 25 km/h.
Should the entrepreneur go ahead with the windmill farm?
In this case, we are testing a mean with an unknown population standard deviation, so the
sampling distribution will follow a T-distribution. The null and alternative hypotheses that the
entrepreneur would be interested in testing are:
H0: µ ≤ 32
H1: µ > 32
Assuming the null hypothesis to be true, the sampling distribution is as shown in the following
figure.
The null hypothesis is that the population mean is less than or equal to 32, and the alternative
hypothesis is that the population mean is greater than 32.
The p-value is the probability of obtaining a sample result as extreme as, or more extreme than,
the one observed if the null hypothesis is true. In this case, it is the probability in the tail to
the right of our observed value of 38 km/h since our test is a one-tailed test.
This is a reasonably small p-value. Our decision in this case would be that there seems to be
sufficient evidence to reject the null hypothesis and conclude that the average wind speed is
greater than 32 km/h at this location.
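The calculation for the wind speed example can be sketched in Python. Because the standard library has no t-distribution, this sketch integrates the t density numerically with Simpson's rule; the helper names `t_pdf` and `t_upper_tail` are illustrative, and in practice a statistics package or Excel would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Figures from the wind speed example: H0: mu <= 32, H1: mu > 32.
n, x_bar, s, mu_0 = 50, 38.0, 25.0, 32.0
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
p_value = t_upper_tail(t_stat, n - 1)
print(f"t = {t_stat:.3f}, one-tailed p-value = {p_value:.4f}")
```

With these figures the test statistic is about 1.70 and the one-tailed p-value comes out a little under 0.05.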
In other cases, we may be interested in performing a two-tailed test. In these cases, the p-value
is the probability in both tails.
As we already know, as long as our sample size is reasonably large such that np and n(1 - p) are
greater than or equal to 5, the sampling distribution of the sample proportion is approximately
normally distributed with a mean of p and a standard deviation given by the square root of
p(1 - p)/n. Note that the term n refers to the sample size while the term p refers to the population
proportion.
Similar to hypothesis tests for a population mean, hypothesis tests can also be performed for
population proportions. For proportions, we can calculate the sample proportion from our
sample data and then use this to calculate our p-value for the test.
Example:
A company is considering introducing a new product and has determined that it needs to
capture a market share of 10% to break even. The product will be profitable for the company
if it can capture more than 10% of the market.
Suppose that the company surveyed 400 potential customers to ask them if they would buy
the product. Suppose also that 52 of these people responded that they would purchase the
product.
Given this data, is there enough evidence to suggest that the company should proceed with
the launch of the product?
Solution
The parameter of interest, in this case, is the population proportion of potential customers
who would buy the product. The company has decided that they need to capture more than
10% of the market, so in this case, the null and alternative hypotheses would be:
H0: p ≤ 0.10
H1: p > 0.10
The sample proportion in this case is 52/400 = 0.13. The following figure shows the sampling
distribution assuming the null hypothesis to be true and p-value for the test.
The p-value is the probability of obtaining a sample result as extreme as, or more extreme than,
the one observed if the null hypothesis is true. In this case, it is the probability in the tail to
the right of our observed proportion of 13% since our test is a one-tailed test.
The test statistic is:
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = \frac{0.13 - 0.10}{\sqrt{0.10 \times 0.90 / 400}} = \frac{0.03}{0.015} = 2$$
where $\hat{p}$ is the sample proportion, $p_0$ is the hypothesised population proportion, and n
is the sample size.
The p-value can then be found using the Excel® NORMDIST or NORMSDIST functions:
= 1 - NORMDIST(0.13, 0.1, 0.015, 1)
= 0.02275 = 2.275%
This is a reasonably small p-value. Our decision, in this case, would be to reject the null
hypothesis and conclude that the true population proportion of consumers who would buy
the product is greater than 10%.
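The same calculation can be reproduced outside Excel®. The following Python sketch uses the standard library's `statistics.NormalDist` in place of NORMDIST and the figures from the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the market share example: H0: p <= 0.10, H1: p > 0.10.
n, positive_responses, p_0 = 400, 52, 0.10

p_hat = positive_responses / n            # sample proportion
std_error = sqrt(p_0 * (1 - p_0) / n)     # standard error assuming H0 is true
z = (p_hat - p_0) / std_error             # test statistic
p_value = 1 - NormalDist().cdf(z)         # right-tail probability

print(f"p_hat = {p_hat:.2f}, Z = {z:.2f}, p-value = {p_value:.5f}")
```

The normal approximation is reasonable here because n·p0 = 40 and n(1 - p0) = 360 are both well above 5.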
As for means, we may be interested in performing a two-tailed test for proportions. In these
cases, the p-value is calculated as the probability in both tails.
3. Summary
Segment: Hypothesis Testing
Topic: Testing Differences between Population
Means
Testing Differences between Population Means
Introduction
We quite often have to test for differences between two population means. Depending on the
circumstances, the samples can be either independent or paired.
In this topic, we will explore the concept of testing for differences between two population
means.
Learning Objectives
We saw the case where the light bulb manufacturer claimed that the average life of its light
bulbs was at least 8,000 hours. What if the claim was that their bulbs last longer than another
brand XYZ?
Now it is irrelevant what the average value is; what matters is the difference between the
averages of two brands.
Using the sampling theory we have investigated so far, we can compare two population
averages in two different ways:
1. Paired difference method
2. Independent samples method
As you may guess, these methods are applicable under certain conditions. In this topic, we
shall explore the details of the two methods and their applicability.
The paired difference method is applicable when the two sets of measurements come in matched
pairs. For example, if we wish to evaluate the effectiveness of a training program on employees,
we can measure the performance of each employee before and after the training. It then makes
sense to take the difference between the two measurements for each employee. The difference
can be attributed to the effectiveness of the training. The t-test for the difference between
means from paired samples is used. The advantage of this method is that the effects of
extraneous variables are controlled and thus measured differences are less prone to error. As a
result, the test is more reliable.
The following notation is used for the paired difference test:
$D_0$: the claimed difference in the population means
$\bar{D}$: the mean of the sample differences
$S_D$: the standard deviation of the sample differences
n: the number of pairs
If the null hypothesis is true, and the two populations are normally distributed, the quantity
$$t = \frac{\bar{D} - D_0}{S_D / \sqrt{n}}$$
will follow a t-distribution with (n – 1) degrees of freedom. As a result, we conduct a t-test,
meaning the test statistic is t.
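A minimal Python sketch of the paired difference statistic is given below. The before and after scores are hypothetical values invented for illustration, and the function name `paired_t_statistic` is not from the course material; the sketch assumes the standard form of the statistic, the mean difference minus the claimed difference, divided by the standard error of the differences.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after, d_0=0.0):
    """Return (t, degrees of freedom) for a paired difference test."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = mean(diffs)      # mean of the sample differences
    s_d = stdev(diffs)       # sample standard deviation of the differences
    t = (d_bar - d_0) / (s_d / sqrt(n))
    return t, n - 1


# Hypothetical performance scores for four employees.
before = [10, 12, 9, 11]
after = [12, 14, 10, 13]
t_stat, df = paired_t_statistic(before, after)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

Only the within-pair differences enter the calculation, which is why variation between employees does not inflate the standard error.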
If a paired difference test is possible in a situation, then that must be the preferred method
since it is more reliable. But a paired difference is not always possible.
Suppose we wish to compare the average number of sick days taken by employees in two
departments of a large company over the period of a year.
In this case, we need to take one random sample from the first department and another
independent sample from the other department. We then have to take the difference between
the averages of the two samples.
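The random variable of interest is the difference between the two sample means, denoted by
$\bar{X}_1 - \bar{X}_2$. Its sampling distribution has the following properties:
1. The mean of $\bar{X}_1 - \bar{X}_2$ is the difference between the population means,
$\mu_1 - \mu_2$.
2. Since $\bar{X}_1$ and $\bar{X}_2$ are independent random variables, the variance of their
difference is the sum of their individual variances. In other words:
$$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
3. If the two populations are normally distributed, then $\bar{X}_1 - \bar{X}_2$ is also
normally distributed.
4. Even if the two populations are not normally distributed, as n1 and n2 increase, the
distribution of $\bar{X}_1 - \bar{X}_2$ approaches a normal distribution.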
The above results enable us to conduct a Z-test when either the populations are normally
distributed or the samples are large.
But there is a twist. The computation of Z calls for σ1² and σ2². Most of the time, they are not
known and must be estimated from the samples, in which case a t-test is used instead. Two
cases then arise: the population variances can be assumed to be equal, or they cannot.
The formulas for the sampling distributions are slightly different for the two cases. While we
will be using Excel® to perform these calculations for us, it is instructive to be aware of the
calculations involved.
In the equal variance case, with pooled sample variance $S_p^2$, the test statistic is:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{S_p^2}{n_1} + \dfrac{S_p^2}{n_2}}}$$
If the variances of the two populations can be assumed to be equal, then the equal variance
method is preferred. In cases where the variances are clearly different, then the unequal
variance case will need to be implemented. It should also be noted that there is a hypothesis
test that can be performed to test whether the two variances are equal.
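The two cases can be compared side by side in a short Python sketch. It assumes the usual pooled variance, S_p² = ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2), for the equal variance case and the Welch form with Satterthwaite degrees of freedom for the unequal variance case; the Welch form is an assumption here, since the formula for that case is not shown in the text. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x1, x2, d_0=0.0):
    """Equal variance case: return (t, degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2) - d_0) / sqrt(sp2 / n1 + sp2 / n2)
    return t, n1 + n2 - 2


def welch_t(x1, x2, d_0=0.0):
    """Unequal variance case (Welch): return (t, approximate degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2
    t = (mean(x1) - mean(x2) - d_0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical sick-day counts, not the data from the example that follows.
dept_a = [1, 2, 3, 4, 5]
dept_b = [2, 4, 6, 8, 10]
t_pooled, df_pooled = pooled_t(dept_a, dept_b)
t_welch, df_welch = welch_t(dept_a, dept_b)
print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.2f}")
```

With equal sample sizes the two statistics coincide, but the Welch version has fewer degrees of freedom, which makes it the more cautious test when the variances differ.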
Suppose that we were interested in determining whether there was any difference between
the number of sick days used by employees in two departments in a large firm over the period
of a year. The following table shows the results of the random sample of 20 employees taken
from each of the two departments.
Since the samples are independent, we need to perform an independent samples test for the
difference between the two means.
Before we can do the test though, we need to determine whether the variances in the two
groups are the same. One way to do this might be to generate summary statistics for each
group and make an informed decision. Alternatively, Excel® provides a statistical test for this
comparison known as the F-test. This test can be found under Data/Data Analysis/F-Test
Two-Sample for Variances. The following table shows the Excel® output from this test.
The p-value, in this case, is slightly less than 0.05, which might lead us to reject the null
hypothesis of equal variances and conclude that the variances are significantly different from
each other.
If we were to conclude this, then the T-test for the difference between the two means assuming
unequal variances would be performed. This test can be found in Excel® under Data/Data
Analysis/t-Test: Two-Sample Assuming Unequal Variances.
If we were performing a two-tailed analysis, then, as can be seen from the Excel® output, the
p-value is 0.172. In this case, the p-value is relatively large, and we would not reject the null
hypothesis. There is insufficient evidence to conclude that there was a difference between the
average number of sick days used by employees in the two departments.
randomly chosen days. On those days the average number of cars parked is 120, 130, 124,
127, 128.
For the total number of cars that the owner observed during the five days, the mean and the
standard deviation of the time spent at the car park were 3.6 and 0.4 hours respectively.
Question 1: Which of the following would best represent the hypothesis test the owner
would be interested in testing?
1. H0 : µ = 125, H1 : µ ≠ 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
2. H0 : µ ≤ 125, H1 : µ > 125 and H0 : µ ≤ 3.5, H1 : µ > 3.5
3. H0 : µ ≥ 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
4. H0 : µ = 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
Question 2: Suppose that the p-values obtained for the tests were 0.002 and 0.6
respectively. The owner concludes that the employee is stealing. Is this statement true or
false?
1. True
2. False
5. Summary
Population averages can be compared in two ways, namely, the paired difference method
and the independent samples method.
If the variances of the two populations can be assumed to be equal, then the equal
variance method is preferred.
If the variances are clearly different, then the unequal variance case will need to be
implemented.
A hypothesis test can be performed to test whether the two variances are equal.
Hypothesis tests for the differences between two means can be performed using
Excel®.
6. Answers
Exercise: Testing Differences in Population Means
Question 1: The correct answer is option 2, H0: µ ≤ 125, H1: µ > 125 and H0: µ ≤ 3.5, H1: µ >
3.5
Segment: Regression Analysis
Topic: Simple Linear Regression – Part I
Simple Linear Regression
Introduction
Regression analysis is one of the most important and widely used statistical techniques. It has
many applications in business and economics.
We saw that there are two types of regression. In this topic, we will discuss simple linear
regression in detail. In simple linear regression, we model the relationship between two
variables.
Learning Objectives
At the end of this topic, you will be able to:
comprehend the simple linear regression model and its significance
use the least squares method to obtain the regression equation.
If so, it would be useful to quantify their relationship so that we can predict one of them by
knowing the value of the other, or control one of them by controlling the other.
A simple type of relationship between two variables x and y is the linear relationship where the
graph of the relationship is a straight line as shown in the following figure.
The objective in simple linear regression is to find the relationship between two variables x and
y.
The technique involves developing a mathematical model to describe the relationship between
the variable we are trying to predict and the variable that we suspect is influencing this variable.
The variable we are trying to predict is known as the dependent variable (denoted by y) and the
variable that we would like to use to explain the dependent variable is known as the independent
variable (denoted by x).
In our example, the expenditure on advertising would be the independent variable and sales
would be the dependent variable.
The first step in developing the mathematical relationship is to define the population model. In
the case of simple linear regression, where our objective is to explain the relationship between
the independent variable and the dependent variable, this model takes the form of:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where β0 is the y-intercept, β1 is the slope, and ε is the error term.
The error term (ε) is included in the population model to account for all variables and
measurement uncertainties that we may not have included in our model but may influence the
dependent variable.
We could collect sample data for the two variables and plot them on a graph. The following
figure shows one such plot.
The figure above shows a sample scatter plot, with a series of points scattered almost in a
straight line from the 0 point.
The scatter plot in the figure suggests that there is a linear relationship between the two
variables because the points seem to fall along a straight line.
To quantify the relationship, we fit a straight line to the scatter of points. This straight line is
called the regression line. The regression line has the form:
$$\hat{y} = b_0 + b_1 x$$
The regression coefficients b0 and b1 are estimates of the true population regression
coefficients β0 and β1.
The technique that we use to fit a straight-line to the data is known as the least squares method.
The resulting line is known as the line of best fit.
The line of best fit will not necessarily pass through all of our sample data points. In most cases,
there will be differences between our data points and the line of best fit as shown in the
following figure.
Fig. 3: Residuals
This sample chart (Fig. 3) shows the difference, or sample error, between a data point on a chart,
and the line of best fit.
The differences between the data points and the line of best fit are known as residuals and are
denoted by ei.
The least squares method involves minimising the sum of squares of error (SSE), that is, the
sum of the squared residuals.
The calculations required to determine the line of best fit are easily calculated using statistical
software such as Microsoft Excel®.
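For readers who want to see what the software is doing, the least squares coefficients can be computed directly. The Python sketch below assumes the standard closed-form solution for simple linear regression (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations); the data points and function names are hypothetical and are not the values from Table 1.

```python
from statistics import mean


def least_squares(x, y):
    """Return (b0, b1) for the line of best fit y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx             # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1


def sum_of_squared_errors(x, y, b0, b1):
    """SSE: the sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
sse = sum_of_squared_errors(x, y, b0, b1)
print(f"y_hat = {b0:.2f} + {b1:.2f} x, SSE = {sse:.2f}")
```

Any other straight line through these points would give a larger SSE, which is the sense in which this line is the line of best fit.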
Table 1: Annual Sales and Annual Expenditure on Advertising for the Past Five Years
If we suspect that there is a linear relationship between advertising and sales, then we can
perform a regression analysis on this data. To do this using Microsoft Excel®, we would select
Data/Data Analysis/Regression.
The regression output for this particular example using Microsoft Excel® is shown below.
Information:
A researcher was interested in determining whether there was a relationship between age and
salary for a group of CEOs.
Data was collected for 25 CEOs and charted as shown in the scatter plot below:
Question 1: What effect does the highest paid CEO have on the slope of the line of best fit?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit
Question 2: What would happen to the slope of the line of best fit if the highest paid CEO was
actually 50 years old instead of 70?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit
3. Summary
Here is a quick recap of what we have learnt so far:
Simple linear regression is a technique for estimating the straight line relationship between
two variables.
The most popular approach to regression is the least squares method, which minimises the
SSE.
4. Glossary
Scatter plot: A plot of the paired sample values of two variables, used to show the
relationship between them.
Line of best fit: The straight line fitted to the sample data using the least squares
method.
5. Answers
Exercise: Simple Linear Regression
Question 1: Correct answer is option B - Increases the slope of the line of best fit.
Question 2: Correct answer is option C - Decreases the slope of the line of best fit.
Segment: Regression Analysis
Topic: Simple Linear Regression – Part II
Simple Linear Regression
Introduction
The relationship between the dependent and independent variables can be represented
mathematically by means of the simple linear regression model. Having seen how to obtain the
regression output from Microsoft Excel®, we now focus on evaluating the regression model.
We will study how to interpret the coefficients, assess the model, and predict the value of the
dependent variable for a given value of the independent variable.
Learning Objectives
1. Interpreting Coefficients
The line of best fit can be determined from the regression output. The Excel® output used
here is the example discussed in Part I.
The coefficient b0 in this example is -49.4. This is the y-intercept, the value of y predicted by
the line when the independent variable x is 0. In this case, it would mean that when we have no
advertising (i.e., x = 0), we will have negative sales of US$49,400. Clearly this is not possible
and is meaningless in
this case. We need to be particularly careful when predicting outside our range of data. In this
case, we did not have any data points where advertising levels were around US$0.
The slope coefficient b1 in this example is 34.5. We can interpret this as follows: for each
additional thousand dollars of advertising, sales will on average increase by US$34,500.
The least squares method produces the line of best fit; however, it is important
to assess how well the regression line fits the data. If the fit of the regression line is poor, we
may need to reconsider our model.
Three measures are commonly used for this: the standard error of estimate, a hypothesis test
of the slope, and the coefficient of determination. Note that all three make use of the
Microsoft Excel® output example.
The standard error of estimate is essentially the standard deviation of the residuals and is a
measure of how far on average our data points are from the line of best fit.
In general, the standard error of estimate indicates the level of accuracy of predictions that
can be made from the regression equation. The smaller the standard error of estimate, the
more accurate the predictions tend to be. Comparing the standard error of estimate to the
average value of the dependent variable data is one relative measure that can be used as a
guide to assessing the model.
In this case, the standard error of estimate is 79.749 which when compared to the average
value of our y variable data of 744 is quite reasonable.
To check whether there is a linear relationship, we test the null hypothesis that the slope β1 is
0. The p-value for this test is given in the Microsoft Excel® output example. In this case, it is
0.0038 or 0.38%. Since this value is small (certainly smaller than α = 0.05), we can
reject the null hypothesis and conclude that the slope is not 0. In other words, there seems to
be a significant relationship between sales and advertising.
It should be noted that the p-value calculated in the Microsoft Excel® output is always a two-
tailed test. This value would need to be halved if a one-tailed test is required.
The value for the coefficient of determination can be found in the top left-hand section of the
Microsoft Excel® output example, where it is labelled 'R Square'.
In this example, the R² value is 0.957 or 95.7%. This can be interpreted as advertising levels
explaining 95.7% of the variation in sales. In general, the higher the R² value, the more
confidence we can have in our model.
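For reference, the standard definition of the coefficient of determination is given below. The sums-of-squares notation is an assumption here, since it is not spelled out in this part of the text: SST is the total variation in y, SSE is the variation left unexplained by the regression line, and SSR is the variation explained by it.

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

An R² of 0.957 therefore means that the unexplained variation (SSE) is about 4.3% of the total variation (SST).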
3. Prediction
Another important use of regression is the prediction of y when the x value is known.
By substituting a particular value for the independent variable into the sample regression
equation, we can calculate a value for the dependent variable. This estimate is a point estimate.
For the example of sales and advertising, substituting any given value for the independent
variable, advertising, gives the predicted value of sales. If we substitute the advertising
value of US $30,000 (entered as 30, since the data are in thousands of dollars) into the
equation, the predicted sales will be:
ŷ = -49.4 + 34.5 * 30 = 985.6
Therefore, for an advertising expenditure of US $30,000, the estimated sales are approximately
US $986,000.
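The point estimate can be reproduced with a short Python sketch. The function name is hypothetical; the intercept -49.4 and slope 34.5 are the coefficients used in the text, with both advertising and sales measured in thousands of US dollars.

```python
def predict_sales(advertising_thousands, intercept=-49.4, slope=34.5):
    """Point estimate of sales (US $ thousands) from advertising (US $ thousands)."""
    return intercept + slope * advertising_thousands

predicted = predict_sales(30)
print(f"Predicted sales: US ${predicted * 1000:,.0f}")
```

The result is 985.6 (thousand dollars), which the text rounds to US $986,000.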
4. Summary
The standard error of estimate, a formal hypothesis test of the slope and the coefficient of
determination can be used to assess how well the regression line fits the data.
The standard error of estimate indicates the level of accuracy of predictions that can be
made from the regression equation.
The value of the coefficient of determination, or R² value, is between 0 and 1.
The regression model can be used for prediction.
Segment: Regression Analysis
Topic: Multiple Regression
Multiple Regression
Introduction
In regression analysis, there may often be several independent variables that contain
information about the variable we are trying to predict or understand.
The multiple regression model allows us to consider the relationship of a particular variable with
a set of independent variables.
Learning Objectives
At the end of this topic, you will be able to:
explain the concept of multiple regression analysis
assess the utility of the multiple regression model.
The multiple regression model takes the form y = β0 + β1x1 + β2x2 + … + βkxk + ε, where k is
the number of independent variables x1, x2, …, xk that affect y.
Most of the concepts in multiple regression are similar to those in simple linear regression. For
instance, the equation that best fits the data is taken to be the one that minimises the sum of
the squares of the errors (SSE).
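To make the least-squares idea concrete, here is a minimal sketch that finds the coefficients minimising SSE by solving the normal equations with Gauss-Jordan elimination. It uses only the Python standard library; the function names and the small data set are hypothetical, and this is not how Microsoft Excel® is known to compute its output.

```python
def fit_least_squares(rows, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by minimising SSE.

    Solves the normal equations (X'X) b = X'y with Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    p = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(xi[i] * xi[j] for xi in X) for j in range(p)]
        + [sum(xi[i] * yi for xi, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("independent variables are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

def sum_squared_errors(rows, y, coeffs):
    """SSE for a given coefficient list (intercept first)."""
    return sum(
        (yi - (coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], r)))) ** 2
        for r, yi in zip(rows, y)
    )

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coeffs = fit_least_squares(rows, y)
print([round(b, 6) for b in coeffs])
print(sum_squared_errors(rows, y, coeffs))
```

Because the hypothetical data lie exactly on a plane, the fitted coefficients come out close to 1, 2 and 3 and the SSE is close to zero. With real data the SSE would be positive, and any other choice of coefficients would give a larger SSE.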
Consider the hypothetical example provided in the table below, which relates to the sales of a
particular company. Sales can be treated as the dependent variable, whereas the price of the
product and the advertising expenditure are the independent variables.
To determine the relationship among the variables price, advertisement, and sales for a
particular company, we collected the data for the three variables for the past ten years as
follows:
We assume that there is a linear relationship between the three variables and can use multiple
linear regression to understand the relationship. Here, price and advertisement are independent
variables that can be used to predict the dependent variable sales.
The regression output for this particular example using Microsoft Excel® is shown below.
The coefficient for the intercept, b0, can be interpreted in the same way as in simple linear
regression. In this case, it represents the predicted value of y (ŷ) when x1 and x2 are both zero.
The coefficient of x1, which is b1, represents the change in ŷ when x1 increases by one unit,
assuming that the other independent variable x2 is held constant.
Likewise, the coefficient of x2, which is b2, represents the change in ŷ when x2 increases by one
unit, assuming that the other independent variable x1 is held constant.
The coefficient b0 in this example is -20.74, which signifies the value of ŷ when the two
independent variables x1 and x2 are both zero. Again, this value has little practical meaning,
since at least the price cannot be zero. The slope coefficient b1 in this example is 0.76. We can
interpret this as follows: for each additional dollar of price, sales will on average increase
by US $760, with advertising held constant. Similarly, the slope coefficient b2 in this example
is 0.71. We can interpret this as follows: for each additional thousand dollars of advertising,
sales will on average increase by US $710, with price held constant.
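The "held constant" interpretation can be checked with a small Python sketch of the fitted equation. The function name and the input values are hypothetical, since the data table is not reproduced here. Following the interpretation in the text, price is taken to be in dollars, and advertising and sales in thousands of dollars.

```python
def predict_sales_multiple(price, advertising, b0=-20.74, b1=0.76, b2=0.71):
    """Predicted sales from the fitted equation y_hat = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * price + b2 * advertising

# Hypothetical input values, chosen only to illustrate the coefficients.
base = predict_sales_multiple(price=100, advertising=50)
price_up = predict_sales_multiple(price=101, advertising=50)
adv_up = predict_sales_multiple(price=100, advertising=51)
print(round(price_up - base, 2), round(adv_up - base, 2))
```

Raising one input by one unit while holding the other fixed changes the prediction by exactly that input's coefficient, 0.76 or 0.71 (thousand dollars of sales), whatever the starting values are.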
To test the utility of the regression model, we can specify the following hypotheses:
H0: β1 = β2 = … = βk = 0
H1: at least one βi ≠ 0
This test is known as the F-Test and the p-value for this test can be found in the ANOVA section
of the Microsoft Excel® output.
There are two possible outcomes. The null hypothesis for this test is either:
1. Not rejected: there is insufficient evidence that any of the independent variables is linearly
related to the dependent variable, and therefore the model has limited usefulness.
2. Rejected: this suggests that at least one of the coefficients of the independent variables is
not equal to zero and that the model does have some usefulness.
From the Excel output table, under the ANOVA section, it is evident that we can reject the null
hypothesis since significance F is 0.00 which is less than 0.05. This means at least one of the
coefficients of the independent variables is not equal to zero and we can use the model.
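As a rough cross-check, the F statistic can be recovered from R² using the standard identity F = (R² / k) / ((1 - R²) / (n - k - 1)). The sketch below assumes n = 10 observations (ten years of data), k = 2 independent variables, and the rounded R square value of 0.94 reported for this example. The function names are hypothetical, and the result will differ slightly from the F value in the Excel output because R² is rounded.

```python
def f_statistic(r_squared, n, k):
    """Overall F statistic computed from R^2, sample size n and k predictors."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

def reject_null(significance_f, alpha=0.05):
    """Decision rule: reject H0 when the reported p-value is below alpha."""
    return significance_f < alpha

f_value = f_statistic(0.94, n=10, k=2)
print(round(f_value, 2))
print(reject_null(0.00))
```

A large F value goes together with a small Significance F, which is why the null hypothesis is rejected here.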
The table below the ANOVA section can be used to understand the relationship between each
independent variable and the dependent variable. To check whether each individual regression
coefficient is significant, we can make use of the t-test values provided in the last table of
the multiple regression analysis output. Based on the p-value, we can decide whether a particular
independent variable should be kept in or removed from the model. The p-value reveals whether
there is a linear relationship between the dependent variable and each independent variable. The
fifth column in the table shows the p-value for the intercept and for each independent variable.
We are only interested in the relationship between the dependent and independent variables, so
the p-value for the intercept can be set aside. For each independent variable, the hypotheses are:
H0: βi = 0
H1: βi ≠ 0
In our example, the p-values for both independent variables are less than 0.05. Therefore, we
can state that the variables price and advertising are both significant in the regression model,
which can be used to predict sales.
to the regression model, just because we happen to have that data. When we complete the
regression computations, the R² value may very well have increased from 0.75 to 0.80.
Indeed, it can be shown that adding an independent variable can never decrease R²; it can only
increase it or leave it unchanged.
In other words, if we assess how good a regression is solely by looking at how large the R² value
is, we could be making a mistake. It would be helpful to have an alternative measure for this
purpose.
The adjusted R² is such a measure: it adjusts for the number of predictor terms in the model.
Consequently, when an irrelevant independent variable is added, the adjusted R² will most likely
decrease. Therefore, it is customary to look at the value of the adjusted R² in addition to the
R² value when assessing multiple regression results.
In our example, we can see that the R square value is 0.94 and the adjusted R square is 0.92.
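The relationship between the two figures can be checked with the standard formula for adjusted R², 1 - (1 - R²)(n - 1) / (n - k - 1). The sketch below assumes n = 10 observations (ten years of data) and k = 2 independent variables; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

adj = adjusted_r_squared(0.94, n=10, k=2)
print(round(adj, 2))
```

With R² = 0.94 this gives about 0.92, in line with the adjusted R square quoted above. Because the penalty grows with k, adding a variable that barely raises R² can lower the adjusted value.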
4. Summary
Here is a quick recap of what we have learnt so far:
Multiple regression is an analysis that relates a dependent variable to more than one
independent variable.
In assessing a multiple regression model
o the F-test can be used to test the overall utility of the model
o separate t-tests can be used to check the significance of each individual variable.