Business Statistics 4-6


BUSINESS STATISTICS

SEGMENTS 4-6
Segment: Sampling and Estimation
Topic: Random Sampling Methods
Random Sampling Methods

Table of Contents

1. Concepts in Sampling
2. Simple Random Sampling
3. Stratified Random Sampling
3. Cluster Sampling
4. Comparison of Sampling Methods
5. Summary
6. Glossary
7. Answers

©COPYRIGHT 2019, ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION

Introduction
When a population is large, it may be difficult or impossible to measure every item in the
population and calculate parameters such as the population mean or the population proportion.
In these cases, we must resort to sampling.

This topic discusses some of the concepts in sampling and methods of selecting random samples.

Learning Objectives
At the end of this topic, you will be able to:
• identify the need for sampling methods
• distinguish the different types of random sampling methods
• recognise the significance of each sampling method.


1. Concepts in Sampling
We quite often have to make inferences about some characteristic of a population of interest.
For example, we may be interested in determining the average weekly expenditure on
entertainment per household for a given area. Another example might be if we were interested
in determining the proportion of a given population that watches a particular television program.

In each of these cases, it may be difficult or impossible to contact each member of the
population. We therefore identify a sample of the population and obtain information from that
sample. A sample is a subset of the population; to be useful, it should be representative of
the characteristics of the population.

Besides the difficulty of reaching every member of the population of interest, other important
reasons for taking a sample are time and cost.

The population is the set of all members of the group about which we would like to draw
inferences. Before we can take the sample, we first need a list of all members of the population.
This list is called the sampling frame (or sample frame).

Our sample should be representative of our population. One way to achieve this is to take a
probability sample, in which every member of the population has a known, non-zero chance of
being selected. A properly designed sampling experiment guards against sampling bias, so that
the sample is representative of the population of interest.

Read below for an example of a biased sample.

Literary Digest and the 1936 US Presidential Election Polls

An excellent example of a biased sample is that of the Literary Digest and the 1936 US
presidential election polls.

The Literary Digest held a poll that forecast that Alfred Landon would defeat Franklin
Roosevelt by 57% to 43%. The result of the election was that Roosevelt won by a landslide
victory, getting about 62% of the vote.

The problem with the poll was that they had used lists of telephone and automobile owners
to select their sample. In those days, these were luxuries, so their sample consisted mainly of
middle and upper-class citizens. The majority of this group voted for Landon, but the lower
classes voted for Roosevelt.
Because their sample was biased towards wealthier citizens, their result was incorrect.

Source: Bowerman, B., R. O'Connell and M. Hand. Business Statistics in Practice. Boston:
McGraw-Hill/Irwin, 2001.

In the following sections, we will look at a number of different random sampling techniques
that are commonly used.

2. Simple Random Sampling


The simplest type of random sampling is simple random sampling. A simple random sample of
size n is one in which every possible sample of size n, and hence every member of the
population, has an equal probability of being chosen.

There are two methods of taking a simple random sample:

1. One method is to give each member of the population a number and then choose the
sample, of size n, by random number tables or by using software packages.

Fig. 1: Simple Random Sampling

This graphic shows a population made of 10 people, out of which persons 2, 5, 7, and 10 are
randomly selected. This represents simple random sampling.

2. The second method is to choose every kth item from the sampling frame, starting from a
randomly chosen item among the first k. This is also referred to as systematic sampling.
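The two selection methods above can be sketched in Python. This is a minimal illustration rather than part of the original text: the function names, the frame of ten numbered persons (borrowed from Fig. 1), the interval of 5 and the fixed seed are all assumptions made for the example.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n distinct members; every subset of size n is equally likely."""
    return rng.sample(frame, n)


def systematic_sample(frame, k, rng):
    """Pick a random start among the first k items, then take every k-th item."""
    start = rng.randrange(k)
    return frame[start::k]


rng = random.Random(42)        # fixed seed so the run is repeatable (assumption)
frame = list(range(1, 11))     # sampling frame: persons numbered 1 to 10

srs = simple_random_sample(frame, 4, rng)
sys_sample = systematic_sample(frame, 5, rng)

print("simple random sample:", sorted(srs))
print("systematic sample:   ", sys_sample)
```

With a frame of 10 and an interval of 5, the systematic sample always contains two persons whose numbers differ by 5; only the starting point is random.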


3. Stratified Random Sampling

When we are sampling, we would like to extract as much information from the sample as
possible. One method that can achieve this is stratified random sampling.

Stratified random sampling is a method where the population is separated into sub-populations
called strata. Then instead of taking a random sample from the entire population, we take a
simple random sample from each of the strata.

The aim when selecting strata is to ensure that there is as much variation as possible
between the strata but as little variation as possible within each stratum.

Fig. 2: Stratified Random Sampling
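A stratified draw can be sketched as follows. This is only an illustration under stated assumptions: the age-based strata, their sizes, the 10% sampling fraction (proportional allocation) and the function name are hypothetical and do not come from the text.

```python
import random


def stratified_sample(strata, fraction, rng):
    """Take a simple random sample of the given fraction from every stratum."""
    sample = {}
    for name, members in strata.items():
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample[name] = rng.sample(members, n)
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable (assumption)

# Hypothetical population of 100 customers split into three age strata.
strata = {
    "under_30": [f"U{i}" for i in range(40)],
    "30_to_50": [f"M{i}" for i in range(50)],
    "over_50": [f"O{i}" for i in range(10)],
}

chosen = stratified_sample(strata, 0.10, rng)
for name, members in chosen.items():
    print(name, members)
```

Because every stratum contributes to the sample, a small group such as the over-50 customers cannot be missed by chance, which can happen with a simple random sample of the same size.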

3. Cluster Sampling
Another method of random sampling is cluster sampling. In this method, we separate the
population into clusters, randomly select a number of these clusters, and then survey every
member of each selected cluster or take a simple random sample within it.

The clusters are usually based on geographic dimensions such as cities or suburbs.
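A two-stage cluster draw can be sketched as follows. The suburb names, the cluster sizes, the number of clusters chosen and the function name are hypothetical, introduced only to illustrate the idea.

```python
import random


def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly choose clusters, then take a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


rng = random.Random(5)  # fixed seed so the run is repeatable (assumption)

# Hypothetical city with four suburbs of 20 customers each.
clusters = {
    suburb: [f"{suburb}-{i}" for i in range(20)]
    for suburb in ("North", "South", "East", "West")
}

picked = cluster_sample(clusters, n_clusters=2, n_per_cluster=3, rng=rng)
for suburb, members in picked.items():
    print(suburb, members)
```

Only the selected suburbs need to be visited, which is why cluster sampling is often cheaper than the other methods when the population is geographically spread out.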


Fig. 3: Cluster Sampling

4. Comparison of Sampling Methods


Each method of sampling has different characteristics and typically there is a trade-off between
cost and accuracy among the various methods. The following table lists the advantages and
disadvantages of each method.

Table 1: Advantages and Disadvantages of Various Sampling Methods


Below is an exercise to identify the three main types of probability sampling techniques.

Exercise: Sampling Methods


Question 1: We are involved in a sampling experiment, which is concerned with determining
the credit card usage of customers in a particular city. The following three methods can be
adopted to obtain this information. Can you identify which type of random sampling has been
used in each method?
Options: Simple random sampling, Cluster sampling, Stratified random sampling

1. Select customers randomly from the list of credit card customers.
2. Classify the customers from different suburbs into different groups and select random
customers.
3. Divide the credit card customers into different strata based on their age and select
random customers from each stratum.

5. Summary

Here is a quick recap of what we have learnt so far:

• Samples are randomly selected from populations for the purpose of drawing inferences
about population parameters.
• Simple random sampling is a sampling method where each member of the population
has an equal probability of being chosen.
• Stratified sampling is a sampling method where the population is separated into strata
from which simple random samples are taken.
• Cluster sampling is a sampling method where the population is separated into clusters
from which simple random samples are drawn.


6. Glossary
Sampling: The process of selecting a subset of the population with a view to studying the
characteristics of that population.
Population: The complete set in which we are interested.
Random sample: A sample where every item in the population is given an equal chance of
being selected.

7. Answers
Exercise: Sampling Methods

1. Select customers randomly from the list of credit card customers: Simple random sampling
2. Classify the customers from different suburbs into different groups and select random
customers: Cluster sampling
3. Divide the credit card customers into different strata based on their age and select
random customers from each stratum: Stratified random sampling

In simple random sampling, members are selected randomly from the whole population. In cluster
sampling, the population is divided into groups (clusters) that are each internally
heterogeneous, and members are selected randomly from the chosen clusters. Finally, in
stratified random sampling, internally homogeneous groups (strata) are created and then
members are randomly selected from each stratum.

Segment: Sampling and Estimation
Topic: Sampling Distributions
Sampling Distributions

Table of Contents

1. Sampling for the Population Mean
2. Central Limit Theorem
3. Summary
4. Glossary
5. Answers


Introduction
In many business situations, we would like to estimate some population parameter of interest.
To do this, we can take a sample from the population and try and estimate the population
parameter from the sample.

In order to understand the possible outcomes, when we take our sample, we first need to
understand the concept of the sampling distribution. This concept is crucial in later situations
when we try to make inferences about the population.

In this topic, we shall investigate the concept of the sampling distribution of the sample mean.

Learning Objectives
At the end of this topic, you will be able to:
• recognise the need for the sampling distribution of the sample mean
• comprehend the central limit theorem and its practical usage.


1. Sampling for the Population Mean


Consider the following:

1. A population of size N
2. With a mean, µ
3. A standard deviation, σ

We draw samples of size n from this and we calculate sample statistics such as:

1. The sample mean, x̄
2. The sample standard deviation, s

But, how accurate is the sample mean compared to the population mean, µ?

The sampling distribution of the sample mean is the key to understanding this problem. It is
the conceptual bridge between the probability distribution of a population and statistical
inference based on sample data. If we have knowledge of the sampling distribution, we can
answer questions about the accuracy of the result from our sample data.

Assume for a moment that we are sampling from a population with a known µ and σ. We take a
sample of a certain size n from this population and calculate the sample mean, x̄. Imagine that
we continue to take samples of size n from this population and calculate the sample mean of
each. If we then take these sample means and construct a histogram, it will approximate the
probability distribution of all possible values of x̄. In other words, it approximates the
sampling distribution of the sample mean.

Here are a few observations:

1. The mean of the sampling distribution of the sample mean in each case is equal to the
mean of the population from which we have sampled. That is

$$\mu_{\bar{x}} = \mu$$

This formula states that the mean of the sample mean, µ with subscript x̄, is equal to the
population mean, µ.

2. The standard deviation of the sampling distribution in each case is equal to the population
standard deviation divided by the square root of the sample size. That is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

This formula states that the standard deviation of the sample mean, σ with subscript x̄, is
equal to the population standard deviation divided by the square root of the sample size.

3. If the population is normal, then the sampling distribution of the sample mean is also
normal.
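The first two observations can be checked with a small simulation. This is a sketch under stated assumptions, not part of the original text: the population is taken to be normal with µ = 50 and σ = 10, the sample size is 25, and 5,000 repeated samples are drawn with a fixed seed.

```python
import math
import random
import statistics

rng = random.Random(1)       # fixed seed so the run is repeatable (assumption)
MU, SIGMA = 50.0, 10.0       # hypothetical population mean and standard deviation
N, REPEATS = 25, 5000        # sample size and number of repeated samples

# Draw many samples of size N and record the mean of each one.
sample_means = []
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
theory_sd = SIGMA / math.sqrt(N)   # sigma divided by the square root of n

print(f"mean of sample means: {mean_of_means:.2f} (population mean {MU})")
print(f"sd of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {theory_sd:.2f})")
```

The mean of the simulated sample means should sit close to 50, and their standard deviation close to 10 / √25 = 2, which is much smaller than the population standard deviation of 10.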

It is extremely important to note the difference between the population standard deviation and
the standard deviation of the sample mean. Understanding this difference is the key to
understanding the concepts involved in statistical inference.

The standard deviation of the sample mean is also known as the standard error. In order to avoid
confusion, we will refer to the standard deviation of the sample mean as the standard error in
the following topics.

2. Central Limit Theorem


The central limit theorem is an important extension of our discussion of the sampling
distribution. We saw that if the population from which we sample is normal, then the sampling
distribution of the sample mean will also be normal. Unfortunately, this may not be helpful to
us in the business environment since many populations we sample from are not normal or cannot
be assumed to be normal.

A remarkable discovery called the central limit theorem solves this problem though. It is arguably
one of the most important theorems in the development of statistical inference.

The central limit theorem states that even if the population from which we are sampling is not
normal, as long as our sample size is sufficiently large, the sampling distribution of the sample
mean will be approximately normal.

The larger the sample size, the more closely the sampling distribution will resemble a normal
distribution.

In practice, a sample size of 30 or more is usually sufficient to ensure that our sampling
distribution will be approximately normal.
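The effect of sample size can be illustrated with a simulation. The sketch below assumes a right-skewed population (an exponential distribution with mean 1, chosen only for illustration) and compares the skewness of the sample means for n = 2 and n = 30; a skewness near zero indicates a roughly symmetric, bell-like shape. The function names and the seed are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def sample_means(n, repeats, rng):
    """Means of `repeats` samples of size n from a right-skewed population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]


rng = random.Random(7)  # fixed seed so the run is repeatable (assumption)

skew_small = skewness(sample_means(2, 5000, rng))
skew_large = skewness(sample_means(30, 5000, rng))

print(f"skewness of sample means, n = 2:  {skew_small:.2f}")
print(f"skewness of sample means, n = 30: {skew_large:.2f}")
```

The sample means for n = 30 are far less skewed than those for n = 2, even though both are drawn from the same skewed population.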


Below is an exercise to practice what you have learnt about the central limit theorem.

Exercise: Central Limit Theorem

A researcher is trying to estimate the salaries of MBA graduates two years after graduation.
Suppose that, unknown to the researcher, the distribution of the salaries of these graduates
is right-skewed with a mean of US$100,000 and a standard deviation of US$60,000.
Question 1: If the researcher sampled 30 students, which of the following would best
represent the sampling distribution of the sample mean:
1. Approximately normal with a mean of US$100,000 and a standard deviation of US$60,000
2. A T-distribution with a mean of US$100,000 and a standard deviation of US$10,954
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$10,954
4. A T-distribution with a mean of US$100,000 and a standard deviation of US$60,000

Question 2: Now, if the researcher sampled 100 students, which of the following would best
represent the sampling distribution of the sample mean?
1. Approximately normal with a mean of US$100,000 and a standard deviation of US$10,954
2. Approximately normal with a mean of US$60,000 and a standard deviation of US$6,000
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$6,000
4. Approximately normal with a mean of US$60,000 and a standard deviation of US$10,954

3. Summary

Here is a quick recap of what we have learnt so far:

• The sampling distribution of the sample mean describes the probability distribution for
the possible values of our sample mean.
• If our population is normal then our sampling distribution will also be normal, regardless
of the sample size, with a mean equal to the population mean.
• The standard deviation of the sample mean is also known as the standard error.
• The central limit theorem states that regardless of the shape of the population, as long
as our sample size is sufficiently large, the sampling distribution of the sample mean will
be approximately normal.


4. Glossary
Sample: A set of items selected from a population. Most of the time we are interested in
random samples, where the selection of items is carried out giving an equal chance of
selection to every item in the population.
Standard deviation: The standard deviation of a data set is the positive square root of the
variance.
Population: The complete set in which we are interested.
Probability: The probability of an event is the likelihood of occurrence of that event.

5. Answers
Exercise: Central Limit Theorem
Question 1:

The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$10,954.
Since the sample size is at least 30, the central limit theorem says that the sampling
distribution of the sample mean will be approximately normal with a mean of US$100,000 and
a standard deviation of US$10,954 (given by sigma divided by the square root of n).

Question 2:

The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$6,000.

Since the sample size is greater than 30, the central limit theorem says that the sampling
distribution of the sample mean will be approximately normal with a mean of US$100,000 and
a standard deviation of US$6,000 (given by sigma divided by the square root of n).
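Both answers rest on the same calculation, sigma divided by the square root of n. A short sketch, in which the function name is a hypothetical helper introduced for illustration, reproduces the two standard errors from the figures given in the exercise.

```python
import math


def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


SIGMA = 60_000  # population standard deviation of salaries from the exercise (US$)

se_30 = standard_error(SIGMA, 30)
se_100 = standard_error(SIGMA, 100)

print(f"n = 30:  standard error = US${se_30:,.0f}")
print(f"n = 100: standard error = US${se_100:,.0f}")
```

The mean of the sampling distribution stays at US$100,000 in both questions; only the standard error shrinks as the sample size grows.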

Segment: Sampling and Estimation
Topic: Confidence Interval for a Mean
Confidence Interval for a Mean

Table of Contents

1. Estimation Concepts
2. Confidence Interval for a Mean (σ Known)
3. Confidence Interval for a Mean (σ Unknown)
4. Sample Size Determination
5. Summary
6. Glossary
7. Answers


Introduction
In many situations, we need to make inferences about population parameters of interest. One
inference that we may be interested in making is an estimate of the population mean. To do this,
we need to make use of our sampling distribution.

In this topic, we discuss estimation and develop the concept of the interval estimate for a
population mean.

Learning Objectives
At the end of this topic, you will be able to:
• describe the two estimation methods
• use the T-distribution to identify the best estimate when the population standard deviation
is unknown
• illustrate how to determine the sample size.


1. Estimation Concepts
The aim of estimation is to determine, from sample data, the approximate value of a population
parameter of interest. For example, the sample mean is an estimate of the true population mean.

There are two types of estimates, namely:

1. Point estimate
2. Interval estimate

Read below to learn more about the two types of estimates.

Point estimates
A point estimate is our 'best guess' of the true population parameter. Suppose we are trying
to estimate the average income of students at a particular university. We could do this by
taking a sample of a certain size and calculating the sample mean. This sample mean is a point
estimate and is our best guess of the true value. While this estimate might be suitable, it does
not take into account all the information in the sample we have collected.
For example, one other sample statistic we could calculate is the sample standard deviation.
In addition, a point estimate gives no indication of how accurate our estimate is compared to
the true (and unknown) population mean.

Interval estimate
An alternative approach is the interval estimate. In this method, we specify an interval over
which we have a degree of confidence that the true parameter lies. To do this, we need to
make use of our sampling distribution.

2. Confidence Interval for a Mean (σ Known)


Recall that the sampling distribution tells where we are likely to find our sample mean given that
we know the population mean and population standard deviation. For example, given a
population with a mean of µ and a standard deviation of σ, the central limit theorem states that
as long as our sample size is large enough, the sampling distribution will be approximately normal
with a mean of µ and a standard deviation (standard error) of

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

This formula states that the standard error, σ with subscript x̄, is equal to the population
standard deviation divided by the square root of n.

Given that the sampling distribution is approximately normal, we can state that there is
approximately a 68% chance that our sample mean will fall in the range

$$\mu \pm 1\,\sigma_{\bar{x}}$$

This formula states that, if the sampling distribution is approximately normal, the sample mean
will fall in the range of the population mean plus or minus one standard error.

Likewise, there is approximately a 95% chance that the sample mean will fall between

$$\mu \pm 2\,\sigma_{\bar{x}}$$

This formula states that the sample mean will fall between the population mean plus or minus
two standard errors.

In general, our equation becomes

$$\bar{x} = \mu \pm Z\,\sigma_{\bar{x}}$$

This formula states that the sample mean is equal to the population mean plus or minus Z
multiplied by the standard error.

Where Z is the number of standard errors each side of the mean.

This equation can then be rearranged in order to obtain the general form of the interval estimate
or confidence interval as shown below:

$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}}$$

This formula states that the population mean is equal to the sample mean plus or minus Z
multiplied by the standard error.

Then, to obtain the confidence interval for the population mean, all we need to do is specify the
confidence level and we can derive the confidence interval. Typical confidence levels are 90%,
95%, or 99%, with 95% being the most commonly used.

For example, suppose we are interested in determining the 95% confidence interval for the
average amount that inner-city workers in a particular city spend on coffee each week. Assume
that we know that the population standard deviation is US$15. If we take a sample of 100
people and find that the average of this sample is US$20, we can then calculate the
confidence interval of interest.

The 95% confidence interval would then become

$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}} = 20 \pm 1.96 \times \frac{15}{\sqrt{100}} = 20 \pm 2.94$$

This formula states that the population mean is equal to the sample mean plus or minus Z
multiplied by the standard error. This is equal to 20 plus or minus 1.96 multiplied by 15
divided by the square root of 100, which is then equal to 20 plus or minus 2.94.

In other words, we are 95% confident that the interval US$20 ± US$2.94 created contains the
true value of the population mean for the average weekly spending on coffee.

It should also be noted that in the example discussed, there is a 5% chance that the interval we
have created will not contain the true population parameter.
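The coffee example can be reproduced with a few lines of Python. This is a sketch rather than a prescribed method: the function name is hypothetical, and the Z value is taken from the standard library's normal distribution instead of a printed table, so it is 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * sigma / math.sqrt(n)                   # Z times the standard error
    return sample_mean - margin, sample_mean + margin


# Figures from the example: mean US$20, sigma US$15, sample of 100 people.
low, high = z_confidence_interval(sample_mean=20, sigma=15, n=100)
print(f"95% confidence interval: US${low:.2f} to US${high:.2f}")
```

Passing a different value for confidence, such as 0.90 or 0.99, gives the other typical confidence levels mentioned above.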

3. Confidence Interval for a Mean (σ Unknown)


One problem that typically arises in calculating the confidence interval is that we rarely know
the value of the population standard deviation, σ.


In this case, the next best option is to use the sample standard deviation, s, calculated from
sample data. However, in doing this we introduce additional variability, and the standardised
sample mean no longer follows the standard normal distribution.

Instead, provided the population is approximately normal, it follows a T-distribution. The
T-distribution is similar to the standard normal distribution and has the following properties:

1. It has a mean of 0.

2. It has a standard deviation that varies with the sample size through the degrees of
freedom, which equal the sample size minus one. As the degrees of freedom increase, the
T-distribution approaches the standard normal distribution.

Some examples of the T-distribution are shown in the following graph.

Fig. 1: Examples of T-distributions

Three different T-distributions are shown, all running from −4 to +4 on the x-axis and
reaching different heights on the y-axis at the zero point of the x-axis: the green curve
reaches 0.3, the red curve reaches 0.35 and the blue curve reaches 0.4.

In using the sample standard deviation, the general form of the confidence interval becomes

$$\mu = \bar{x} \pm t\,s_{\bar{x}}$$

This formula states that the population mean is equal to the sample mean plus or minus t
multiplied by the sample standard error, s with subscript x̄.

Where

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

This formula states that the sample standard error is equal to the sample standard deviation
divided by the square root of the sample size.

The T-value indicates the number of sample standard errors by which the sample mean differs
from the population mean. If we specify the confidence level, we can then find an appropriate
T-value for our confidence interval equation.
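A T-based interval can be sketched as follows. Everything specific here is an assumption made for illustration: the ten weekly spending figures are invented, the function name is hypothetical, and the 95% critical values are copied from a standard T-table for a few degrees of freedom because the Python standard library has no T-distribution.

```python
import math
import statistics

# Two-sided 95% critical values from a standard T-table, keyed by degrees of freedom.
T_95 = {9: 2.262, 24: 2.064, 29: 2.045}


def t_confidence_interval(data):
    """95% confidence interval for a mean when sigma is unknown."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)          # sample standard deviation (n - 1 divisor)
    t = T_95[n - 1]                     # degrees of freedom = n - 1
    margin = t * s / math.sqrt(n)       # t times the sample standard error
    return mean - margin, mean + margin


# Hypothetical weekly coffee spend (US$) for a sample of 10 workers.
spend = [18, 25, 12, 30, 22, 15, 27, 19, 21, 11]

low, high = t_confidence_interval(spend)
print(f"95% confidence interval: US${low:.2f} to US${high:.2f}")
```

For the same confidence level, the T-value (2.262 for 9 degrees of freedom) is larger than the corresponding Z-value of 1.96, so the interval is wider; this reflects the extra uncertainty from estimating σ with s.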

4. Sample Size Determination


So far we have assumed that we have taken a sample of a certain size and calculated the
confidence interval for our population mean.

One question that will arise is how big a sample should we take in the first place?

Recall that our general form of the confidence interval is

$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}}$$

In most business situations, we would like to limit our uncertainty in our estimate of the
population mean.

If we let the maximum error we are willing to tolerate be represented by B, then it can be shown
that the required sample size is

$$n = \left(\frac{z\,\sigma_{est}}{B}\right)^2$$

rounded up to the next whole number.

Where

• z is the number of standard errors associated with a given confidence level (1.96 if we
assume a 95% confidence interval).
• σest is our estimate of the population standard deviation.
• B is the required maximum error in our estimate.
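The calculation can be sketched in Python. The sketch assumes the usual form of the sample size formula, n = (z × σest / B)² rounded up to a whole number; the function name and the figures used (an estimated standard deviation of US$15 and a maximum error of US$3) are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma_est, max_error, confidence=0.95):
    """Smallest n for which z * sigma_est / sqrt(n) does not exceed max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil((z * sigma_est / max_error) ** 2)


n = required_sample_size(sigma_est=15, max_error=3)
print("required sample size:", n)
```

Rounding up rather than to the nearest whole number keeps the error within B. Halving B roughly quadruples the required sample size, because B is squared in the denominator.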

Exercise
Below is an exercise to practice what you have learnt on confidence intervals.

Exercise - Confidence Intervals


You have to estimate the average rental price of a two-bedroom unit in a particular city with
a 95% confidence level. Assume that the mean of this distribution is US$500 with a standard
deviation of US$150.

Question 1: Given the above what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within US$20?
1. 158
2. 190
3. 200
4. 217

Question 2: Given the above what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within 5% of the true value?
1. 100
2. 105
3. 110
4. 139

Question 3: Given the above what is the sample size required in Question 1 if the standard
deviation in the price of two-bedroom units was twice as large (i.e., US$300)?
1. 550
2. 800
3. 825
4. 865


5. Summary
Here is a quick recap of what we have learnt so far:

• Estimates can be point estimates or interval estimates.
• A point estimate is the 'best guess' of the true population parameter. In this approach, a
sample of a certain size is used for calculating the sample mean.
• An interval estimate for a mean gives a degree of confidence that our population mean lies
in the confidence interval generated.
• If we do not know the population standard deviation, then we need to use the sample
standard deviation as our best estimate. The resulting sampling distribution follows a
T-distribution.
• The required sample size can be calculated if we know how much uncertainty we can have in
our estimate.
• Microsoft Excel® can be used to calculate confidence intervals.

6. Glossary
Estimation: The process of inferring the value of an unknown parameter using sampling.

7. Answers
Exercise: Confidence Intervals

Question 1: Correct answer is option 4, 217.


Question 2: Correct answer is option 4, 139.
Question 3: Correct answer is option 4, 865.
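The working behind these answers is not shown in the text. The following sketch assumes the sample size formula n = (z × σest / B)², with z = 1.96 for a 95% confidence level and the result rounded up to the next whole unit; under those assumptions it reproduces all three answers.

```latex
\begin{align*}
\text{Question 1: } n &= \left(\frac{1.96 \times 150}{20}\right)^2 = 14.7^2 = 216.09
  \;\Rightarrow\; n = 217 \\
\text{Question 2: } B &= 0.05 \times 500 = 25, \quad
  n = \left(\frac{1.96 \times 150}{25}\right)^2 = 11.76^2 \approx 138.30
  \;\Rightarrow\; n = 139 \\
\text{Question 3: } n &= \left(\frac{1.96 \times 300}{20}\right)^2 = 29.4^2 = 864.36
  \;\Rightarrow\; n = 865
\end{align*}
```

Doubling the standard deviation in Question 3 multiplies the unrounded sample size by four (216.09 × 4 = 864.36), since σest is squared in the formula.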

Segment: Sampling and Estimation
Topic: Confidence Interval for a Proportion
Confidence Interval for a Proportion

Table of Contents

1. Sampling Distribution for a Proportion
2. Confidence Interval for a Proportion
3. Sample Size Determination
4. Summary
5. Answers


Introduction
When dealing with qualitative data, we can determine the proportion of times that a value of
interest occurs. The parameter of interest in these cases is the population proportion.

We might be interested in making inferences about the population proportion. This can be a
point estimate such as the sample proportion or an interval estimate as already discussed.

In this topic, we will discuss the concept of interval estimate for a proportion.

Learning Objectives
At the end of this topic, you will be able to:
• describe the sampling distribution for a population proportion
• determine the confidence interval for a population proportion
• determine the required sample size when estimating proportions.


1. Sampling Distribution for a Proportion


If our data type is qualitative, then the summary measure we are interested in is the proportion
of times that our value occurs. Surveys are often used to estimate proportions with the
parameter of interest being the population proportion, p.

The relationship between the population proportion and the sample proportion is shown in the
following figure.

Fig. 1: Schematic Diagram of Sampling for a Population Proportion

This graphic shows the relationship between the population proportion and the sample
proportion. A big square represents the population, with an arrow pointing to a small oval shape
which represents the random sample.

For example, suppose that we were interested in launching a new product and decided to
conduct a survey to try and find out the proportion of consumers who would be interested in
buying our product. From the survey results, we can obtain the proportion of consumers in our
sample who said that they would buy our new product. This proportion is known as the sample
proportion. Our objective is to use this result to make an inference about the true population
proportion.

The sample proportion is a point estimate of the population proportion and is represented as

p̂ = x / n

where x is the number of successes and n is the number of trials.

Overall, the true population proportion is p, from which we take a sample of size n. The number
of successes in a sample of size n is x. Therefore, the proportion of successes in a sample is
p̂ = x / n.

The number of successes (in this case the number of consumers who said that they would buy
the product) is a binomial random variable with

• a mean of E(x) = np
• a standard deviation of √(np(1 − p)).

Rather than consider the number of successes, it is far easier to talk about the proportion of
successes, the sample proportion p̂ = x / n. It can be shown that, as long as n is reasonably
large with np > 5 and n(1 − p) > 5, the mean of p̂ is given by

E(p̂) = p

and the standard deviation of p̂ is given by

σ(p̂) = √(p(1 − p) / n)

The sampling distribution of the sample proportion is therefore approximately normal, with
a mean of p and a standard deviation given by the square root of p(1 − p)/n.

As in the case of the mean, the standard deviation of the sample proportion is also known as the
standard error.
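
The quantities above can be sketched in a few lines of Python. This is only an illustration: the function name and the survey figures (120 positive responses out of 500) are hypothetical, and because the true p is unknown in practice, the sketch uses the sample proportion in its place when estimating the standard error and checking the large-sample conditions.

```python
import math

def proportion_summary(x, n):
    """Return the sample proportion, its estimated standard error,
    and whether the large-sample conditions appear to hold."""
    p_hat = x / n
    # Standard error, with p_hat standing in for the unknown p
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Rule of thumb from the text: np > 5 and n(1 - p) > 5
    normal_ok = n * p_hat > 5 and n * (1 - p_hat) > 5
    return p_hat, std_error, normal_ok

# Hypothetical survey: 120 of 500 respondents say they would buy the product
p_hat, std_error, normal_ok = proportion_summary(x=120, n=500)
print(p_hat, round(std_error, 4), normal_ok)
```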


2. Confidence Interval for a Proportion


The procedure for obtaining the confidence interval for a population proportion is very similar
to the procedure for obtaining the confidence interval for the mean. The confidence interval is
our point estimate plus or minus some number of standard errors.

The confidence interval is based on a large sample size. As already discussed, the required
conditions are

np > 5 and n(1 − p) > 5

If these conditions are satisfied, then the sampling distribution will be approximately normal,
and the confidence interval is given by

p̂ ± Z √(p̂(1 − p̂) / n)

where p̂ is the sample proportion and n is the sample size.

The appropriate Z-multiple depends on the level of confidence we require. For the 95%
confidence interval, the Z-value is 1.96.

As an example, suppose that we are conducting a survey of 400 households to find the
proportion of those that have purchased a high definition television. Assuming our survey found
that 60 of these have made such a purchase, estimate with 99% confidence the true proportion
of households in the population of interest that have made this purchase.

The point estimate, in this case, is p̂ = 60/400 = 0.15. The interval estimate or confidence
interval is given by

p̂ ± Z √(p̂(1 − p̂) / n)

We first check whether the conditions for the normal approximation are satisfied, using the
sample proportion in place of p:

n p̂ = 400 × 0.15 = 60 > 5
n(1 − p̂) = 400 × 0.85 = 340 > 5

In both cases the conditions are satisfied.

Since we want the 99% confidence interval, we need the Z-value such that the total area in
both tails is equal to 0.01. In other words, the area in each tail is 0.005. Using the NORMSINV
function in Microsoft Excel®, the corresponding Z-value can be found as

= NORMSINV(0.005)
= -2.5758

This value is negative since we have calculated the Z-value corresponding to the left-hand tail.
We are only interested in the magnitude of this number.

Therefore the 99% confidence interval is given by

0.15 ± 2.5758 × √(0.15 × 0.85 / 400) = 0.15 ± 0.046

In other words, we can be 99% confident that the true population proportion of households that
have purchased a high definition television lies between 10.4% and 19.6%.
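
As a cross-check on the household example, the same interval can be computed outside Excel. The sketch below is an illustration only; the function name is hypothetical, and it uses NormalDist from the Python standard library in place of NORMSINV.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(x, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    # Two-tailed critical value: leave (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Household survey from the text: 60 purchasers out of 400, 99% confidence
low, high = proportion_confidence_interval(x=60, n=400, confidence=0.99)
print(f"{low:.3f} to {high:.3f}")
```

The printed limits should agree with the 10.4% to 19.6% interval obtained above.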

3. Sample Size Determination

As for the case with means, we need to decide on how big a sample to take when we are
estimating proportions. In order to do this, we again have to specify the maximum amount of
uncertainty that we are willing to tolerate in our estimate of the true population proportion.

Recall the formula for the confidence interval for a proportion. The term to the right of the
point estimate is the uncertainty term. The uncertainty in our estimate of the population
proportion is given by

± Z √(p̂(1 − p̂) / n)

If we let the maximum error we are willing to tolerate be denoted by B, then setting the
uncertainty term equal to B and solving for n shows that the required sample size is given by

n = Z² p̂(1 − p̂) / B²

where

• Z is the value corresponding to our desired level of confidence
• B is the maximum error we are willing to tolerate
• p̂ is an estimate of the true population proportion.

In most cases, we do not have any knowledge of the true population proportion. In these cases,
it is best to use a value of 0.5, because p̂(1 − p̂) is largest at 0.5 and therefore gives the
most conservative (largest) sample size.
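
The sample size formula can be sketched as a small Python function. The function name, the default values and the 5% error in the example call are assumptions made for illustration; the result is rounded up because a sample size must be a whole number.

```python
import math
from statistics import NormalDist

def required_sample_size(B, confidence=0.95, p_estimate=0.5):
    """Sample size needed to estimate a proportion to within +/- B."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p_estimate * (1 - p_estimate) / B ** 2
    # Round up so that the error does not exceed B
    return math.ceil(n)

# Hypothetical case: no prior knowledge of p (use 0.5), error of +/- 5%
print(required_sample_size(B=0.05))
```

Passing a different p_estimate shows how much smaller the sample can be when a reasonable prior estimate of the proportion is available.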

Below is an exercise to practice estimating sample size for proportions.


Exercise: Sample Size for Proportions
Suppose that we were interested in determining the proportion of people in a particular city
that have a digital camera. We would like to estimate this proportion to within plus or minus
3%, with 95% confidence.

Question: Based on the information above, if we believe that the population proportion is
20%, what sample size would be required to estimate the proportion of people in this city that
own a digital camera?
1. 453
2. 500
3. 633
4. 683

4. Summary

Here is a quick recap of what we have learnt so far:

• When dealing with qualitative data, it is generally the population proportion that is of
interest.
• It can be shown that as long as n is large enough such that np and n(1 − p) are greater than or
equal to 5, then the sampling distribution of the sample proportion is approximately normal,
with a mean of p and a standard deviation given by the square root of p(1 − p)/n.
• As for means, confidence intervals can be developed for the population proportion.
• The required sample size can be determined based on certain assumptions.

5. Answers
Exercise: Sample Size for Proportions
The correct answer is option 4, 683.
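The answer is calculated as:

n = z² p(1 − p) / e²

where z = 1.96 (for a 95% confidence interval), p = 0.2 is the assumed population proportion and
e = 0.03 is the maximum error. The numbers become

n = 1.96² × 0.2 × 0.8 / 0.03² = 682.95

which is rounded up to 683.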

Segment: Hypothesis Testing
Topic: Concepts in Hypothesis Testing
Concepts in Hypothesis Testing

Table of Contents

1. Null and Alternative Hypotheses
2. One-Tailed and Two-Tailed Tests
3. Type I and Type II Errors
4. Significance Level and p-Value
5. Power of a Test
6. Summary
7. Glossary


Introduction

A hypothesis is a tentative explanation for an observation or phenomenon that has not yet been
verified. Hypothesis testing is the process of determining whether or not a given hypothesis is
consistent with observed facts.

This topic introduces the concepts involved with hypothesis testing and some of the issues
involved in testing the hypothesis.

Learning Objectives
At the end of this topic, you will be able to:
• define null and alternative hypotheses
• distinguish between one-tailed and two-tailed tests
• recognise the significance level and calculate the p-value
• examine Type II errors and the power of a test.


1. Null and Alternative Hypotheses


We often need to make inferences about a population of interest based on sample data. The
inferences often involve making a decision about a particular theory or hypothesis that we would
like to test.

Example
For instance, a manufacturer of light bulbs might claim that the average life of their bulbs is
at least 8,000 hours. Based on the results of a sampling experiment conducted on the bulbs,
it is possible to calculate the probability of observing the result obtained (or more extreme)
from our sampling experiment if the claim under test is true. Depending on the results
obtained, the manager can then decide whether or not to reject the claim.
Before we can test the particular theory or hypothesis though, we need to understand the
basic concepts involved in hypothesis testing.

For the case of the claim made by the manufacturer of light bulbs about the average life of
the bulbs, we have:

H0: µ ≥ 8,000 hours
H1: µ < 8,000 hours

It could happen that someone else, such as a consumer advocate, could make a counterclaim
that µ < 8,000. If so, which claim do we take as the null hypothesis?

It is customary to formulate the null hypothesis such that if it were true, then no special action
would be necessary, and if the alternative is true, then some special action would be
necessary.
This approach would favour the H0 and H1 as outlined above. The burden of proof is usually
on the alternative hypothesis H1. It is up to the researcher to provide enough evidence in
support of the alternative; otherwise, we must continue to believe that the null hypothesis is
true.
In the above case, if there was enough evidence in favour of the alternative hypothesis (i.e., µ <
8,000), then some action, such as requiring the manufacturer to withdraw the claim and pay for
any damages caused, might be necessary.

Two important issues that should be kept in mind are:


1. Tests are only performed on population parameters.
2. There is always an 'equals' sign in the null hypothesis. Note that this could be '=', '≤' or '≥'.

2. One-Tailed and Two-Tailed Tests


Hypothesis tests can be either one-tailed or two-tailed, depending on what we are trying to
prove.
Read below to find out which test method to use.


Hypothesis Testing Methods

One-tailed hypothesis test


A one-tailed hypothesis is one where the only sample results which can lead to rejection of
the null hypothesis are those in a particular direction. One-tailed alternatives are phrased in
terms of '>' or '<'.
For example, if the manufacturer was interested in whether or not the average life of the bulbs
was more than 8,000 hours, then the hypothesis might be set-up as:
H0: μ ≤ 8,000 hours
H1: μ > 8,000 hours
This is one-tailed.

Two-tailed hypothesis test


A two-tailed test is one where results in either of two directions can lead to rejection of the
null hypothesis.
Two-tailed alternatives are phrased in terms of '≠'.
For example, if the manufacturer of the light bulbs was only interested in whether the average
life of their bulbs was 8,000 hours, then the hypothesis might be set-up as a two-tailed test:
H0: μ = 8,000 hours
H1: μ ≠ 8,000 hours

Once the hypotheses are set-up, it is easy to detect whether the test is one-tailed or two-tailed.

The real question is whether to set-up a hypothesis for a particular problem as one-tailed or two-
tailed. There is no statistical answer to this question. It depends entirely on what we are trying
to prove.

3. Type I and Type II Errors


An analyst can either reject or fail to reject the null hypothesis being tested.
Hopefully, the analyst will make the right decision, but it needs to be recognised that a mistake
could be made. The outcomes of a decision are shown in the figure below.


Fig. 1: Outcomes of Decisions Made When the Null Hypothesis is Rejected or Accepted

As can be seen from this figure, if we reject a null hypothesis that was false then we have made
a correct decision. Likewise, if we fail to reject a null hypothesis that was true, then we have also
made the correct decision.

However, we need to recognise that there are two types of decisions where we are making an
error:

1. The mistake of rejecting a true H0 is called a Type I error.
2. The mistake of failing to reject a false H0 is called a Type II error.

For example, suppose a bank manager was interested in whether or not the mean waiting time
for customers had increased from its previous value. In this case, the null hypothesis might be
that the mean waiting time has not changed. If the manager sampled a group of customers and
performed a hypothesis test on the results, the possible errors that the manager could make
are:

1. Type I error: Concluding that the average waiting time had increased when it had not
2. Type II error: Concluding that the average waiting time was the same, when in fact it had
increased

4. Significance Level and p-Value


The real question is how strong the evidence in favour of the alternative hypothesis must be in
order to reject the null hypothesis. To do this, we need to consider the concepts of the
significance level and the p-value.


Significance levels
The researcher must determine the maximum probability of a Type I error that he or she is
willing to tolerate. This value is denoted by α, the significance level of the test, and is most
commonly equal to 0.05, although α = 0.01 and α = 0.10 are also frequently used. Then, given
the value of α, we can use statistical theory to determine the rejection region. If the sample
evidence falls into this region, we reject the null hypothesis; otherwise, we do not reject it.
Sample evidence that falls into the rejection region is called statistically significant at the α
level.
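
To make the rejection region concrete, the sketch below computes the critical value of the test statistic for the three significance levels mentioned above. It assumes the test statistic follows a standard normal (Z) distribution, which is only one of several possible cases; the function name is hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Critical Z-value beyond which the null hypothesis is rejected."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha equally between the two tails
    return NormalDist().inv_cdf(1 - alpha / tails)

for alpha in (0.10, 0.05, 0.01):
    one_tailed = critical_z(alpha, tails=1)
    two_tailed = critical_z(alpha, tails=2)
    print(alpha, round(one_tailed, 3), round(two_tailed, 3))
```

For a one-tailed test with an upper-tail alternative, the null hypothesis would be rejected when the observed Z exceeds the one-tailed value; for a two-tailed test, when its magnitude exceeds the two-tailed value.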
p-value
An alternative approach avoids the use of the significance level and the rejection region and
instead simply reports how significant the sample evidence is. We can do this by calculating
the p-value of the hypothesis test. The p-value of a sample is the probability, calculated
assuming the null hypothesis is true, of seeing a sample with at least as much evidence in
favour of the alternative hypothesis as the sample actually observed. The smaller the p-value,
the more evidence there is in favour of the alternative hypothesis.

Example:
Suppose the manufacturer of the light bulbs in the example above was interested in testing
whether the lifetime of the bulbs was more than 8,000 hours. The manufacturer might then
take a sample of these bulbs and find that the average life of the sample of bulbs was 8,750
hours. The p-value, in this case, is the probability that we could get a sample mean of 8,750
hours or more, assuming that the true average life of the bulbs was 8,000 hours.

In general, smaller p-values indicate more evidence in support of the alternative hypothesis. If a
p-value is sufficiently small, almost any decision-maker will conclude that rejecting the null
hypothesis is the more reasonable decision.

How small is a 'small' p-value?

This is largely a matter of semantics, but as a guide:

• A p-value less than 0.01 provides convincing evidence in favour of the alternative
hypothesis.
• A p-value between 0.01 and 0.05 provides strong evidence in favour of the alternative
hypothesis.
• A p-value between 0.05 and 0.10 is in a grey area.
• A p-value greater than 0.10 is interpreted as weak or no evidence in support of the
alternative.
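
The guideline above can be written as a simple lookup. This is a sketch only: the function name and the returned labels are hypothetical, and the treatment of values that fall exactly on a boundary (for example, exactly 0.05) is an assumption, since the text does not specify it.

```python
def describe_evidence(p_value):
    """Map a p-value to the rough strength-of-evidence bands in the text."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < 0.01:
        return "convincing"
    if p_value < 0.05:      # boundary handling is an assumption
        return "strong"
    if p_value < 0.10:
        return "grey area"
    return "weak or none"

for p in (0.004, 0.03, 0.07, 0.25):
    print(p, describe_evidence(p))
```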

Test statistic

To compute the p-value, we use a test statistic computed from the sample data.

The test statistic is a random variable whose distribution is well known so that certain desired
probabilities can be calculated.

For instance, consider the case where we are testing a statement about µ using a sample size of
at least 30 (n ≥ 30) and σ is known.

Sampling theory tells us that the quantity

(X̄ − µ) / (σ / √n)

will follow a Z-distribution.

Thus, the test statistic in this case will be

Z = (X̄ − µ0) / (σ / √n)

where X̄ is the sample mean and µ0 is the value of µ specified in the null hypothesis.

Now, consider the case where σ is unknown and the population is normal. Then, the test
statistic would be

t = (X̄ − µ0) / (S / √n)

where the sample standard deviation, S, has been substituted in place of the population
standard deviation, σ. This statistic follows a T-distribution with n − 1 degrees of freedom.
Fortunately, p-values for a variety of statistical tests are easily calculated using Excel and
other software.
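
As an illustration of the known-σ case, the sketch below computes the Z test statistic and the upper-tail p-value for the light bulb example. The sample mean of 8,750 hours and the hypothesised mean of 8,000 hours come from the text; the population standard deviation (3,000 hours) and sample size (36) are hypothetical values chosen only to make the example runnable.

```python
import math
from statistics import NormalDist

def z_test_upper_tail(sample_mean, mu0, sigma, n):
    """Z statistic and p-value for H1: mu > mu0 with sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a sample mean at least this large if mu = mu0
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# sigma and n below are assumed values, not taken from the text
z, p_value = z_test_upper_tail(sample_mean=8750, mu0=8000, sigma=3000, n=36)
print(round(z, 3), round(p_value, 4))
```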


5. Power of a Test
The probability of committing a Type II error is denoted by β. That is, β is the
probability that we will fail to reject the null hypothesis in the situation where the null hypothesis
is actually false. Given that, the probability that we would correctly reject a null hypothesis that
was actually false is given by 1 − β. This probability is known as the power of a test.

Example:
Suppose that you were the marketing manager for a pharmaceutical company that has
developed a new drug, and your company was making claims that the drug was safe. In
medical trials to test this claim, the null hypothesis would be that the drug was safe.

If we were to commit a Type II error during these tests, we would be failing to reject a null
hypothesis that was actually false. In other words, we would be saying that the drug was safe
when in fact it was not safe. In this case, we would be committing a serious error.

In this case, we would like to minimise the probability of making a Type II error. To do that,
we would like to increase the power of the test. One popular method of increasing the power
of the test is to increase the sample size.
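
The effect of sample size on power can be illustrated with a simple calculation. The sketch below is not tied to the drug example: it assumes a one-tailed Z-test of H0: µ ≤ µ0 against H1: µ > µ0 with a known σ, and all of the numbers (µ0 = 100, a true mean of 105, σ = 20) are hypothetical.

```python
import math
from statistics import NormalDist

def power_upper_tail_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tail Z-test when the true mean is mu1."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    # How far the true mean sits from mu0, in standard errors
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))
    # beta = probability of failing to reject H0 when mu = mu1
    beta = std_normal.cdf(z_alpha - shift)
    return 1 - beta

for n in (25, 50, 100):
    print(n, round(power_upper_tail_z(mu0=100, mu1=105, sigma=20, n=n), 3))
```

Under these assumptions the power rises as n increases, which is the behaviour described above.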

6. Summary
Here is a quick recap of what we have learnt so far:

• A null hypothesis is a statement about a population parameter that can be tested.
• The logical opposite of the null hypothesis is the alternative hypothesis.
• Depending on the way the null hypothesis is set up, the rejection region occurs on either
one or both tails of the probability distribution of the test statistic.
• Correspondingly, the test is either a one-tailed test or a two-tailed test.
• In any test, there will be chances of Type I and Type II errors occurring.


7. Glossary
Confidence level: In interval estimation, it is the probability that the value being
estimated lies within the estimated interval. In hypothesis testing, it is the complement of
the level of significance.

Test statistic: In hypothesis testing, a statistic computed from sample data which follows a
well-known distribution so that probabilities such as the p-value can be calculated.

Segment: Hypothesis Testing
Topic: Hypothesis Tests for a Population Mean and
Population Proportion
Hypothesis Tests for a Population Mean and Population Proportion

Table of Contents

1. Testing a Population Mean
2. Testing a Population Proportion
3. Summary


Introduction

There are many instances in business where we need to make a decision about a single
population. For example, a claim might be made that
• The average number of complaints made by customers has increased above the usual
level.
• A company's product does not weigh the same amount as stated on the packaging.
• The average time spent online by internet users has increased. A sample of internet users
could be taken, and a hypothesis test conducted.
In these cases, inferences can be made about a population parameter of interest.

When our data is qualitative in nature, the population parameter of interest would be the
population proportion. The point estimator of this parameter is the sample proportion, which
under certain conditions has an approximately normal sampling distribution.

In this topic, we will explore the concept of testing a population mean and a population
proportion.

Learning Objectives

At the end of this topic, you will be able to:

• test a mean with an unknown population standard deviation
• illustrate the hypothesis test for a population proportion.


1. Testing a Population Mean

In hypothesis tests of a mean, we are testing to see if our sample data is consistent with our
hypothesised mean. In order to do this, we need to consider the sampling distribution of the
sample mean. If we know the population standard deviation and our sample size is large
enough, then the sampling distribution of the sample mean will follow a normal distribution.

In practice though, we rarely know the value of the population standard deviation, and as in
estimation, if σ is unknown, we use s as the replacement. Our sampling distribution will then
follow a T-distribution with n - 1 degrees of freedom.

Example:
Suppose an entrepreneur was considering a location for an electricity-producing windmill
farm. To be successful, the average wind speed at that particular location should be greater
than 32 km/h. In order to test the location, the entrepreneur arranged for 50 wind speed
measurements to be taken at the location on randomly selected days over a period of time.
The average wind speed from the sample of 50 days was 38 km/h with a sample standard
deviation of 25 km/h.
Should the entrepreneur go ahead with the windmill farm?

In this case, we are testing a mean with an unknown population standard deviation, so the
sampling distribution will follow a T-distribution. The null and alternative hypotheses that the
entrepreneur would be interested in testing are:
H0: µ ≤ 32
H1: µ > 32
Assuming the null hypothesis to be true, the sampling distribution is as shown in the following
figure.


Fig. 1: Sampling Distribution of Mean When Null Hypothesis is True

The null hypothesis is that the population mean is less than or equal to 32, and the alternative
hypothesis is that the population mean is greater than 32.

The p-value is the probability that we could have obtained our sample result as extreme or more
extreme if the null hypothesis is true. In this case, it is the probability in the tail to the right of
our observed value of 38 km/h since our test is a one-tailed test.

In this case, the test statistic can be calculated as:

t = (x̄ − µ0) / (s / √n) = (38 − 32) / (25 / √50) = 1.697

The p-value can then be found using Excel's TDIST function:

= TDIST (1.697, 49, 1)
= 0.04802 = 4.8%.

This is a reasonably small p-value. Our decision in this case would be that there seems to be
sufficient evidence to reject the null hypothesis and conclude that the average wind speed is
greater than 32 km/h at this location.
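
The windmill calculation can be reproduced without Excel. The Python standard library has no T-distribution function, so the sketch below approximates the upper-tail probability by integrating the T-distribution density numerically with Simpson's rule. This is a stand-in for TDIST rather than how Excel computes it, and the function names are hypothetical.

```python
import math

def t_pdf(u, df):
    """Density of the T-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(u * u / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the distribution lies above zero; subtract the area on [0, t]
    return 0.5 - total * h / 3

# Windmill example from the text: n = 50, mean 38 km/h, s = 25 km/h, mu0 = 32
n, x_bar, s, mu0 = 50, 38.0, 25.0, 32.0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)
print(round(t_stat, 3), round(p_value, 3))
```

The result should agree closely with the TDIST value of about 4.8% reported above.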


In other cases, we may be interested in performing a two-tailed test. In these cases, the p-value
is the probability in both tails.

2. Testing a Population Proportion

As we already know, as long as our sample size is reasonably large such that np and n(1 - p) are
greater than or equal to 5, the sampling distribution of the sample proportion is approximately
normally distributed with a mean of p and a standard deviation given by the square root of
p(1 - p)/n. Note that the term n refers to the sample size while the term p refers to the population
proportion.

Similar to hypothesis tests for a population mean, hypothesis tests can also be performed for
population proportions. For proportions, we can calculate the sample proportion from our
sample data and then use this to calculate our p-value for the test.

Following is an example for performing hypothesis tests for population proportions.

Example:
A company is considering introducing a new product and has determined that it needs to
capture a market share of 10% to break even. The product will be profitable for the company
if it can capture more than 10% of the market.

Suppose that the company surveyed 400 potential customers to ask them if they would buy
the product. Suppose also that they received positive responses from 52 of these people, who
said they would purchase the product.

Given this data, is there enough evidence to suggest that the company should proceed with
the launch of the product?

Solution
The parameter of interest, in this case, is the population proportion of potential customers
who would buy the product. The company has decided that they need to capture more than
10% of the market, so in this case, the null and alternative hypotheses would be:

H0: p ≤ 0.1
H1: p > 0.1
The sample size was 400 with np = 40 and n (1 - p) = 360. Since np and n (1 - p) are both
greater than 5, the sampling distribution of the sample proportion p̂ will be approximately
normal with:
E(p̂) = p = 0.1
and
σ(p̂) = √(p(1 − p) / n) = √(0.1 × 0.9 / 400) = 0.015

The sample proportion in this case is 52/400 = 0.13. The following figure shows the sampling
distribution assuming the null hypothesis to be true and p-value for the test.

Fig. 2: Sampling Distribution of Proportion When Null Hypothesis is True

The p-value is the probability that we could have obtained our sample result as extreme or more
extreme if the null hypothesis is true. In this case, it is the probability in the tail to the right of our
observed proportion of 13% since our test is a one tailed test.

In this case, the test statistic can be calculated as:

Z = (p̂ − p0) / √(p0(1 − p0) / n) = (0.13 − 0.1) / √(0.1 × 0.9 / 400) = 0.03 / 0.015 = 2

where p̂ is the sample proportion, p0 is the hypothesised population proportion and n is the
sample size.

The p-value can then be found using Excel® NORMDIST or NORMSDIST functions:
= 1 - NORMDIST (0.13, 0.1, 0.015, 1)
= 0.02275 = 2.275%

This is a reasonably small p-value. Our decision, in this case, would be to reject the null
hypothesis and conclude that the true population proportion of consumers who would buy
the product is greater than 10%.

As for means, we may be interested in performing a two-tailed test for proportions. In these
cases, the p-value is calculated as the probability in both tails.
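
The market share example can be checked with a short Python sketch that uses NormalDist from the standard library in place of NORMDIST. The function name and the two_tailed flag are hypothetical; the one-tailed branch assumes an upper-tail alternative (H1: p > p0), as in the example.

```python
import math
from statistics import NormalDist

def proportion_z_test(x, n, p0, two_tailed=False):
    """Z statistic and p-value for a test of a population proportion."""
    p_hat = x / n
    # The standard error uses the hypothesised p0, not the sample proportion
    std_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / std_error
    if two_tailed:
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    else:
        p_value = 1 - NormalDist().cdf(z)  # upper-tail alternative
    return z, p_value

# Example from the text: 52 positive responses out of 400, p0 = 0.1
z, p_value = proportion_z_test(x=52, n=400, p0=0.1)
print(round(z, 3), round(p_value, 5))
```

Calling the function with two_tailed=True doubles the tail probability, which corresponds to the two-tailed case described above.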


3. Summary

Here is a quick recap of what we have learnt so far:

• Hypothesis tests for a population mean can be either one-tailed or two-tailed.
• If the population standard deviation is known and the sample size n ≥ 30, then the test
statistic will follow a normal distribution.
• If the population standard deviation is not known, then the sample standard deviation must
be used. If we assume the population is normally distributed, the resulting test statistic will
follow a T-distribution with n − 1 degrees of freedom.
• Hypothesis tests for a population proportion can be either one-tailed or two-tailed.
• If np and n (1 − p) are both greater than 5, then the sampling distribution will be
approximately normally distributed.

Segment: Hypothesis Testing
Topic: Testing Differences between Population
Means
Testing Differences between Population Means

Table of Contents

1. Comparing Population Averages
2. Paired Difference Method
3. Independent Samples Method
4. Using Excel for Testing Differences
5. Summary
6. Answers


Introduction

We quite often have to test for differences between two population means. Depending on the
circumstances, the samples can be either independent or paired.

In this topic, we will explore the concept of testing for differences between two population
means.

Learning Objectives

At the end of this topic, you will be able to:

• distinguish between the paired difference method and the independent samples method
used when performing hypothesis tests between two means
• use Microsoft Excel® to perform hypothesis tests for the differences between two means.


1. Comparing Population Averages

We saw the case where the light bulb manufacturer claimed that the average life of its light
bulbs was at least 8,000 hours. What if the claim was that their bulbs last longer than another
brand XYZ?

Now it is irrelevant what the average value is; what matters is the difference between the
averages of two brands.

Using the sampling theory we have investigated so far, we can compare two population
averages in two different ways:

1. Paired difference method
2. Independent samples method

As you may guess, these methods are applicable under certain conditions. In this topic, we
shall explore the details of the two methods and their applicability.

2. Paired Difference Method

This method is applicable when the observations come in matched pairs, so that each item in
one sample corresponds to exactly one item in the other. For example, if we wish to assess the
effectiveness of a training program on employees, we can measure the performance of each
employee before and after the training. It makes sense to compare the two measurements for
each employee and take the difference, which can be attributed to the effectiveness of the
training. The t-test for the difference between means from paired samples is then used. The
advantage of this method is that the effects of extraneous variables are controlled, so the
measured differences are less prone to error. As a result, the test is more reliable.

The following notation is used for the comparison of two populations:

• $D_0$ is the claimed difference in the population means.
• $\bar{D}$ is the average of the sample differences.
• $s_D$ is the standard deviation of the sample differences.
• $n$ is the number of paired observations (the sample size).

The test statistic for the paired observation test is:

$$t = \frac{\bar{D} - D_0}{s_D / \sqrt{n}}$$

When the null hypothesis $H_0: \mu_1 - \mu_2 = D_0$ is true, and the two populations are
normally distributed, the quantity $t$ will follow a t-distribution with (n – 1) degrees of
freedom. As a result, we conduct a t-test, meaning the test statistic is t.
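
The paired calculation can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `paired_t_statistic` and the before/after scores are hypothetical and are not taken from the course material.

```python
import math
import statistics


def paired_t_statistic(before, after, d0=0.0):
    """Return (t, degrees of freedom) for the paired difference test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]  # one difference per pair
    n = len(diffs)
    d_bar = statistics.mean(diffs)   # average of the sample differences
    s_d = statistics.stdev(diffs)    # sample standard deviation of the differences
    t = (d_bar - d0) / (s_d / math.sqrt(n))
    return t, n - 1


# Hypothetical performance scores for five employees (illustrative only).
before = [60, 72, 65, 70, 68]
after = [66, 75, 70, 71, 74]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The resulting t value would then be compared with the t-distribution with n – 1 degrees of freedom, which is what Excel® does when it reports a p-value.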

3. Independent Samples Method

If a paired difference test is possible in a situation, then it should be the preferred method
since it is more reliable. But pairing the observations is not always possible.

Suppose we wish to compare the average number of sick days taken by employees in two
departments of a large company over the period of a year.

In this case, we need to take one random sample from the first department and another
independent sample from the other department. We then have to take the difference between
the averages of the two samples.

The random variable of interest is the difference between the two sample means, denoted by
$\bar{X}_1 - \bar{X}_2$.

Using sampling theory, we can infer the following:

1. The expected value of the difference is the difference between the two population means:

$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2$$

2. Since $\bar{X}_1$ and $\bar{X}_2$ are independent random variables, the variance of their
difference is the sum of their individual variances. In other words:

$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

3. If both populations are normally distributed, then $\bar{X}_1 - \bar{X}_2$ will also be
normally distributed. As a result, the quantity calculated from the formula below will follow
a Z-distribution.

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

4. Even if the two populations are not normally distributed, as $n_1$ and $n_2$ increase, the
distribution of $\bar{X}_1 - \bar{X}_2$ will approach a normal distribution. Practically, if
$n_1$ and $n_2$ are both at least 30, we can assume $\bar{X}_1 - \bar{X}_2$ is normally
distributed.

The above results enable us to conduct a Z-test when either the populations are normally
distributed, or the samples are large.

But there is a twist. The computation of Z calls for $\sigma_1^2$ and $\sigma_2^2$. Most of the
time, they are not known.

We then have to resort to a t-test. In addition, there are two cases:

• case 1: $\sigma_1^2 = \sigma_2^2$ (the variances are equal)
• case 2: $\sigma_1^2 \neq \sigma_2^2$ (the variances are not equal)

The formulas for the sampling distributions are slightly different for the two cases. While we
will be using Excel® to perform these calculations for us, it is instructive to be aware of the
calculations involved.


The calculation for the equal-variance case is outlined below.

If $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal, then the quantity:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$

follows a t-distribution with $(n_1 + n_2 - 2)$ degrees of freedom, where $s_p^2$ is the pooled
estimate of the common variance.


If the variances of the two populations can be assumed to be equal, then the equal variance
method is preferred. In cases where the variances are clearly different, then the unequal
variance case will need to be implemented. It should also be noted that there is a hypothesis
test that can be performed to test whether the two variances are equal.
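
The two cases can be compared side by side in code. The sketch below uses the pooled formula given above for the equal-variance case. For the unequal-variance case it assumes the standard textbook (Welch) form of the statistic and its approximate degrees of freedom, which is not reproduced in this material, so treat that part as an assumption. The function names and sample data are hypothetical.

```python
import math
import statistics


def pooled_t(x1, x2, d0=0.0):
    """Equal-variance case: t statistic using the pooled variance s_p^2."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    diff = statistics.mean(x1) - statistics.mean(x2)
    t = (diff - d0) / math.sqrt(sp2 / n1 + sp2 / n2)
    return t, n1 + n2 - 2


def welch_t(x1, x2, d0=0.0):
    """Unequal-variance case (assumed Welch form): t and approximate df."""
    n1, n2 = len(x1), len(x2)
    a = statistics.variance(x1) / n1
    b = statistics.variance(x2) / n2
    diff = statistics.mean(x1) - statistics.mean(x2)
    t = (diff - d0) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical sick-day counts for two small samples (illustrative only).
dept_a = [2, 4, 6, 8]
dept_b = [1, 2, 3, 4]
print("pooled :", pooled_t(dept_a, dept_b))
print("unequal:", welch_t(dept_a, dept_b))
```

With equal sample sizes the two t values coincide and only the degrees of freedom differ, which is why the choice between the two cases matters most when the sample sizes and variances are both unequal.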

4. Using Excel for Testing Differences

Suppose that we were interested in determining whether there was any difference between
the number of sick days used by employees in two departments in a large firm over the period
of a year. The following table shows the results of the random sample of 20 employees taken
from each of the two departments.


Table 1: Random Sample of 20 Employees

Since the samples are independent, we need to perform an independent samples test for the
difference between the two means.

Before we can do the test though, we need to determine whether the variances in the two
groups are the same. One way to do this might be to generate summary statistics for each
group and make an informed decision. Alternatively, Excel® provides a statistical test for this
comparison known as the F-test. This test can be found under Data/Data Analysis/F-test: Two-
Sample for Variances. The following table shows the Excel® output from this test.

Table 2: F-test: Two-Sample for Variances


The p-value, in this case, is slightly less than 0.05, which might lead us to reject the null
hypothesis of equal variances at the 5% significance level and conclude that there is
sufficient evidence that the two variances differ.
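
The statistic behind this comparison is simply the ratio of the two sample variances. The sketch below computes only that ratio and its degrees of freedom; converting it to a p-value needs the F-distribution, which Excel® supplies and the Python standard library does not. The function name and data are hypothetical.

```python
import statistics


def variance_ratio(x1, x2):
    """Return (F, df1, df2), where F is the ratio of the sample variances."""
    f = statistics.variance(x1) / statistics.variance(x2)
    return f, len(x1) - 1, len(x2) - 1


# Hypothetical samples (illustrative only).
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 4, 6, 8, 10]
f, df1, df2 = variance_ratio(group_1, group_2)
print(f"F = {f:.3f} with ({df1}, {df2}) degrees of freedom")
```

A ratio far from 1 in either direction suggests that the two variances differ.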

If we were to conclude this, then the appropriate test for the difference between the two
means would be the t-test that assumes unequal variances. This test can be found in Excel®
under Data/Data Analysis/t-Test: Two-Sample Assuming Unequal Variances.

The Excel® output for this test is shown below:

Table 3: T-test: Two-Sample Assuming Unequal Variances

If we were performing a two-tailed analysis, then, as can be seen from the Excel® output, the
p-value is 0.172. This p-value is larger than any conventional significance level (such as
0.05), so we would not reject the null hypothesis. We would conclude that there is insufficient
evidence of a difference between the average number of sick days used by employees in the
two departments.

Below is an exercise to practice what you have learnt on testing means.

Exercise: Testing Differences in Population Means


The owner of a city car park suspects that the person she hired to run the car park is stealing
some money. The receipts as provided by the employee indicate that the average number of
cars parked is 125 per day and that, on average, each car is parked for 3.5 hours. In order to
determine whether the employee is stealing, the owner watches the car park on five
randomly chosen days. On those days the number of cars parked is 120, 130, 124, 127 and 128.
For the total number of cars that the owner observed during the five days, the mean and the
standard deviation of the time spent at the car park were 3.6 and 0.4 hours respectively.

Question 1: Which of the following would best represent the hypothesis test the owner
would be interested in testing?
1. H0 : µ = 125, H1 : µ ≠ 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
2. H0 : µ ≤ 125, H1 : µ > 125 and H0 : µ ≤ 3.5, H1 : µ > 3.5
3. H0 : µ ≥ 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
4. H0 : µ = 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5

Question 2: Suppose that the p-values obtained for the tests were 0.002 and 0.6
respectively. The owner concludes that the employee is stealing. Is this statement true or
false?
1. True
2. False


5. Summary

• Population averages can be compared in two ways, namely, the paired difference method
and the independent samples method.

• If the variances of the two populations can be assumed to be equal, then the equal
variance method is preferred.

• If the variances are clearly different, then the unequal variance case will need to be
implemented.

• A hypothesis test (the F-test) can be performed to test whether the two variances are equal.

• Hypothesis tests for the differences between two means can be performed using
Excel®.

6. Answers
Exercise: Testing Differences in Population Means

Question 1: The correct answer is option 2, H0: µ ≤ 125, H1: µ > 125 and H0: µ ≤ 3.5, H1: µ >
3.5

Question 2: The correct answer is option 1, True.

Segment: Regression Analysis
Topic: Simple Linear Regression – Part I
Simple Linear Regression

Table of Contents

1. Simple Linear Regression Model
2. Least Squares Method
3. Summary
4. Glossary
5. Answers


Introduction
Regression analysis is one of the most important and widely used statistical techniques. It has
many applications in business and economics.

We saw that there are two types of regression. In this topic, we will discuss simple linear
regression in detail. In simple linear regression, we model the relationship between two
variables.

Learning Objectives
At the end of this topic, you will be able to:
• comprehend the simple linear regression model and its significance
• use the least squares method to obtain the regression equation.


1. Simple Linear Regression Model


In many business situations, we would like to investigate and quantify the relationship between
two or more variables. For example, we may be interested in determining how expenditure on
advertising affects sales for a particular company.

If so, it would be useful to quantify their relationship so that we can predict one of them by
knowing the value of the others, or control one of them by controlling the others.

A simple type of relationship between two variables x and y is the linear relationship where the
graph of the relationship is a straight line as shown in the following figure.

Fig. 1: Simple Linear Relationship – Straight-line Graph


The figure given above shows a straight-line graph. The value of y at which the line crosses the
y-axis (that is, the value of y when x = 0) is called the intercept, while the steepness of the line,
measured as the change in y for a one-unit increase in x, is called the slope.

The objective in simple linear regression is to find the relationship between two variables x and
y.

The technique involves developing a mathematical model to describe the relationship between
the variable we are trying to predict and the variable that we suspect is influencing this variable.

The variable we are trying to predict is known as the dependent variable (denoted by y) and the
variable that we would like to use to explain the dependent variable is known as the independent
variable (denoted by x).


In our example, the expenditure on advertising would be the independent variable and sales
would be the dependent variable.

The first step in developing the mathematical relationship is to define the population model. In
the case of simple linear regression, where our objective is to explain the relationship between
the independent variable and the dependent variable, this model takes the form of:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope and $\varepsilon$ is the error term.

The error term (ε) is included in the population model to account for all variables and
measurement uncertainties that we may not have included in our model but may influence the
dependent variable.

2. Least Squares Method


Suppose we suspect that a linear relationship exists between x and y and wish to quantify the
relationship in the form of an equation.

We could collect sample data for the two variables and plot them on a graph. The following
figure shows one such plot.

Fig. 2: Sample Scatter Plot


The figure above shows a sample scatter plot, with a series of points scattered almost in a
straight line from the 0 point.

The graph is called a scatter plot since it shows a scatter of points.

The scatter plot in the figure suggests that there is a linear relationship between the two
variables because the points seem to fall along a straight line.

To quantify the relationship, we fit a straight line to the scatter of points. This straight line is
called the regression line. The regression line has the form:

$$\hat{y} = b_0 + b_1 x$$

The regression coefficients $b_0$ and $b_1$ are estimates of the true population regression
coefficients $\beta_0$ and $\beta_1$.

The technique that we use to fit a straight line to the data is known as the least squares method.
The resulting line is known as the line of best fit.

The line of best fit will not necessarily pass through all of our sample data points. In most cases,
there will be differences between our data points and the line of best fit as shown in the
following figure.


Fig. 3: Residuals

This sample chart (Fig. 3) shows the difference, or sample error, between a data point on a chart,
and the line of best fit.

The differences between the data points and the line of best fit are known as residuals and are
denoted by $e_i$.

The residuals are defined as:

$$e_i = y_i - \hat{y}_i$$

where $\hat{y}_i$ is the value predicted by the line of best fit for the i-th data point.

The least squares method involves minimising the sum of squares of error (SSE).

The sum of squares of error is defined as:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The calculations required to determine the line of best fit are easily performed using statistical
software such as Microsoft Excel®.
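
For readers who want to see what the software does, the least squares estimates can be computed directly. The sketch below assumes the standard closed-form results (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations, and intercept chosen so that the line passes through the point of means), which this topic does not derive. The function names and the five data points are hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) for the line of best fit y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def sum_of_squares_of_error(x, y, b0, b1):
    """Sum of the squared residuals e_i = y_i - y_hat_i."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
sse = sum_of_squares_of_error(x, y, b0, b1)
print(f"y_hat = {b0:.2f} + {b1:.2f} x, SSE = {sse:.2f}")
```

Any other straight line drawn through the same points would give a larger SSE than the one returned here.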

As an example, suppose we were interested in determining the relationship between advertising
and sales for a particular company and had collected annual sales and annual expenditure on
advertising for the past five years as follows:


Table 1: Annual Sales and Annual Expenditure on Advertising for the Past Five Years

If we suspect that there is a linear relationship between advertising and sales, then we can
perform a regression analysis on this data. To do this using Microsoft Excel®, we would select
Data/Data Analysis/Regression.

The regression output for this particular example using Microsoft Excel® is shown below.

Fig. 4: Regression Output


The output contains a number of calculations that can be used to determine the line of best fit
and to assess how well the line fits the data. We will explore this output in the following sections.


Below is an exercise to practice what you know about outliers.

Exercise: Simple Linear Regression


Read the given information before attempting the questions below.

Information:
A researcher was interested in determining whether there was a relationship between age and
salary for a group of CEOs.
Data was collected for 25 CEOs and charted as shown in the scatter plot below:

Fig. 5: Scatter Plot for 25 CEOs' Age and Salary

Question 1: What effect does the highest paid CEO have on the slope of the line of best fit?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit

Question 2: What would happen to the slope of the line of best fit if the highest paid CEO was
actually 50 years old instead of 70?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit


3. Summary
Here is a quick recap of what we have learnt so far:

• Simple linear regression is a technique for estimating the straight line relationship between
two variables.
• The most popular approach to regression is the least squares method, which minimises the
SSE.

4. Glossary
Scatter plot       A graph of paired sample data in which each observation is plotted
                   as a point (x, y).
Line of best fit   The straight line fitted to the sample data by the least squares
                   method, that is, the line that minimises the sum of squares of
                   error (SSE).

5. Answers
Exercise: Simple Linear Regression

Question 1: Correct answer is option B - Increases the slope of the line of best fit.
Question 2: Correct answer is option C - Decreases the slope of the line of best fit.

Segment: Regression Analysis
Topic: Simple Linear Regression – Part II
Simple Linear Regression

Table of Contents

1. Interpreting Coefficients
2. Assessing the Model
3. Prediction
4. Summary


Introduction

The relationship between the dependent and independent variables can be represented
mathematically by means of the simple linear regression model. Having seen how to obtain the
regression output from Microsoft Excel®, we will now focus on the evaluation of the regression
model. We will study how to interpret the coefficients, assess the model, and predict the value
of the dependent variable for a given value of the independent variable.

Learning Objectives

At the end of this topic, you will be able to:


• evaluate the fit of a regression model
• use the regression model for prediction.


1. Interpreting Coefficients

The line of best fit can be determined from the regression output. The Excel output shown below
is the example discussed in Part I.

Fig. 1: Regression Output

The resulting regression line can then be expressed as:

$$\hat{y} = -49.4 + 34.5x$$

where x is the advertising expenditure and $\hat{y}$ is the predicted sales, both in thousands
of US dollars.

The coefficient b0 in this example is -49.4. This is the y-intercept, that is, the predicted value
of y when our independent variable x is 0. Read literally, it says that when we have no
advertising (i.e., x = 0), we will have negative sales of US$49,400. Clearly this is not possible,
and the intercept has no meaningful interpretation in this case. We need to be particularly
careful when predicting outside our range of data. In this case, we did not have any data points
where advertising levels were around US$0.

The slope coefficient b1 in this example is 34.5. We can interpret this as follows: for each
additional thousand dollars of advertising, sales will on average increase by US$34,500.

2. Assessing the Model

The output from the least squares method produces the line of best fit; however, it is important
to assess how well the regression line fits the data. If the fit of the regression line is poor, we
may need to reconsider our model.

In order to evaluate the model, we can assess three items:

1. The standard error of estimate


2. A hypothesis test of the slope
3. The coefficient of determination

Read below to find out more about each of these. Note that all three items make use of the
Microsoft Excel® output example.

Fig. 2: Regression Output


Evaluating the Regression Model

Standard error of estimate


The first item, standard error of estimate is included in the top left-hand section of the
Microsoft Excel® output example.

It is essentially the standard deviation of the residuals and is a measure of how far on average
our data points are from the line of best fit.

In general, the standard error of estimate indicates the level of accuracy of predictions that
can be made from the regression equation. The smaller the standard error of estimate, the
more accurate the predictions tend to be. Comparing the standard error of estimate to the
average value of the dependent variable data is one relative measure that can be used as a
guide to assessing the model.

In this case, the standard error of estimate is 79.749 which when compared to the average
value of our y variable data of 744 is quite reasonable.
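
As a rough sketch of where this number comes from, the standard error of estimate can be computed from the residuals. The code assumes the usual divisor of n – 2 for simple linear regression, which the text above does not print, and the observed and fitted values are hypothetical. The last two lines reproduce the relative comparison made in the paragraph above.

```python
import math


def standard_error_of_estimate(y, y_hat):
    """Square root of SSE / (n - 2) for a simple linear regression."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / (n - 2))


# Hypothetical observed values and fitted values (illustrative only).
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
s_e = standard_error_of_estimate(y, y_hat)
print(f"standard error of estimate = {s_e:.3f}")

# Relative size of the standard error quoted in the text (79.749 against 744).
relative_error = 79.749 / 744
print(f"standard error as a share of the mean = {relative_error:.1%}")
```

In the course example the standard error is roughly a tenth of the average sales value, which is the sense in which it is described as quite reasonable.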

Hypothesis test of the slope


The second item used to evaluate the usefulness of a model is the hypothesis test of the slope
of the sample regression line. The hypotheses in this case are:
H0 : β1 = 0
H1 : β1 ≠ 0
The null hypothesis indicates that there is no linear relationship between the variables.

The p-value for this test is given in the Microsoft Excel® output example. In this case, it is
0.0038 or 0.38%. Since this value is smaller than the usual significance level of α = 0.05, we
can reject the null hypothesis and conclude that the slope is not 0. In other words, there seems
to be a significant relationship between sales and advertising.


It should be noted that the p-value reported in the Microsoft Excel® output is always for a
two-tailed test. This value would need to be halved if a one-tailed test is required (provided
the sign of the estimated slope agrees with the alternative hypothesis).

Coefficient of determination (R²)


The value of the coefficient of determination, or R² value, is between 0 and 1. This value can be
interpreted as the proportion or percentage of the variation in the dependent variable that
can be explained by the independent variable.

The value for the coefficient of determination can be found in the top left-hand section of the
Microsoft Excel® output example, where it is referred to as 'R Square'.

In this example, the R² value is 0.957 or 95.7%. This can be interpreted as advertising levels
explaining 95.7% of the variation in sales. In general, the higher the R² value, the more
confidence we can have in our model.
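
To make the idea of "explained variation" concrete, the sketch below computes R² as 1 – SSE / SST, the definition used later in the multiple regression topic. The function name and the data are hypothetical and are not the course's sales example.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1 - sse / sst


# Hypothetical observed values and fitted values (illustrative only).
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y, y_hat)
print(f"R squared = {r2:.3f}")
```

If the fitted values matched the observations exactly, SSE would be 0 and R² would equal 1.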

3. Prediction

Another important use of regression is the prediction of y when the x value is known. The
predicted value of y is denoted by $\hat{y}$ and is calculated as:

$$\hat{y} = b_0 + b_1 x$$

By substituting a particular value for the independent variable into the sample regression
equation, we can calculate a value for the dependent variable. This estimate is a point estimate.

For the example of sales and advertising, by substituting any given value for the independent
variable advertising, the predicted value of sales can be obtained. If we substitute the
advertising value of US$30,000 (x = 30, since advertising is measured in thousands of dollars)
in the equation, the predicted sales will be:

$$\hat{y} = -49.4 + 34.5 \times 30 = 985.6 \quad (\text{approx. } 986)$$


Therefore, we can say that for an advertising expenditure of US$30,000, the estimated sales are
approximately US$986,000.
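
The same substitution can be written as a small function. The coefficients are the ones quoted in this topic; the function name `predict_sales` is hypothetical. Both advertising and sales are in thousands of US dollars.

```python
B0 = -49.4   # intercept from the regression output
B1 = 34.5    # slope from the regression output


def predict_sales(advertising_thousands):
    """Point estimate of sales (US$ '000) for a given advertising spend (US$ '000)."""
    return B0 + B1 * advertising_thousands


estimate = predict_sales(30)
print(f"predicted sales = US${estimate:.1f} thousand")
```

As noted earlier, predictions for advertising levels far outside the observed range of data, such as x = 0, should be treated with caution.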

4. Summary

Here is a quick recap of what we have learnt so far:

• The standard error of estimate, a formal hypothesis test of the slope and the coefficient of
determination R² can be used to evaluate the fit of the regression model.
• The standard error of estimate indicates the level of accuracy of predictions that can be
made from the regression equation.
• The value of the coefficient of determination or R² value is between 0 and 1.
• The regression model can be used for prediction.

Segment: Regression Analysis
Topic: Multiple Regression
Multiple Regression

Table of Contents

1. The Multiple Regression Model
2. Interpreting the Coefficients
3. Assessing the Model
   3.1 Test for Significance of Regression
   3.2 Test for Individual Regression Coefficients (t-tests)
   3.3 R² and Adjusted R²
4. Summary


Introduction

In regression analysis, there may often be several independent variables that contain
information about the variable we are trying to predict or understand.

The multiple regression model allows us to consider the relationship of a particular variable with
a set of independent variables.

Learning Objectives
At the end of this topic, you will be able to:
• explain the concept of multiple regression analysis
• assess the utility of the multiple regression model.


1. The Multiple Regression Model


Often the dependent variable y depends on more than one independent variable. It is then
necessary to relate y to all the independent variables in one equation. For instance, the sales of
a product may be influenced not only by the marketing budget, but also by its price, its quality,
the economy, or the degree of competition. Multiple regression is an analytical method that
relates a dependent variable to more than one independent variable.

The objective is to get an equation of the form:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

where k is the number of independent variables $x_1, x_2, \dots, x_k$ that affect y.

Most of the concepts in multiple regression are similar to those in simple linear regression. For
instance, the equation that best fits the data is taken to be the one that minimises the sum of
the squares of the errors (SSE).

Consider the hypothetical example provided in the table below, which relates to the sales of a
particular company. Sales can be treated as the dependent variable, whereas the price of the
product and the advertising expenditure are the independent variables.

To determine the relationship among the variables price, advertisement, and sales for a
particular company, we collected the data for the three variables for the past ten years as
follows:


Table 1: Sales of a Company

Sales          Price     Advertisement
(US$ '000)     (US$)     (US$ '000)
88             117       33
90             111       37
100            120       34
62             93        16
72             97        28
94             112       38
74             88        33
73             96        32
76             99        35
93             116       38

We assume that there is a linear relationship between the three variables and can use multiple
linear regression to understand the relationship. Here, price and advertisement are independent
variables that can be used to predict the dependent variable sales.

The regression output for this particular example using Microsoft Excel® is shown below.

Fig. 1: Regression Output


2. Interpreting the Coefficients


The interpretation of the regression coefficients in the multiple regression model is similar to
the case for simple linear regression although a little extra care is required when interpreting
the coefficients.

For example, suppose our sample regression equation is in the form:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

The coefficient for the intercept, $b_0$, can be interpreted in the same way as in simple linear
regression and, in this case, represents the value of $\hat{y}$ when $x_1$ and $x_2$ are both zero.

The coefficient of $x_1$, which is $b_1$, represents the change in $\hat{y}$ when $x_1$ increases
by one unit, assuming that the other independent variable $x_2$ is held constant.

Likewise, the coefficient of $x_2$, which is $b_2$, represents the change in $\hat{y}$ when $x_2$
increases by one unit, assuming that the other independent variable $x_1$ is held constant.

Fig. 2: Regression Output


In our example, we can write the regression equation using the Excel output as given below:

$$\hat{y} = -20.74 + 0.76 x_1 + 0.71 x_2$$

where $x_1$ is the price (US$), $x_2$ is the advertising expenditure (US$ '000) and $\hat{y}$ is
the predicted sales (US$ '000).

The coefficient b0 in this example is –20.74, which signifies the value of $\hat{y}$ when the two
independent variables x1 and x2 are both zero. Again, this value has no meaningful
interpretation, since at the very least the price cannot be zero. The slope coefficient b1 in this
example is 0.76. We can interpret this as follows: for each additional dollar of price, sales will
on average increase by US$760, with advertising held constant. Similarly, the slope coefficient
b2 in this example is 0.71: for each additional thousand dollars of advertising, sales will on
average increase by US$710, with price held constant.
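
The coefficients quoted above can be checked against the data in Table 1. The sketch below solves the least squares equations for the special case of two independent variables using sums of cross-deviations; this closed form is a standard result that the course leaves to Excel®, and the function names are hypothetical. The data are the ten rows of Table 1.

```python
sales = [88, 90, 100, 62, 72, 94, 74, 73, 76, 93]             # US$ '000
price = [117, 111, 120, 93, 97, 112, 88, 96, 99, 116]         # US$
advertising = [33, 37, 34, 16, 28, 38, 33, 32, 35, 38]        # US$ '000


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def fit_two_predictors(y, x1, x2):
    """Least squares estimates (b0, b1, b2) for y_hat = b0 + b1*x1 + b2*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(sales, price, advertising)
print(f"y_hat = {b0:.2f} + {b1:.2f} * price + {b2:.2f} * advertising")
```

For more than two independent variables the same idea applies, but the equations are normally solved with matrix routines or statistical software rather than by hand.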

3. Assessing the Model

3.1 Test for Significance of Regression

To test the utility of the regression model we can specify the following hypotheses:

H0: β1 = β2 = … = βk = 0

H1: At least one βi is not equal to zero

This test is known as the F-Test and the p-value for this test can be found in the ANOVA section
of the Microsoft Excel® output.

There are two possible outcomes. The null hypothesis for this test is either:

1. Not rejected – there is insufficient evidence that any of the independent variables is linearly
related to the dependent variable, and therefore the model has limited usefulness.
2. Rejected – this suggests that at least one of the coefficients of the independent variables is
not equal to zero and that the model does have some usefulness.

From the Excel output table, under the ANOVA section, we can reject the null hypothesis, since
Significance F is reported as 0.00 (to two decimal places), which is less than 0.05. This means
at least one of the coefficients of the independent variables is not equal to zero and we can use
the model.
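
The F statistic reported in the ANOVA section is the ratio of the explained variation per independent variable to the unexplained variation per residual degree of freedom. The sketch below assumes this standard definition, which the text does not spell out, and uses hypothetical sums of squares. Turning F into the "Significance F" p-value requires the F-distribution, which is left to Excel®.

```python
def regression_f_statistic(sse, sst, n, k):
    """F = (SSR / k) / (SSE / (n - k - 1)) for a model with k independent variables."""
    ssr = sst - sse              # variation explained by the regression
    msr = ssr / k                # mean square due to regression
    mse = sse / (n - k - 1)      # mean square error
    return msr / mse


# Hypothetical sums of squares for n = 10 observations and k = 2 variables.
f_stat = regression_f_statistic(sse=20.0, sst=100.0, n=10, k=2)
print(f"F = {f_stat:.2f}")
```

A large F value corresponds to a small Significance F, which is the case in which the null hypothesis is rejected.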

3.2 Test for Individual Regression Coefficients (t-tests)

The table below the ANOVA section can be used to understand the relationship between each
independent variable and the dependent variable. In order to check whether each individual
regression coefficient of the independent variables is significant, we can make use of the t-test
values provided in the last table of the multiple regression analysis output. Based on the
p-value, we can decide whether a particular independent variable should be kept in or removed
from the model. The p-value reveals whether there is a linear relationship between the dependent
variable and each independent variable, given the other variables in the model. The fifth column
in the table shows the p-value for the intercept and for each independent variable. Since we are
only interested in the relationship between the dependent and independent variables, the
p-value for the intercept can be ignored.

The null and alternative hypotheses can be defined as:

H0: βi = 0

H1: βi ≠ 0

where βi denotes the slope coefficient of the ith independent variable.

In our example, the p-values for both independent variables are less than 0.05. Therefore, we
can state that the variables price and advertising are significant in the regression model, which
can be used to predict sales.
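
The decision rule for the individual t-tests can be sketched as follows. The t statistic for a coefficient is the estimate divided by its standard error, and a variable is kept when its p-value is below the chosen significance level. The function names, the standard error, the p-values and the third variable store_size are all hypothetical; they are not read from the regression output.

```python
def t_statistic(coefficient, standard_error):
    """Test statistic for H0: beta_i = 0, i.e. the estimate divided by its standard error."""
    return coefficient / standard_error


def significant_variables(p_values, alpha=0.05):
    """Names of the variables whose p-value is below alpha (H0: beta_i = 0 rejected)."""
    return [name for name, p in p_values.items() if p < alpha]


# Hypothetical values for illustration only.
print(t_statistic(0.76, 0.19))
print(significant_variables({"price": 0.01, "advertising": 0.03, "store_size": 0.40}))
```

In this made-up case, price and advertising would be kept, while store_size would be a candidate for removal.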

Fig. 3: Regression Output

3.3 R2 and Adjusted R2


As with simple linear regression, the coefficient of determination is denoted by R2. A weakness
of R2 is that it never decreases when another variable is added to the model, even if that variable
is totally irrelevant.

For example, suppose a regression has an R2 value of 0.75. Now suppose we add an irrelevant
independent variable to the regression model, just because we happen to have that data. When
we complete the regression computations, the R2 value may very well have increased from 0.75
to 0.80.

Indeed, it can be shown that R2 never decreases when a variable is added to the model.

In other words, if we assess how good a regression is solely by looking at how large the R2 value
is, we could be making a mistake. It would be helpful to have an alternative measure for this
purpose.

Such a measure is the adjusted R2.

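R2 is defined as: R2 = 1 – SSE / SST

and adjusted R2 is defined as: Adjusted R2 = 1 – MSE / MST

where MSE = SSE / (n – k – 1) and MST = SST / (n – 1), with n the number of observations and k
the number of independent variables.

Thus, the adjusted R2 takes the degrees of freedom into account.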

Consequently, when an irrelevant independent variable is added, the Adjusted R2 will most likely
decrease. Therefore, it is customary to look at the value of Adjusted R2 in addition to the R2 value
when assessing multiple regression results. The adjusted R2 adjusts for the number of predictor
terms in the model.

In our example, we can see that the R square value is 0.94 and the adjusted R square is 0.92.
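
The difference between the two measures can be illustrated with a small sketch. It uses the standard degrees of freedom, n – k – 1 for the error and n – 1 for the total, where n is the number of observations and k the number of independent variables. The function names and the sums of squares are hypothetical and are chosen only to show an irrelevant variable raising R2 slightly while lowering adjusted R2.

```python
def r_squared(sse, sst):
    """R2 = 1 - SSE / SST."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R2 = 1 - MSE / MST, with MSE = SSE/(n-k-1) and MST = SST/(n-1)."""
    mse = sse / (n - k - 1)
    mst = sst / (n - 1)
    return 1 - mse / mst


# Hypothetical data: 20 observations, SST = 100.
# Two independent variables leave SSE = 25; a third, irrelevant one only trims it to 24.9.
before = (r_squared(25.0, 100.0), adjusted_r_squared(25.0, 100.0, n=20, k=2))
after = (r_squared(24.9, 100.0), adjusted_r_squared(24.9, 100.0, n=20, k=3))

print("R2 and adjusted R2 with 2 variables:", before)
print("R2 and adjusted R2 with 3 variables:", after)
```

Here R2 rises because SSE falls a little, but the loss of one error degree of freedom outweighs that small gain, so adjusted R2 falls.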

4. Summary
Here is a quick recap of what we have learnt so far:

• Multiple regression is an analysis that relates a dependent variable to more than one
independent variable.
• In assessing a multiple regression model:
o the F-test can be used to test the overall utility of the model
o each separate t-test checks the significance of a single variable.
