Business Statistics 4-6

BUSINESS STATISTICS

SEGMENTS 4-6
Segment: Sampling and Estimation
Topic: Random Sampling Methods
Random Sampling Methods

Table of Contents

1. Concepts in Sampling
2. Simple Random Sampling
3. Stratified Random Sampling
3. Cluster Sampling
4. Comparison of Sampling Methods
5. Summary
6. Glossary
7. Answers
©COPYRIGHT 2019, ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION
Introduction
When a population is large, it may be difficult or impossible to measure every item in the
population and calculate parameters such as the population mean or the population proportion.
In these cases, we must resort to sampling.
This topic discusses some of the concepts in sampling and methods of selecting random samples.
Learning Objectives
At the end of this topic, you will be able to:
• identify the need for sampling methods
• distinguish the different types of random sampling methods
• recognise the significance of each sampling method.
1. Concepts in Sampling
We quite often have to make inferences about some characteristic of a population of interest.
For example, we may be interested in determining the average weekly expenditure on
entertainment per household for a given area. Another example might be if we were interested
in determining the proportion of a given population that watches a particular television program.
In each of these cases, it may be difficult or impossible to contact each member of the
population. In these cases, we need to identify a sample of the population and then obtain
information from that sample. The sample is a subset of the population that is representative of
the characteristics of the population.
Besides the difficulty of contacting every member of the population, other important reasons why we
might prefer to take a sample include issues of time and cost.
The population is the set of all members of the group about which we would like to draw
inferences. Before we can take the sample, we first need a list of all members of the population.
This list is called the sample frame.
Our sample should be representative of our population. One way to ensure this is to take a
probability sample. A properly designed sampling experiment should ensure that there is no
sampling bias and that our sample is representative of the population of interest.
Read below for an example of a biased sample.
Literary Digest and the 1936 US Presidential Election Polls
An excellent example of a biased sample is that of the Literary Digest and the 1936 US
presidential election polls.
The Literary Digest held a poll that forecast that Alfred Landon would defeat Franklin
Roosevelt by 57% to 43%. The result of the election was that Roosevelt won by a landslide
victory, getting about 62% of the vote.
The problem with the poll was that they had used lists of telephone and automobile owners
to select their sample. In those days, these were luxuries, so their sample consisted mainly of
middle and upper-class citizens. The majority of this group voted for Landon, but the lower
classes voted for Roosevelt.
Because their sample was biased towards wealthier citizens, their result was incorrect.
Source: Bowerman, B., R. O'Connell and M. Hand. Business Statistics in Practice. Boston:
McGraw-Hill/Irwin, 2001.
In the following sections, we will look at a number of different random sampling techniques
that are commonly used.
2. Simple Random Sampling

The simplest type of random sample is simple random sampling. A simple random sample is one
in which each member of the population has an equal probability of being chosen.
There are two methods of taking a simple random sample:
1. One method is to give each member of the population a number and then choose the
sample, of size n, by random number tables or by using software packages.
Fig. 1: Simple Random Sampling
This graphic shows a population made of 10 people, out of which persons 2, 5, 7, and 10 are
randomly selected. This represents simple random sampling.
2. The second method is to choose every kth item from the population frame, starting from a
randomly chosen item. This is also referred to as systematic sampling.
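The two selection methods above can be sketched in Python. This is a minimal illustration rather than a prescribed procedure: the function names, the fixed seeds, and the ten-person population are assumptions made for the example, and the step between selected items is called k here to keep it distinct from the sample size n.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th member (assumes n <= len(frame))."""
    rng = random.Random(seed)
    k = len(frame) // n          # step between selected members
    start = rng.randrange(k)     # random starting position in the first k members
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame: persons numbered 1 to 10
population = list(range(1, 11))
print(simple_random_sample(population, 4, seed=1))
print(systematic_sample(population, 5, seed=1))
```

Passing a seed only makes the draw repeatable for checking; in a real survey the seed would normally be left unset.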
3. Stratified Random Sampling
When we are sampling, we would like to extract as much information from the sample as
possible. One method that can achieve this is stratified random sampling.
Stratified random sampling is a method where the population is separated into sub-populations
called strata. Then instead of taking a random sample from the entire population, we take a
simple random sample from each of the strata.
The aim when selecting strata is to ensure that there is as much variation as possible
between the strata but as little variation as possible within each stratum.
Fig. 2: Stratified Random Sampling
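As a rough sketch of stratified random sampling, the Python below groups a population into strata and then takes a simple random sample from each one. The customer records, the age bands, and the choice of an equal sample size per stratum are all hypothetical choices for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(members, stratum_of, n_per_stratum, seed=None):
    """Group members into strata, then take a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)
    return {
        name: rng.sample(group, min(n_per_stratum, len(group)))
        for name, group in strata.items()
    }

# Hypothetical customers: (id, age band)
bands = ["under 30", "30 to 50", "over 50"]
customers = [(i, bands[i % 3]) for i in range(30)]

sample = stratified_sample(customers, lambda c: c[1], n_per_stratum=3, seed=7)
for band, chosen in sample.items():
    print(band, chosen)
```

Every stratum contributes to the sample, which is what distinguishes this method from cluster sampling.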
3. Cluster Sampling
Another method of random sampling is cluster sampling. In this method, we separate the
population into clusters and then take simple random samples from a number of these clusters.
The clusters are usually based on geographic dimensions such as cities or suburbs.
Fig. 3: Cluster Sampling
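Cluster sampling can be sketched in a similar way: first choose some of the clusters at random, then take a simple random sample within each chosen cluster. The suburb names, the household labels, and the sample sizes below are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly choose some clusters, then sample members within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical clusters: five suburbs with ten households each
suburbs = {f"suburb_{s}": [f"{s}{i}" for i in range(10)] for s in "ABCDE"}

sample = cluster_sample(suburbs, n_clusters=2, n_per_cluster=3, seed=3)
for suburb, households in sample.items():
    print(suburb, households)
```

Only the selected suburbs are visited, which is why cluster sampling tends to be cheaper when the population is geographically spread out.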
4. Comparison of Sampling Methods

Each method of sampling has different characteristics and typically there is a trade-off between
cost and accuracy among the various methods. The following table lists the advantages and
disadvantages of each method.
Table 1: Advantages and Disadvantages of Various Sampling Methods
Below is an exercise to identify the three main types of probability sampling techniques.
Exercise: Sampling Methods

Question 1: We are involved in a sampling experiment, which is concerned with determining
the credit card usage of customers in a particular city. The following three methods can be
adopted to obtain this information. Can you identify which type of random sampling has been
used in each method?
Options: Simple random | Cluster sampling | Stratified random
Method 1: Select customers randomly from the list of credit
Method 2: Classify the customers from different suburbs into different groups and select random
customers
Method 3: Divide the credit card customers into different strata based on their age and select
random customers from each stratum
5. Summary
Here is a quick recap of what we have learnt so far:
• Samples are randomly selected from populations for the purpose of drawing inferences
about population parameters.
• Simple random sampling is a sampling method where each member of the population
has an equal probability of being chosen.
• Stratified sampling is a sampling method where the population is separated into strata
and a simple random sample is taken from each stratum.
• Cluster sampling is a sampling method where the population is separated into clusters
and simple random samples are drawn from a number of these clusters.
6. Glossary
Sampling: The process of selecting a subset of the population with a view to
studying the characteristics of that population.
Population: The complete set in which we are interested.
Random sample: A sample where every item in the population is given an equal chance
of being selected.
7. Answers
Exercise: Sampling Methods
Options: Simple random | Cluster sampling | Stratified random
Method 1: Select customers randomly from the list of credit
Answer: Simple random sampling
Method 2: Classify the customers from different suburbs into different groups and select random
customers
Answer: Cluster sampling
Method 3: Divide the credit card customers into different strata based on their age and select
random customers from each stratum
Answer: Stratified random sampling
In simple random sampling, members are selected randomly from a population. In cluster
sampling, heterogeneous groups are created and members are selected randomly from a number of
the clusters. Finally, in stratified random sampling, homogeneous groups are created and then
members are randomly selected from each stratum.
Topic: Sampling Distributions
Sampling Distributions
Table of Contents
1. Sampling for the Population Mean
2. Central Limit Theorem
3. Summary
4. Glossary
5. Answers
Introduction
In many business situations, we would like to estimate some population parameter of interest.
To do this, we can take a sample from the population and try and estimate the population
parameter from the sample.
To understand the possible outcomes when we take our sample, we first need to
understand the concept of the sampling distribution. This concept is crucial later,
when we try to make inferences about the population.
In this topic, we shall investigate the concept of the sampling distribution of the sample mean.
Learning Objectives
• recognise the need for the sampling distribution of the sample mean
• comprehend the central limit theorem and its practical usage.
1. Sampling for the Population Mean

Consider the following:
1. A population of size N
2. With a mean, µ
3. A standard deviation, σ
We draw samples of size n from this and we calculate sample statistics such as:
1. The sample mean, $\bar{x}$
2. The sample standard deviation, s
But how accurate is the sample mean, $\bar{x}$, compared to the population mean, µ?
The sampling distribution of the sample mean is the key to understanding this problem.
Understanding the concept of sampling distribution is the conceptual bridge between the
probability distribution of a population and statistical inference based on sample data. If we have
knowledge of the sampling distribution, we can answer questions about the accuracy of the
result from our sample data.
Assume for a moment that we are sampling from a population with a known µ and σ. We take a
sample of a certain size n from this population and calculate the sample mean, $\bar{x}$. Imagine that
we continue to take samples of size n from this population and calculate the sample mean. If we
then take each of these sample means and construct a histogram, it will show the probability
distribution of all possible values of $\bar{x}$. In other words, it is the sampling distribution of the
sample mean.
Here are a few observations:
1. The mean of the sampling distribution of the sample mean in each case is equal to the
mean of the population from which we have sampled. That is,
$$\mu_{\bar{x}} = \mu$$
2. The standard deviation of the sampling distribution in each case is equal to the population
standard deviation divided by the square root of the sample size. That is,
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
3. If the population is normal, then the sampling distribution of the sample mean is also
normal.
It is extremely important to note the difference between the population standard deviation and
the standard deviation of the sample mean. Understanding this difference is the key to
understanding the concepts involved in statistical inference.
The standard deviation of the sample mean is also known as the standard error. In order to avoid
confusion, we will refer to the standard deviation of the sample mean as the standard error in
the following topics.
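The observations above can be checked with a small simulation. The sketch below assumes a normal population with a hypothetical mean of 50 and standard deviation of 10, draws many samples of size 25, and compares the mean and standard deviation of the resulting sample means with µ and σ divided by the square root of n.

```python
import random
import statistics

rng = random.Random(42)                        # fixed seed so the run is repeatable
mu, sigma, n, repeats = 50.0, 10.0, 25, 10000  # hypothetical population and design

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(repeats)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = sigma / n ** 0.5              # theoretical value: 10 / 5 = 2.0

print(f"mean of sample means: {mean_of_means:.2f} (population mean {mu})")
print(f"sd of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {standard_error})")
```

The individual observations vary with a standard deviation of 10, while the sample means vary far less, with a spread close to the standard error of 2.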
2. Central Limit Theorem

The central limit theorem is an important extension of the sampling distribution
we have just discussed. In the sampling simulation, we found that if the population from which we
sample is normal, then the sampling distribution of the sample mean will also be normal.
Unfortunately, this may not be helpful to us in the business environment since many populations
we sample from are not normal or cannot be assumed to be normal.
A remarkable discovery called the central limit theorem solves this problem though. It is arguably
one of the most important theorems in the development of statistical inference.
The central limit theorem states that even if the population from which we are sampling is not
normal, as long as our sample size is sufficiently large, the sampling distribution of the sample
mean will be approximately normal.
The larger the sample size, the more closely the sampling distribution will resemble a normal
distribution.
In practice, a sample size of 30 or more is usually sufficient to ensure that our sampling
distribution will be approximately normal.
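One way to see the central limit theorem at work is to sample from a clearly non-normal population and watch the skewness of the sample means shrink as n grows. The sketch below assumes an exponential population, which is right-skewed; the helper names, the seed, and the sample sizes are illustrative choices.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: 0 for a symmetric distribution, positive if right-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, repeats, rng):
    """Means of `repeats` samples of size n from a right-skewed exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(0)
for n in (1, 5, 30):
    means = sample_means(n, 5000, rng)
    print(f"n = {n:2d}: skewness of sample means = {skewness(means):.2f}")
```

The skewness falls towards zero as n increases, which is the behaviour the rule of thumb of 30 or more relies on.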
Below is an exercise to practice what you have learnt about the central limit theorem.
Exercise: Central Limit Theorem
A researcher is trying to estimate the salaries of MBA graduates two years after graduation.
Suppose that, unknown to the researcher, the distribution of the salaries of these graduates
is right-skewed with a mean of US$100,000 and a standard deviation of US$60,000.
Question 1: If the researcher sampled 30 students, which of the following would best
represent the sampling distribution of the sample mean:
1. Approximately normal with a mean of US$100,000 and a standard deviation of US$60,000
2. A T-distribution with a mean of US$100,000 and a standard deviation of US$10,954
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$10,954
4. A T-distribution with a mean of US$100,000 and a standard deviation of US$60,000
Question 2: Now, if the researcher sampled 100 students, which of the following would best
represent the sampling distribution of the sample mean?
2. Approximately normal with a mean of US$60,000 and a standard deviation of US$6,000
3. Approximately normal with a mean of US$100,000 and a standard deviation of US$6,000
3. Summary
• The sampling distribution of the sample mean describes the probability distribution for
the possible values of our sample mean.
• If our population is normal then our sampling distribution will also be normal, regardless
of the sample size, with a mean equal to the population mean.
• The standard deviation of the sample mean is also known as the standard error.
• The central limit theorem states that regardless of the shape of the population, as long
as our sample size is sufficiently large, the sampling distribution of the sample mean will
be approximately normal.
4. Glossary
Sample: A set of items selected from a population. Most of the time we are
interested in random samples, where the selection of items is
carried out giving an equal chance of selection to every item in the
population.
Standard deviation: The standard deviation of a data set is the positive square root of
the variance.
Population: The complete set in which we are interested.
Probability: The probability of an event is the likelihood of occurrence of that
event.
5. Answers
Exercise: Central Limit Theorem
Question 1:
The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$10,954.
Since the sample size is 30, which meets the usual rule of thumb of 30 or more, the central limit
theorem says that the sampling distribution of the sample mean will be approximately normal with
a mean of US$100,000 and a standard deviation of US$10,954 (given by sigma divided by the square
root of n).
Question 2:
The correct answer is option 3. Approximately normal with a mean of US$100,000 and a
standard deviation of US$6000.
Since the sample size is greater than 30, the central limit theorem says that the sampling
distribution of the sample mean will be approximately normal with a mean of US$100,000 and
a standard deviation of US$6,000 (given by sigma divided by the square root of n).
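As a quick check of the arithmetic in both answers, the standard error is the population standard deviation of US$60,000 divided by the square root of the sample size:

$$\sigma_{\bar{x}} = \frac{60{,}000}{\sqrt{30}} \approx 10{,}954 \qquad \text{and} \qquad \sigma_{\bar{x}} = \frac{60{,}000}{\sqrt{100}} = 6{,}000$$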
Topic: Confidence Interval for a Mean
Confidence Interval for a Mean
Table of Contents
1. Estimation Concepts
2. Confidence Interval for a Mean (σ Known)
3. Confidence Interval for a Mean (σ Unknown)
4. Sample Size Determination
5. Summary
6. Glossary
7. Answers
Introduction
In many situations, we need to make inferences about population parameters of interest. One
inference that we may be interested in making is an estimate of the population mean. To do this,
we need to make use of our sampling distribution.
In this topic, we discuss estimation and develop the concept of the interval estimate for a
population mean.
Learning Objectives
• describe the two estimation methods
• use the T-distribution to identify the best estimate when the population standard deviation
is unknown
• illustrate how to determine the sample size.
1. Estimation Concepts
The aim of estimation is to infer the value of a population parameter of interest from sample
data. For example, the sample mean is an estimate of the true population mean.
There are two types of estimates, namely:
1. Point estimate
2. Interval estimate
Read below to learn more about the two types of estimates.
Point estimates
A point estimate is our 'best guess' of the true population parameter. Suppose we are trying
to estimate the average income of students at a particular university. We could do this by
taking a sample of a certain size and calculating the sample mean. This sample mean is a point
estimate and is our best guess of the true value. While this estimate might be suitable, it does
not take into account all the information in the sample we have collected.
For example, one other sample statistic we could calculate is the sample standard deviation.
In addition, we have no indication of how accurate our 'estimate' is compared to the true (and
unknown) population mean.
Interval estimate
An alternative approach is the interval estimate. In this method, we specify an interval over
which we have a degree of confidence that the true parameter lies. To do this, we need to
make use of our sampling distribution.
2. Confidence Interval for a Mean (σ Known)

Recall that the sampling distribution tells where we are likely to find our sample mean given that
we know the population mean and population standard deviation. For example, given a
population with a mean of µ and a standard deviation of σ, the central limit theorem states that
as long as our sample size is large enough, the sampling distribution will be approximately normal
with a mean of µ and a standard deviation (standard error) of
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Given that the sampling distribution is approximately normal, we can state that there is
approximately a 68% chance that our sample mean will fall in the range
$$\mu \pm 1\,\sigma_{\bar{x}}$$
Likewise, there is approximately a 95% chance that the sample mean will fall in the range
$$\mu \pm 2\,\sigma_{\bar{x}}$$
In general, our equation becomes
$$\bar{x} = \mu \pm Z\,\sigma_{\bar{x}}$$
where Z is the number of standard errors on each side of the mean.
This equation can then be rearranged to obtain the general form of the interval estimate,
or confidence interval, as shown below:
$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}}$$
Then, to obtain the confidence interval for the population mean, all we need to do is specify the
confidence level and we can derive the confidence interval. Typical confidence levels are 90%,
95%, or 99%, with 95% being the most commonly used.
For example, suppose we are interested in determining the 95% confidence interval for the
average amount that inner-city workers in a particular city spend on coffee each week. Assume
that we know that the population standard deviation for this population is US$15. If we take a
sample of 100 people and find that the average of this sample is US$20, we can then calculate
the confidence interval of interest.
The 95% confidence interval would then become
$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}} = 20 \pm 1.96 \times \frac{15}{\sqrt{100}} = 20 \pm 2.94$$
In other words, we are 95% confident that the interval US$20 ± US$2.94 (US$17.06 to US$22.94)
contains the true population mean of weekly spending on coffee.
It should also be noted that in the example discussed, there is a 5% chance that the interval we
have created will not contain the true population parameter.
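The coffee example can be reproduced with a few lines of Python. This is a minimal sketch using the standard library's NormalDist to look up the Z-value; the function name and its parameters are assumptions made for illustration.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval estimate for a mean when the population standard deviation is known."""
    # Z-value that leaves (1 - confidence) / 2 in each tail of the standard normal
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / n ** 0.5   # Z multiplied by the standard error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=20, sigma=15, n=100)
print(f"{low:.2f} to {high:.2f}")   # prints "17.06 to 22.94"
```

Raising the confidence level to 0.99 widens the interval, since a larger Z-value is needed to cover more of the sampling distribution.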
3. Confidence Interval for a Mean (σ Unknown)

One problem that typically arises in calculating the confidence interval is that we rarely know
the value of the population standard deviation, σ.
In this case, the next best option is to use the sample standard deviation, s, calculated from
sample data. However, in doing this we introduce additional variability and we can no longer
guarantee that our sampling distribution will be normal.
Instead, the sampling distribution follows a T-distribution. The T-distribution is similar to the
normal distribution but has the following properties:
1. It has a mean of 0.
2. It has a standard deviation that varies with the sample size. This is specified through the
degrees of freedom, which is the sample size minus one.
Some examples of the T-distribution are shown in the following graph.
Fig. 1: Examples of T-distributions
Three different T-distributions are shown, all running from -4 to +4 on the x-axis and peaking
at x = 0 at different heights on the y-axis: the green curve peaks at 0.3, the red curve at 0.35,
and the blue curve at 0.4.
In using the sample standard deviation, the general form of the confidence interval becomes
$$\mu = \bar{x} \pm t\,s_{\bar{x}}$$
where
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
that is, the sample standard deviation divided by the square root of the sample size.
The T-value indicates the number of sample standard errors by which the sample mean differs
from the population mean. If we specify the confidence level, we can then find an appropriate
T-value for our confidence interval equation.
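A minimal sketch of this calculation in Python follows. Because the standard library has no T-distribution lookup, the T-value is passed in as a parameter; the value 2.262 used below is the standard table value for 95% confidence with 9 degrees of freedom, and the ten weekly spending figures are hypothetical.

```python
import statistics

def t_confidence_interval(sample, t_value):
    """Interval estimate for a mean using the sample standard deviation.

    t_value must be looked up for the chosen confidence level and
    len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1 divisor)
    standard_error = s / n ** 0.5
    margin = t_value * standard_error
    return x_bar - margin, x_bar + margin

# Hypothetical weekly spending figures for 10 people (9 degrees of freedom)
weekly_spend = [18, 25, 12, 30, 22, 15, 20, 27, 16, 15]
T_95_DF9 = 2.262                          # table value: 95% confidence, 9 df

low, high = t_confidence_interval(weekly_spend, T_95_DF9)
print(f"{low:.2f} to {high:.2f}")
```

For the same data, using the Z-value of 1.96 instead would give a narrower interval, which understates the uncertainty when σ has been estimated from a small sample.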
4. Sample Size Determination

So far we have assumed that we have taken a sample of a certain size and calculated the
confidence interval for our population mean.
One question that will arise is how big a sample should we take in the first place?
Recall that our general form of the confidence interval is
$$\mu = \bar{x} \pm Z\,\sigma_{\bar{x}}$$
In most business situations, we would like to limit our uncertainty in our estimate of the
population mean.
If we let the maximum error we are willing to tolerate be represented by B, then setting B equal
to z multiplied by the standard error and solving for n shows that the required sample size is
$$n = \left(\frac{z\,\sigma_{est}}{B}\right)^2$$
where
• z is the number of standard errors associated with a given confidence level (1.96 if we
assume a 95% confidence interval).
• σest is our estimate of the population standard deviation.
• B is the required maximum error in our estimate.
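The calculation can be sketched in Python, assuming the required sample size is the square of z multiplied by σest and divided by B, rounded up to the next whole number. That reading is consistent with the answer of 217 given for Question 1 of the exercise; the function name and the example inputs (a standard deviation of US$15 and a maximum error of US$2) are hypothetical.

```python
import math

def sample_size_for_mean(sigma_est, B, z=1.96):
    """Smallest whole sample size that keeps the error in the mean within B."""
    return math.ceil((z * sigma_est / B) ** 2)

# Hypothetical inputs: standard deviation of US$15, maximum error of US$2
print(sample_size_for_mean(sigma_est=15, B=2))   # prints 217
```

Rounding up rather than to the nearest integer matters here: a sample of 216 would leave the error slightly above the tolerated maximum. Halving B roughly quadruples the required sample size, because B enters the formula squared.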
Exercise
Below is an exercise to practice what you have learnt on confidence intervals.
Exercise - Confidence Intervals

You have to estimate the average rental price of a two-bedroom unit in a particular city with
a 95% confidence level. Assume that the mean of this distribution is US$500 with a standard
deviation of US$150.
Question 1: Given the above, what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within US$20?
1. 158
2. 190
3. 200
4. 217
Question 2: Given the above, what is the sample size required if we wish to estimate the
average rental price of a two-bedroom unit to within 5% of the true value?
1. 100
2. 105
3. 110
4. 139
Question 3: Given the above, what is the sample size required in Question 1 if the standard
deviation in the price of two-bedroom units was twice as large (i.e., US$300)?
1. 550
2. 800
3. 825
4. 865
5. Summary
• Estimates can be point estimates or interval estimates.
• A point estimate is the 'best guess' of the true population parameter. In this approach, a
sample of a certain size is used for calculating the sample mean.
• An interval estimate for a mean gives a degree of confidence that our population mean lies
in the confidence interval generated.
• If we do not know the population standard deviation, then we need to use the sample
standard deviation as our best estimate.
• The resulting sampling distribution follows a T-distribution.
• The required sample size can be calculated if we know how much uncertainty we can have in
our estimate.
• Microsoft Excel® can be used to calculate confidence intervals.
6. Glossary
Estimation: The process of inferring the value of an unknown parameter using
sampling.
7. Answers
Exercise: Confidence Intervals
Question 1: Correct answer is option 4, 217.

Topic: Confidence Interval for a Proportion
Confidence Interval for a Proportion
Table of Contents
1. Sampling Distribution for a Proportion
2. Confidence Interval for a Proportion
3. Sample Size Determination
4. Summary
5. Answers
Introduction
When dealing with qualitative data, we can determine the proportion of times that a value of
interest occurs. The parameter of interest in these cases is the population proportion.
We might be interested in making inferences about the population proportion. This can be a
point estimate such as the sample proportion or an interval estimate as already discussed.
In this topic, we will discuss the concept of interval estimate for a proportion.
Learning Objectives
• describe the sampling distribution for a population proportion
• determine the confidence interval for a population proportion
• determine the required sample size when estimating proportions.
1. Sampling Distribution for a Proportion

If our data type is qualitative, then the summary measure we are interested in is the proportion
of times that our value occurs. Surveys are often used to estimate proportions with the
parameter of interest being the population proportion, p.
The relationship between the population proportion and the sample proportion is shown in the
following figure.
Fig. 1: Schematic Diagram of Sampling for a Population Proportion
This graphic shows the relationship between the population proportion and the sample
proportion. A big square represents the population, with an arrow pointing to a small oval shape
which represents the random sample.
For example, suppose that we were interested in launching a new product and decided to
conduct a survey to try and find out the proportion of consumers who would be interested in
buying our product. From the survey results, we can obtain the proportion of consumers in our
sample who said that they would buy our new product. This proportion is known as the sample
proportion. Our objective is to use this result to make an inference about the true population
proportion.
The sample proportion is a point estimate of the population proportion and is represented as
$$\hat{p} = \frac{x}{n}$$
where x is the number of successes and n is the number of trials.
Overall, the true population proportion is p, from which we take a sample of size n. The number
of successes in a sample of size n is x, so the proportion of successes in the sample is
$\hat{p} = x/n$.
The number of successes (in this case the number of consumers who said that they would buy
the product) is a binomial random variable with
• a mean of $E(x) = np$
• a standard deviation of $\sqrt{np(1-p)}$
Rather than consider the number of successes, it is far easier to talk about the proportion of
successes. The mean of $\hat{p}$ is given by
$$E(\hat{p}) = p$$
and the standard deviation of $\hat{p}$ is given by
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
It can be shown that, as long as n is reasonably large with np > 5 and n(1 - p) > 5, the
sampling distribution of the sample proportion is approximately normally distributed, with
a mean of p and a standard deviation given by the square root of p(1 - p)/n.
As for the case of mean, the standard deviation of the sample proportion is also known as the
standard error.
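The standard error of a proportion and the two large-sample conditions are simple to compute. The sketch below uses hypothetical helper names and example values to show both calculations.

```python
def proportion_standard_error(p, n):
    """Standard deviation of the sample proportion: sqrt(p(1 - p) / n)."""
    return (p * (1 - p) / n) ** 0.5

def normal_approximation_ok(p, n):
    """Check the conditions np > 5 and n(1 - p) > 5."""
    return n * p > 5 and n * (1 - p) > 5

# Hypothetical values: population proportion of 0.2, sample of 100
print(proportion_standard_error(0.2, 100))
print(normal_approximation_ok(0.2, 100))    # np = 20 and n(1 - p) = 80
print(normal_approximation_ok(0.01, 100))   # np = 1, so the condition fails
```

When the check fails, as in the last line, the normal approximation is not reliable.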
2. Confidence Interval for a Proportion

The procedure for obtaining the confidence interval for a population proportion is very similar
to the procedure for obtaining the confidence interval for the mean. The confidence interval is
our point estimate plus or minus some number of standard errors.
The confidence interval is based on a large sample size. As already discussed, the required
conditions are
$$np > 5 \quad \text{and} \quad n(1 - p) > 5$$
If these conditions are satisfied, then the sampling distribution will be approximately normal,
and the confidence interval is given by
$$p = \hat{p} \pm Z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The appropriate Z-multiple will depend on the level of confidence we require. For the 95%
confidence interval, the Z-value will be 1.96.
As an example, suppose that we are conducting a survey of 400 households to find the
proportion of those that have purchased a high definition television. Assuming our survey found
that 60 of these have made such a purchase, estimate with 99% confidence the true proportion
of households in the population of interest that have made this purchase.
The point estimate, in this case, is 60/400 or 0.15. The interval estimate or confidence interval
is given by
We first check to see if the conditions for the normal approximation are satisfied as follows:
6
In both cases the conditions are satisfied.
Since we are after the 99% confidence interval, we are after the Z-value such that the area in
both tails is equal to 0.01. In other words, the area in each tail is 0.005. Using the NORMSINV
function in Microsoft Excel®, the corresponding Z-value can be found as
= NORMSINV (0.005)
= -2.576
This value is negative since we have calculated the Z-value corresponding to the left-hand tail.
We are only interested in the magnitude of this number.
Therefore the 99% confidence interval is given by
$0.15 \pm 2.576\sqrt{0.15 \times 0.85/400} = 0.15 \pm 0.046$
In other words, we are 99% confident that the true population proportion of households that
have purchased a high definition television lies between 10.4% and 19.6%.
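As a cross-check on the arithmetic, the same interval can be sketched in Python using only the standard library. The function name is hypothetical, and `NormalDist().inv_cdf` plays the role that NORMSINV plays in Excel.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes: int, n: int, confidence: float):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    tail = (1 - confidence) / 2              # area in each tail
    z = NormalDist().inv_cdf(1 - tail)       # positive Z-multiple
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Survey from the text: 60 of 400 households, 99% confidence.
lower, upper = proportion_confidence_interval(60, 400, 0.99)
print(f"{lower:.3f} to {upper:.3f}")
```

Calling the function with a confidence level of 0.95 instead uses a Z-multiple of about 1.96 and gives a narrower interval.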
3. Sample Size Determination
As for the case with means, we need to decide on how big a sample to take when we are
estimating proportions. In order to do this, we again have to specify the maximum amount of
uncertainty that we are willing to tolerate in our estimate of the true population proportion.
Recall the formula for the confidence interval for a proportion. The term to the right of the
point estimate is the uncertainty term. The uncertainty in our estimate of the population
proportion is given by
$\pm Z\sqrt{\hat{p}(1-\hat{p})/n}$
If we let the maximum error we are willing to tolerate be denoted by B, then it can be shown
that the required sample size is given by
$n = Z^2\hat{p}(1-\hat{p})/B^2$
This formula states that the required sample size, n, is given by Z squared multiplied by the
sample proportion multiplied by one minus the sample proportion, all divided by B squared.
Where
 Z is a value corresponding to our desired level of confidence
 B is the maximum error we are willing to tolerate
 $\hat{p}$ is an estimate of the true population proportion
In most cases, we do not have any knowledge of the true population proportion. In these cases,
it is best to use a value of 0.5, since this maximises the product of the proportion and one minus
the proportion and therefore gives the largest (most conservative) sample size.
Below is an exercise to practice estimating sample size for proportions.

Exercise: Sample Size for Proportions
Suppose that we were interested in determining the proportion of people in a particular city
that have a digital camera. We would like to estimate this proportion to within plus or minus
3%, with 95% confidence.
Question: Based on the information above, if we believe that the population proportion is
20%, what sample size would be required to estimate the proportion of people in this city that
own a digital camera?
1. 453
2. 500
3. 633
4. 683
4. Summary
 When dealing with qualitative data, it is generally the population proportion that is of
interest.
 It can be shown that as long as n is large enough such that np and n(1- p) are greater than or
equal to 5, then the sampling distribution of the sample proportion is approximately normal,
with a mean of p and a standard deviation given by the square root of p(1- p)/n.
 As for means, confidence intervals can be developed for the population proportion.
 The required sample size can be determined based on certain assumptions.
5. Answers
Exercise: Sample Size for Proportions
The correct answer is option 4, 683.
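The answer is calculated as:
$n = Z^2 p(1-p)/B^2 = (1.96)^2 \times 0.2 \times 0.8/(0.03)^2 = 682.95$
where Z = 1.96 (for a 95% confidence interval), p = 0.2 is the assumed proportion and B = 0.03
is the maximum error. The result is rounded up to 683.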
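The calculation can be sketched in Python as a check. The function name and its default confidence level of 95% are assumptions made for this illustration; the rounding up uses `math.ceil`, since a fractional sample size must be rounded to the next whole observation.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p: float, max_error: float,
                               confidence: float = 0.95) -> int:
    """Required n to estimate a proportion to within +/- max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)


# Exercise values: assumed proportion 20%, maximum error 3%.
n_exercise = sample_size_for_proportion(0.2, 0.03)
# Conservative choice when nothing is known about the proportion.
n_conservative = sample_size_for_proportion(0.5, 0.03)
print(n_exercise, n_conservative)
```

The second call shows why 0.5 is the cautious default: it demands a noticeably larger sample than the assumed 20%.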
Segment: Hypothesis Testing
Topic: Concepts in Hypothesis Testing
Concepts in Hypothesis Testing
Table of Contents
1. Null and Alternative Hypotheses
2. One-Tailed and Two-Tailed Tests
3. Type I and Type II Errors
4. Significance Level and p-Value
5. Power of a Test
6. Summary
7. Glossary
Introduction
A hypothesis is a tentative explanation for an observation or phenomenon that has not yet been
verified. Hypothesis testing is the process of determining whether or not a given hypothesis is
consistent with observed facts.
This topic introduces the concepts involved with hypothesis testing and some of the issues
involved in testing the hypothesis.
Learning Objectives
At the end of this topic, you will be able to:
 define null and alternative hypotheses
 distinguish between one and two-tailed tests
 recognise the significance level and calculate the p-value
 examine Type II errors and the power of a test.
1. Null and Alternative Hypotheses

We often need to make inferences about a population of interest based on sample data. The
inferences often involve making a decision about a particular theory or hypothesis that we would
like to test.
Example
For instance, a manufacturer of light bulbs might claim that the average life of their bulbs is
at least 8,000 hours. Based on the results of a sampling experiment conducted on the bulbs,
it is possible to calculate the probability of observing the result obtained (or more extreme)
from our sampling experiment if the claim under test is true. Depending on the results
obtained, the manager can then decide whether or not to reject the claim.
Before we can test the particular theory or hypothesis though, we need to understand the
basic concepts involved in hypothesis testing.
For the case of the claim made by the manufacturer of light bulbs about the average life of
the bulbs, we have:
H0: µ ≥ 8,000 hours
H1: µ < 8,000 hours
It could happen that someone else, such as a consumer advocate, could make a counterclaim
that µ < 8,000. If so, which claim do we take as the null hypothesis?
It is customary to formulate the null hypothesis such that if it were true, then no special action
would be necessary, and if the alternative is true, then some special action would be
necessary.
This approach would favour the H0 and H1 as outlined above. The burden of proof is usually
on the alternative hypothesis H1. It is up to the researcher to provide enough evidence in
support of the alternative; otherwise, we do not reject the null hypothesis and continue to act
as though it is true.
In the above case, if there was enough evidence in favour of the alternative hypothesis (i.e., µ <
8,000), then some action, such as requiring the manufacturer to revoke the claim and pay for
damages caused might be necessary.
Two important issues that should be kept in mind are:
1. Tests are only performed on population parameters.
2. There is always an 'equals' sign in the null hypothesis. Note that this could be '=', '≤' or '≥'.
2. One-Tailed and Two-Tailed Tests

Hypothesis tests can be either one-tailed or two-tailed, depending on what we are trying to
prove.
Read below to find out which test method to use.
Hypothesis Testing Methods
One-tailed hypothesis test

A one-tailed hypothesis is one where the only sample results which can lead to rejection of
the null hypothesis are those in a particular direction. One-tailed alternatives are phrased in
terms of '>' or '<'.
For example, if the manufacturer was interested in whether or not the average life of the bulbs
was more than 8,000 hours, then the hypothesis might be set-up as:
H0: μ ≤ 8,000 hours
H1: μ > 8,000 hours
This is one-tailed.
Two-tailed hypothesis test

A two-tailed test is one where results in either of two directions can lead to rejection of the
null hypothesis.
Two-tailed alternatives are phrased in terms of '≠'.
For example, if the manufacturer of the light bulbs was only interested in whether the average
life of their bulbs differed from 8,000 hours, then the hypothesis might be set up as a two-tailed test:
H0: μ = 8,000 hours
H1: μ ≠ 8,000 hours
Once the hypotheses are set up, it is easy to detect whether the test is one-tailed or two-tailed.
The real question is whether to set up a hypothesis for a particular problem as one-tailed or
two-tailed. There is no statistical answer to this question. It depends entirely on what we are
trying to prove.
3. Type I and Type II Errors

An analyst can either reject or fail to reject the null hypothesis that he or she is testing.
Hopefully, the analyst will make the right decision, but it needs to be recognised that a mistake
could be made. The outcomes of a decision are shown in the figure below.
Fig. 1: Outcomes of Decisions Made When the Null Hypothesis is Rejected or Accepted
As can be seen from this figure, if we reject a null hypothesis that was false then we have made
a correct decision. Likewise, if we fail to reject a null hypothesis that was true, then we have also
made the correct decision.
However, we need to recognise that there are two types of decisions where we are making an
error:
1. The mistake of rejecting a true H0 is called a Type I error
2. The mistake of failing to reject a false H0 is called a Type II error
For example, suppose a bank manager was interested in whether or not the mean waiting time
for customers had increased from its previous value. In this case, the null hypothesis might be
that the mean waiting time has not changed. If the manager sampled a group of customers and
performed a hypothesis test on the results, the possible errors that the manager could make
are:
1. Type I error: Concluding that the average waiting time had increased when it had not
2. Type II error: Concluding that the average waiting time was the same, when in fact it had
increased
4. Significance Level and p-Value

The real question is how strong the evidence in favour of the alternative hypothesis must be in
order to reject the null hypothesis. To do this, we need to consider the concepts of the
significance level and the p-value.
Read below to find out more about each of these concepts.
Significance levels
The researcher must determine the maximum probability of a Type I error that he or she is
willing to tolerate. This value is denoted by α, the significance level of the test, and is most
commonly equal to 0.05, although α = 0.01 and α = 0.10 are also frequently used. Then, given
the value of α, we can use statistical theory to determine the rejection region. If the sample
evidence falls into this region, we reject the null hypothesis; otherwise, we do not reject it.
Sample evidence that falls into the rejection region is called statistically significant at the α
level.
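To make the rejection region concrete, the sketch below computes critical values for the common significance levels. It assumes, purely for illustration, that the test statistic follows a standard normal (Z) distribution; the function name is hypothetical.

```python
from statistics import NormalDist


def z_critical(alpha: float, two_tailed: bool = False) -> float:
    """Positive critical value of Z for a given significance level."""
    tail = alpha / 2 if two_tailed else alpha   # area in the rejection tail
    return NormalDist().inv_cdf(1 - tail)


for alpha in (0.10, 0.05, 0.01):
    one = z_critical(alpha)
    two = z_critical(alpha, two_tailed=True)
    print(f"alpha={alpha}: one-tailed {one:.3f}, two-tailed {two:.3f}")
```

In an upper-tailed test the null hypothesis is rejected when the test statistic exceeds the one-tailed value; in a two-tailed test it is rejected when the magnitude of the statistic exceeds the two-tailed value.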
p-value
An alternative approach avoids the use of the significance level and the rejection region and
instead simply reports how significant the sample evidence is. We can do this by calculating
the p-value of the hypothesis test. The p-value of a sample is the probability, assuming the
null hypothesis is true, of seeing a sample with at least as much evidence in favour of the
alternative hypothesis as the sample actually observed. The smaller the p-value, the more
evidence there is in favour of the alternative hypothesis.
Example:
Suppose the manufacturer of the light bulbs in the example above was interested in testing
whether the lifetime of the bulbs was more than 8,000 hours. The manufacturer might then
take a sample of these bulbs and find that the average life of the sample of bulbs was 8,750
hours. The p-value, in this case, is the probability that we could get a sample mean of 8,750
hours or more, assuming that the true average life of the bulbs was 8,000 hours.
In general, smaller p-values indicate more evidence in support of the alternative hypothesis. If a
p-value is sufficiently small, almost any decision-maker will conclude that rejecting the null
hypothesis is the more reasonable decision.
How small is a 'small' p-value?
This is largely a matter of judgement, but as a rough guide:
 If the p-value is less than 0.01, it provides convincing evidence in favour of the alternative
hypothesis.
 If the p-value is between 0.01 and 0.05, there is strong evidence in favour of the alternative
hypothesis.
 If the p-value is between 0.05 and 0.10, it is in a grey area.
 If the p-value is greater than 0.10, it is interpreted as weak or no evidence in support of the
alternative.
Test statistic
To compute the p-value, we use a test statistic computed from the sample data.
The test statistic is a random variable whose distribution is well known so that certain desired
probabilities can be calculated.
For instance, consider the case where we are testing a statement about µ using a sample size of
at least 30 (n ≥ 30) and σ is known.
Sampling theory tells us that the quantity:
$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
will follow a Z-distribution.
Thus, the test statistic in this case will be:
$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
where $\mu_0$ is the value of µ specified in the null hypothesis.
Now, consider if σ is unknown and the population is normal. Then, the test statistic would be:
$t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}$
where the sample standard deviation, S, has been substituted in place of the population
standard deviation, σ. This statistic follows a T-distribution with n - 1 degrees of freedom.
Fortunately, p-values for a variety of statistical tests are easily calculated
using Excel and other software.
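The Z test statistic and its p-value can be sketched for the light bulb example. The sample mean of 8,750 hours and the hypothesised mean of 8,000 hours come from the text; the population standard deviation of 2,500 hours and the sample size of 36 are hypothetical values chosen only so that the calculation can be carried through.

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Z = (x-bar - mu0) / (sigma / sqrt(n)), for known sigma."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


def upper_tail_p_value(z: float) -> float:
    """P(Z > z): the p-value for an upper-tailed test."""
    return 1 - NormalDist().cdf(z)


# sigma = 2500 and n = 36 are assumed values, not given in the text.
z = z_statistic(sample_mean=8750, mu0=8000, sigma=2500, n=36)
p_value = upper_tail_p_value(z)
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
```

Under these assumed values the p-value falls between 0.01 and 0.05, which the guide above would describe as strong evidence in favour of the alternative hypothesis.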
5. Power of a Test
The probability that we could commit a Type II error is denoted by β. A Type II error occurs
when we fail to reject the null hypothesis in the situation where the null hypothesis
is actually false. Given that, the probability that we would correctly reject a null hypothesis that
was actually false is given by 1 - β. This probability is known as the power of a test.
Example:
Suppose that you were the marketing manager for a pharmaceutical company that has
developed a new drug, and your company was making claims that the drug was safe. In
medical trials to test this claim, the null hypothesis would be that the drug was safe.
If we were to commit a Type II error during these tests, we would be failing to reject a null
hypothesis that was actually false. In other words, we would be saying that the drug was safe
when in fact it was not safe. In this case, we would be committing a serious error.
In this case, we would like to minimise the probability of making a Type II error. To do that,
we would like to increase the power of the test. One popular method of increasing the power
of the test is to increase the sample size.
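The effect of sample size on power can be illustrated with a short sketch. It assumes an upper-tailed Z-test with known σ, which is a simplification chosen for this illustration rather than the setting of the drug example, and all the numbers (a hypothesised mean of 8,000, a true mean of 8,500 and σ = 2,500) are hypothetical.

```python
import math
from statistics import NormalDist


def power_upper_tailed_z_test(mu0: float, mu1: float, sigma: float,
                              n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of an upper-tailed Z-test when the true mean is mu1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)        # critical value
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))     # true mean in standard errors
    return 1 - NormalDist().cdf(z_alpha - shift)


# Hypothetical values: power is evaluated for three sample sizes.
powers = {n: power_upper_tailed_z_test(8000, 8500, 2500, n) for n in (25, 100, 400)}
for n, power in powers.items():
    print(f"n = {n}: power = {power:.3f}")
```

The power rises as n increases because the standard error shrinks, so the same true difference lies further into the rejection region.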
6. Summary
 A null hypothesis is a statement about a population parameter that can be tested.

 The logical opposite of the null hypothesis is the alternative hypothesis.
 Depending on the way the null hypothesis is set-up, the rejection region occurs on either
one or both tails of the probability distribution of the test statistic.
 Correspondingly, the test is either a one-tailed test or a two-tailed test.
 In any test, there will be chances of Type I and Type II errors occurring.
7. Glossary
Confidence level: In interval estimation, it is the probability that the value being
estimated lies within the estimated interval. In hypothesis
testing, it is the complement of the level of significance.
Test statistic: In hypothesis testing, a statistic computed from sample data
which follows a well-known distribution so that probabilities
such as the p-value can be calculated.
Topic: Hypothesis Tests for a Population Mean and
Population Proportion
Hypothesis Tests for a Population Mean and Population Proportion
Table of Contents
1. Testing a Population Mean
2. Testing a Population Proportion
3. Summary
Introduction
There are many instances in business where we need to decide about a single population. For
example, a claim might be made that
 The average number of complaints made by customers has increased above the usual
level.
 A company's product does not weigh the same amount as stated on the packaging.
 The average time spent online by internet users has increased. A sample of internet users
could be taken, and a hypothesis test conducted.
In these cases, inferences can be made about a population parameter of interest.
When our data is qualitative in nature, the population parameter of interest would be the
population proportion. The point estimator of this parameter is the sample proportion, which
under certain conditions has an approximately normal sampling distribution.
In this topic, we will explore the concept of testing a population mean and a population
proportion.
Learning Objectives
At the end of this topic, you will be able to:
 test a mean with an unknown population standard deviation
 illustrate a hypothesis test for a population proportion.
1. Testing a Population Mean
In hypothesis tests of a mean, we are testing to see if our sample data is consistent with our
hypothesised mean. In order to do this, we need to consider the sampling distribution of the
sample mean. If we know the population standard deviation and our sample size is large
enough, then the sampling distribution of the sample mean will follow a normal distribution.
In practice though, we rarely know the value of the population standard deviation, and as in
estimation, if σ is unknown, we use s as the replacement. The test statistic will then
follow a T-distribution with n - 1 degrees of freedom.
Example:
Suppose an entrepreneur was considering a location for an electricity-producing windmill
farm. To be successful, the average wind speed at that particular location should be greater
than 32 km/h. In order to test the location, the entrepreneur arranged for 50 wind speed
measurements to be taken at the location on randomly selected days over a period of time.
The average wind speed from the sample of 50 days was 38 km/h with a sample standard
deviation of 25 km/h.
Should the entrepreneur go ahead with the windmill farm?
In this case, we are testing a mean with an unknown population standard deviation, so the
sampling distribution will follow a T-distribution. The null and alternative hypotheses that the
entrepreneur would be interested in testing are:
H0: µ ≤ 32
H1: µ > 32
Assuming the null hypothesis to be true, the sampling distribution is as shown in the following
figure.
Fig. 1: Sampling Distribution of Mean When Null Hypothesis is True
The null hypothesis is that the population mean is less than or equal to 32, and the alternative
hypothesis is that the population mean is greater than 32.
The p-value is the probability that we could have obtained our sample result as extreme or more
extreme if the null hypothesis is true. In this case, it is the probability in the tail to the right of
our observed value of 38 km/h since our test is a one-tailed test.
In this case, the test statistic can be calculated as:
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{38 - 32}{25/\sqrt{50}} = 1.697$
The p-value can then be found using Excel's TDIST function:
= TDIST (1.697, 49, 1)
= 0.04802 = 4.8%.
This is a reasonably small p-value. Our decision in this case would be that there seems to be
sufficient evidence to reject the null hypothesis and conclude that the average wind speed is
greater than 32 km/h at this location.
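For readers working outside Excel, the windmill calculation can be reproduced with a short Python sketch. The Python standard library has no T-distribution function, so the upper-tail probability is approximated here by numerically integrating the T density with Simpson's rule; this is a stand-in for TDIST chosen for the illustration, and the function names are hypothetical.

```python
import math


def t_statistic(sample_mean: float, mu0: float, s: float, n: int) -> float:
    """t = (x-bar - mu0) / (s / sqrt(n)), for unknown sigma."""
    return (sample_mean - mu0) / (s / math.sqrt(n))


def t_pdf(x: float, df: int) -> float:
    """Density of the T-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))


def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule area on [0, t]."""
    h = t / steps                                   # steps must be even
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Windmill example: mean 38 km/h, s = 25 km/h, n = 50, H0: mu <= 32.
t = t_statistic(38, 32, 25, 50)
p_value = t_upper_tail(t, df=49)
print(f"t = {t:.3f}, one-tailed p-value = {p_value:.4f}")
```

For a two-tailed test the same tail probability would simply be doubled.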
In other cases, we may be interested in performing a two-tailed test. In these cases, the p-value
is the probability in both tails.
2. Testing a Population Proportion
As we already know, as long as our sample size is reasonably large such that np and n(1 - p) are
greater than or equal to 5, the sampling distribution of the sample proportion is approximately
normally distributed with a mean of p and a standard deviation given by the square root of
p(1 - p)/n. Note that the term n refers to the sample size while the term p refers to the population
proportion.
Similar to hypothesis tests for a population mean, hypothesis tests can also be performed for
population proportions. For proportions, we can calculate the sample proportion from our
sample data and then use this to calculate our p-value for the test.
Following is an example for performing hypothesis tests for population proportions.
Example:
A company is considering introducing a new product and has determined that it needs to
capture a market share of 10% to break even. The product will be profitable for the company
if it can capture more than 10% of the market.
Suppose that the company surveyed 400 potential customers to ask them if they would buy
the product. Suppose also that they received positive responses from 52 of these people, who
said they would purchase the product.
Given this data, is there enough evidence to suggest that the company should proceed with
the launch of the product?
Solution
The parameter of interest, in this case, is the population proportion of potential customers
who would buy the product. The company has decided that they need to capture more than
10% of the market, so in this case, the null and alternative hypotheses would be:
H0: p ≤ 0.1
H1: p > 0.1
The sample size was 400 with np = 40 and n(1 - p) = 360. Since np and n(1 - p) are both
greater than 5, the sampling distribution will be approximately normally distributed with:
$E(\hat{p}) = p = 0.1$
and
$\sigma_{\hat{p}} = \sqrt{p(1-p)/n} = \sqrt{0.1 \times 0.9/400} = 0.015$
The sample proportion in this case is 52/400 = 0.13. The following figure shows the sampling
distribution assuming the null hypothesis to be true and p-value for the test.
Fig. 2: Sampling Distribution of Proportion When Null Hypothesis is True
The p-value is the probability that we could have obtained our sample result as extreme or more
extreme if the null hypothesis is true. In this case, it is the probability in the tail to the right of our
observed proportion of 13% since our test is a one tailed test.
In this case, the test statistic can be calculated as:
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.13 - 0.1}{\sqrt{0.1 \times 0.9/400}} = 2$
The test statistic Z is equal to the sample proportion minus the hypothesised population
proportion, divided by the square root of the hypothesised population proportion multiplied by
one minus that proportion and divided by the sample size.
The p-value can then be found using Excel® NORMDIST or NORMSDIST functions:
= 1 - NORMDIST (0.13, 0.1, 0.015, 1)
= 0.02275 = 2.275%
This is a reasonably small p-value. Our decision, in this case, would be to reject the null
hypothesis and conclude that the true population proportion of consumers who would buy
the product is greater than 10%.
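The same test can be sketched in Python as a check on the Excel result. The function name is hypothetical; `NormalDist().cdf` from the standard library takes the place of NORMSDIST.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes: int, n: int, p0: float):
    """Upper-tailed Z-test of H0: p <= p0 against H1: p > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)     # standard error under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)     # area in the right-hand tail
    return z, p_value


# Market survey from the text: 52 of 400 respondents, p0 = 0.1.
z, p_value = proportion_z_test(52, 400, 0.1)
print(f"Z = {z:.2f}, p-value = {p_value:.5f}")
```

Note that the standard error uses the hypothesised proportion, not the sample proportion, because the p-value is computed on the assumption that the null hypothesis is true.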
As for means, we may be interested in performing a two-tailed test for proportions. In these
cases, the p-value is calculated as the probability in both tails.
3. Summary
 Hypothesis tests for a population mean can be either one-tailed or two-tailed.

 If the population standard deviation is known and the sample size n ≥ 30, then the test
statistic will follow a normal distribution.
 If the population standard deviation is not known, then the sample standard deviation must
be used. If we assume the population is normally distributed, the resulting test statistic will
follow a T-distribution with n - 1 degrees of freedom.
 Hypothesis tests for a population proportion can be either one-tailed or two-tailed.
 If np and n (1 - p) are both greater than 5, then the sampling distribution will be
approximately normally distributed.
Topic: Testing Differences between Population
Means
Testing Differences between Population Means
Table of Contents
1. Comparing Population Averages
2. Paired Difference Method
3. Independent Samples Method
4. Using Excel for Testing Differences
5. Summary
6. Answers
Introduction
We quite often have to test for differences between two population means. Depending on the
circumstances, the samples can be either independent or paired.
In this topic, we will explore the concept of testing for differences between two population
means.
Learning Objectives
At the end of this topic, you will be able to:
 distinguish between the paired difference method and the independent sample method
used when performing hypothesis tests between two means
 use Microsoft Excel® to perform hypothesis tests for the differences between two means.
1. Comparing Population Averages
We saw the case where the light bulb manufacturer claimed that the average life of its light
bulbs was at least 8,000 hours. What if the claim was that their bulbs last longer than another
brand XYZ?
Now it is irrelevant what the average value is; what matters is the difference between the
averages of two brands.
Using the sampling theory we have investigated so far, we can compare two population
averages in two different ways:
1. Paired difference method
2. Independent samples method
As you may guess, these methods are applicable under certain conditions. In this topic, we
shall explore the details of the two methods and their applicability.
2. Paired Difference Method
This method is applicable when the observations come in matched pairs, such as two
measurements on the same item. For example, if we
wish to assess the effectiveness of a training program on employees, we can measure the
performance of each employee before and after the training. It makes sense to compare
the two measurements and take the difference. The difference can be attributed to the
effectiveness of the training. The t-test for the difference between means from paired samples
is used. The advantage of this method is that the effects of extraneous variables are controlled
and thus measured differences are less prone to error. As a result, the test is more reliable.
Following are the notations used for the comparison of two populations:
Notation for the claimed difference in the population means is $D_0$.
Notation for the average of the sample differences is $\bar{D}$.
Notation for the standard deviation of the sample differences is $S_D$.
Notation for the number of paired observations is $n$.
The test statistic for the paired observation test is:
$t = \frac{\bar{D} - D_0}{S_D/\sqrt{n}}$
When the null hypothesis
$H_0: \mu_1 - \mu_2 = D_0$
is true, and the two populations are normally distributed, this quantity will
follow a T-distribution with (n - 1) degrees of freedom. As a result, we conduct a T-test,
meaning the test statistic is t.
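The paired test statistic can be sketched as follows. The before-and-after performance scores are invented purely for illustration, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after, d0: float = 0.0):
    """Return (t, degrees of freedom) for the paired difference test."""
    diffs = [a - b for a, b in zip(after, before)]   # one difference per pair
    n = len(diffs)
    t = (mean(diffs) - d0) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for five employees before and after training.
before = [70, 65, 80, 75, 60]
after = [74, 68, 83, 80, 65]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Because each employee serves as his or her own control, only the five differences enter the calculation; the spread of performance between employees does not affect the result.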
3. Independent Samples Method
If a paired difference test is possible in a situation, then that must be the preferred method
since it is more reliable. But a paired difference is not always possible.
Suppose we wish to compare the average number of sick days taken by employees in two
departments of a large company over the period of a year.
In this case, we need to take one random sample from the first department and another
independent sample from the other department. We then have to take the difference between
the averages of the two samples.
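The random variable of interest is the difference between the two sample means, denoted by:
$\bar{x}_1 - \bar{x}_2$
Using sampling theory, we can infer the following:
1. The mean of the difference is the difference between the two population means:
$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$
2. Since $\bar{x}_1$ and $\bar{x}_2$ are independent random variables, the variance of their difference
is the sum of their individual variances. In other words:
$\mathrm{Var}(\bar{x}_1 - \bar{x}_2) = \sigma_1^2/n_1 + \sigma_2^2/n_2$
3. If both populations are normally distributed then $\bar{x}_1 - \bar{x}_2$ will also be normally
distributed. As a result, the quantity calculated from the formula below will follow a
Z-distribution.
$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
4. Even if the two populations are not normally distributed, as n1 and n2 increase, the
distribution of $\bar{x}_1 - \bar{x}_2$ will approach a normal distribution. Practically, if n1 and n2
are both at least 30, we can assume $\bar{x}_1 - \bar{x}_2$ is normally distributed.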
The above results enable us to conduct a Z-test when either the populations are normally
distributed, or the samples are large.
But there is a twist. The computation of Z calls for $\sigma_1^2$ and $\sigma_2^2$. Most of the time, they are not
known.
We then have to resort to a T-test. In addition, there are two cases:
 case 1: $\sigma_1^2 = \sigma_2^2$ (The variances are equal)
 case 2: $\sigma_1^2 \neq \sigma_2^2$ (The variances are not equal)
The formulas for the sampling distributions are slightly different for the two cases. While we
will be using Excel® to perform these calculations for us, it is instructive to be aware of the
calculations involved.
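The following outlines the calculations.
If $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal, then the quantity:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{S_p^2/n_1 + S_p^2/n_2}}$
follows a T-distribution with $n_1 + n_2 - 2$ degrees of freedom, where
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
is the pooled sample variance and $D_0$ is the hypothesised difference between the population means.
If $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal, the pooled variance is replaced by the separate
sample variances, so that the denominator becomes $\sqrt{S_1^2/n_1 + S_2^2/n_2}$, and the degrees of
freedom are approximated from the data.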
If the variances of the two populations can be assumed to be equal, then the equal variance
method is preferred. In cases where the variances are clearly different, then the unequal
variance case will need to be implemented. It should also be noted that there is a hypothesis
test that can be performed to test whether the two variances are equal.
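The two versions of the test statistic can be compared in a short sketch. The sick-day figures below are invented for illustration, the function names are hypothetical, and the degrees of freedom in the unequal-variance case use the Welch-Satterthwaite approximation, which is named here as an assumption since the text does not state which approximation is intended.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x1, x2, d0: float = 0.0):
    """Equal-variance case: returns (t, degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2) - d0) / math.sqrt(sp2 / n1 + sp2 / n2)
    return t, n1 + n2 - 2


def unequal_variance_t_statistic(x1, x2, d0: float = 0.0):
    """Unequal-variance case with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2
    t = (mean(x1) - mean(x2) - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical sick days for five employees in each department.
dept_a = [2, 4, 6, 8, 10]
dept_b = [4, 5, 5, 5, 6]
t_pooled, df_pooled = pooled_t_statistic(dept_a, dept_b)
t_unequal, df_unequal = unequal_variance_t_statistic(dept_a, dept_b)
print(f"Equal variances:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Unequal variances: t = {t_unequal:.3f}, df = {df_unequal:.2f}")
```

With equal sample sizes the two statistics coincide, but the unequal-variance case has fewer degrees of freedom, which makes the test more cautious when one sample is much more variable than the other.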
4. Using Excel for Testing Differences
Suppose that we were interested in determining whether there was any difference between
the number of sick days used by employees in two departments in a large firm over the period
of a year. The following table shows the results of the random sample of 20 employees taken
from each of the two departments.
Table 1: Random Sample of 20 Employees
Since the samples are independent, we need to perform an independent samples test for the
difference between the two means.
Before we can do the test though, we need to determine whether the variances in the two
groups are the same. One way to do this might be to generate summary statistics for each
group and make an informed decision. Alternatively, Excel® provides a statistical test for this
comparison known as the F-test. This test can be found under Data/Data Analysis/F-test: Two-
Sample for Variances. The following table shows the Excel® output from this test.
Table 2: F-test: Two-Sample for Variances
The p-value, in this case, is slightly less than 0.05, which might lead us to reject the null
hypothesis of equal variances and conclude that the variances are
significantly different from each other.
If we were to conclude this, then the appropriate T-test for the difference between the two
means would be performed. This test can be found in Excel® under Data/Data Analysis/t-test:
Two-Sample Assuming Unequal Variances.
The Excel® output for this test is shown below:
Table 3: T-test: Two-Sample Assuming Unequal Variances
If we were performing a two-tailed analysis, then, as can be seen from the Excel® output, the
p-value is 0.172. In this case, the p-value is relatively large, and we would not reject the null
hypothesis. We would conclude in this instance that there was insufficient evidence to
conclude that there was a difference between the average number of sick days used by
employees in the two departments.
Below is an exercise to practice what you have learnt on testing means.
Exercise: Testing Differences in Population Means

The owner of a city car park suspects that the person she hired to run the car park is stealing
some money. The receipts as provided by the employee indicate that the average number of
cars parked is 125 per day and that, on average, each car is parked for 3.5 hours. In order to
determine whether the employee is stealing, the owner watches the car park on five
randomly chosen days. On those days the number of cars parked is 120, 130, 124,
127, 128.
For the total number of cars that the owner observed during the five days, the mean and the
standard deviation of the time spent at the car park were 3.6 and 0.4 hours respectively.
Question 1: Which of the following would best represent the hypothesis test the owner
would be interested in testing?
1. H0 : µ = 125, H1 : µ ≠ 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
2. H0 : µ ≤ 125, H1 : µ > 125 and H0 : µ ≤ 3.5, H1 : µ > 3.5
3. H0 : µ ≥ 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
4. H0 : µ = 125, H1 : µ < 125 and H0 : µ ≥ 3.5, H1 : µ < 3.5
Question 2: Suppose that the p-values obtained for the tests were 0.002 and 0.6
respectively. The owner concludes that the employee is stealing. Is this statement true or
false?
1. True
2. False
5. Summary
 Population averages can be compared in two ways, namely the paired difference method
and the independent samples method.
 If the variances of the two populations can be assumed to be equal, then the equal
variance method is preferred.
 If the variances are clearly different, then the unequal variance case will need to be
implemented.
 A hypothesis test (the F-test) can be performed to test whether the two variances are equal.
 Hypothesis tests for the differences between two means can be performed using
Excel®.
6. Answers
Exercise: Testing Differences in Population Means
Question 1: The correct answer is option 2, H0: µ ≤ 125, H1: µ > 125 and H0: µ ≤ 3.5, H1: µ >
3.5
Question 2: The correct answer is option 1, True.
Segment: Regression Analysis
Topic: Simple Linear Regression – Part I
Simple Linear Regression
Table of Contents
1. Simple Linear Regression Model
2. Least Squares Method
3. Summary
4. Glossary
5. Answers
Introduction
Regression analysis is one of the most important and widely used statistical techniques. It has
many applications in business and economics.
We saw that there are two types of regression. In this topic, we will discuss simple linear
regression in detail. In simple linear regression, we model the relationship between two
variables.
Learning Objectives
At the end of this topic, you will be able to:
 comprehend the simple linear regression model and its significance
 use the least squares method to obtain the regression equation.
1. Simple Linear Regression Model

In many business situations, we would like to investigate and quantify the relationship between
two or more variables. For example, we may be interested in determining how expenditure on
advertising affects sales for a particular company.
If the two are related, it would be useful to quantify the relationship so that we can predict
one variable from the value of the other, or control one by controlling the other.
A simple type of relationship between two variables x and y is the linear relationship where the
graph of the relationship is a straight line as shown in the following figure.
Fig. 1: Simple Linear Relationship – Straight-line Graph

The figure given above shows a straight-line graph. The value of y at which the line crosses
the y-axis (that is, the value of y when x = 0) is called the intercept, while the change in y for
each one-unit increase in x, which determines how steep the line is, is called the slope.
The objective in simple linear regression is to find the relationship between two variables x and
y.
The technique involves developing a mathematical model to describe the relationship between
the variable we are trying to predict and the variable that we suspect is influencing this variable.
The variable we are trying to predict is known as the dependent variable (denoted by y) and the
variable that we would like to use to explain the dependent variable is known as the independent
variable (denoted by x).
In our example, the expenditure on advertising would be the independent variable and sales
would be the dependent variable.
The first step in developing the mathematical relationship is to define the population model. In
the case of simple linear regression, where our objective is to explain the relationship between
the independent variable and the dependent variable, this model takes the form of:
y = β0 + β1x + ε
where β0 is the y-intercept, β1 is the slope and ε is the error term.
The error term (ε) is included in the population model to account for all variables and
measurement uncertainties that we may not have included in our model but may influence the
dependent variable.
2. Least Squares Method

Suppose we suspect that a linear relationship exists between x and y and wish to quantify the
relationship in the form of an equation.
We could collect sample data for the two variables and plot them on a graph. The following
figure shows one such plot.
Fig. 2: Sample Scatter Plot
The figure above shows a sample scatter plot, with a series of points scattered almost in a
straight line from the 0 point.
The graph is called a scatter plot since it shows a scatter of points.
The scatter plot in the figure suggests that there is a linear relationship between the two
variables because the points seem to fall along a straight line.
To quantify the relationship, we fit a straight line to the scatter of points. This straight line is
called the regression line. The regression line has the form:
ŷ = b0 + b1x
where ŷ is the predicted value of y, b0 is the intercept and b1 is the slope.
The regression coefficients b0 and b1 are estimates of the true population regression
coefficients β0 and β1.
The technique that we use to fit a straight-line to the data is known as the least squares method.
The resulting line is known as the line of best fit.
The line of best fit will not necessarily pass through all of our sample data points. In most cases,
there will be differences between our data points and the line of best fit as shown in the
following figure.
Fig. 3: Residuals
This sample chart (Fig. 3) shows the difference, or sample error, between a data point and the
line of best fit.
The differences between the data points and the line of best fit are known as residuals and are
denoted by ei.
The residuals are defined as:
ei = yi − ŷi
where yi is the observed value and ŷi is the value predicted by the line of best fit for the ith
data point.
The least squares method involves minimising the sum of squares of error (SSE).
The sum of squares of error is defined as:
SSE = Σ ei² = Σ (yi − ŷi)²
The calculations required to determine the line of best fit are easily performed using statistical
software such as Microsoft Excel®.
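To make the least squares idea concrete, here is a minimal Python sketch of the calculation that software performs. It uses the standard closed-form solution for a straight line (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope × x̄), which is not spelled out in this topic. The data and the function names are made up for illustration only and are not the course's Table 1.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x that minimises SSE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def sum_of_squared_errors(x, y, b0, b1):
    """SSE: the sum of squared residuals e_i = y_i - y_hat_i."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (an assumption, NOT the course's Table 1).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_fit(x, y)
sse = sum_of_squared_errors(x, y, b0, b1)
print(f"y_hat = {b0:.2f} + {b1:.2f} x, SSE = {sse:.2f}")
```

For this toy data the fitted line is ŷ = 2.2 + 0.6x with SSE = 2.4. Any other straight line through the same points gives a larger SSE, which is what "line of best fit" means here.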
As an example, suppose we were interested in determining the relationship between advertising

and sales for a particular company and had collected annual sales and annual expenditure on
advertising for the past five years as follows:
Table 1: Annual Sales and Annual Expenditure on Advertising for the Past Five Years
If we suspect that there is a linear relationship between advertising and sales, then we can
perform a regression analysis on this data. To do this using Microsoft Excel®, we would select
Data/Data Analysis/Regression.
The regression output for this particular example using Microsoft Excel® is shown below.
Fig. 4: Regression Output

The output contains a number of calculations that can be used to determine the line of best fit
and to assess how well the line fits the data. We will explore this output in the following sections.
Below is an exercise to practice what you know about outliers.
Exercise: Simple Linear Regression

Read the given information before attempting the questions below.
Information:
A researcher was interested in determining whether there was a relationship between age and
salary for a group of CEOs.
Data were collected for 25 CEOs and charted as shown in the scatter plot below:
Fig. 5: Scatter Plot for 25 CEOs' Age and Salary
Question 1: What effect does the highest paid CEO have on the slope of the line of best fit?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit
Question 2: What would happen to the slope of the line of best fit if the highest paid CEO was
actually 50 years old instead of 70?
A) No change
B) Increases the slope of the line of best fit
C) Decreases the slope of the line of best fit
3. Summary
 Simple linear regression is a technique for estimating the straight line relationship between
two variables.
 The most popular approach to regression is the least squares method, which minimises the
SSE.
4. Glossary
Scatter plot A graph of paired sample data in which each (x, y) observation
is plotted as a point.
Line of best fit The straight line fitted to the sample data by the least squares
method, that is, the line that minimises the sum of squares of
error (SSE).
5. Answers
Exercise: Simple Linear Regression
Question 1: Correct answer is option B - Increases the slope of the line of best fit.
Question 2: Correct answer is option C - Decreases the slope of the line of best fit.
Topic: Simple Linear Regression – Part II
Table of Contents
1. Interpreting Coefficients
2. Assessing the Model
3. Prediction
4. Summary
Introduction
The relationship between the dependent and independent variables can be represented
mathematically by means of the simple linear regression model. Having seen how to obtain the
regression output from Microsoft Excel®, we now focus on evaluating the regression model.
We will study how to interpret the coefficients, assess the model, and predict the value of the
dependent variable for a given value of the independent variable.
Learning Objectives
At the end of this topic, you will be able to:

 evaluate the fit of a regression model
 use the regression model for prediction.
1. Interpreting Coefficients
The line of best fit can be determined from the regression output. The Excel® output example
discussed in Part I is shown below.
The resulting regression line can then be expressed as:
ŷ = -49.4 + 34.5x
where x is advertising expenditure and ŷ is predicted sales, both in thousands of US dollars.
The coefficient b0 in this example is -49.4. This is the y-intercept, that is, the predicted value of
y when our independent variable x is 0. Taken literally, it says that with no advertising (i.e.,
x = 0) we would have negative sales of US$49,400. Clearly this is not possible, so the intercept
has no meaningful interpretation in this case. We need to be particularly careful when predicting
outside our range of data. In this case, we did not have any data points where advertising levels
were around US$0.
The slope coefficient, b1, in this example is 34.5. We can interpret this as follows: for each
additional thousand dollars of advertising, sales will on average increase by US$34,500.
2. Assessing the Model
The least squares method produces the line of best fit; however, it is important to assess how
well the regression line fits the data. If the fit of the regression line is poor, we may need to
reconsider our model.
In order to evaluate the model, we can assess three items:
1. The standard error of estimate

2. A hypothesis test of the slope
3. The coefficient of determination
Read below to find out more about each of these. Note that all three items make use of the
Microsoft Excel® output example.
Evaluating the Regression Model
Standard error of estimate

The first item, standard error of estimate is included in the top left-hand section of the
Microsoft Excel® output example.
It is essentially the standard deviation of the residuals and is a measure of how far on average
our data points are from the line of best fit.
In general, the standard error of estimate indicates the level of accuracy of predictions that
can be made from the regression equation. The smaller the standard error of estimate, the
more accurate the predictions tend to be. Comparing the standard error of estimate to the
average value of the dependent variable data is one relative measure that can be used as a
guide to assessing the model.
In this case, the standard error of estimate is 79.749 which when compared to the average
value of our y variable data of 744 is quite reasonable.
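The following Python sketch shows one way the standard error of estimate could be computed by hand. It assumes the usual formula for simple linear regression, the square root of SSE / (n − 2), which is not stated explicitly in this topic. The data and the function name are made up for illustration. Only the figures 79.749 and 744 come from the Excel® output example.

```python
import math


def standard_error_of_estimate(x, y):
    """Standard error of estimate: sqrt(SSE / (n - 2)) for a fitted straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least squares slope and intercept.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Made-up illustrative data (an assumption, not the course's data set).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
se = standard_error_of_estimate(x, y)
relative_se = se / (sum(y) / len(y))  # standard error relative to the mean of y

# The same relative measure using the figures quoted in the text.
ratio_from_text = 79.749 / 744

print(f"toy data: se = {se:.3f}, relative to mean y = {relative_se:.1%}")
print(f"text example: {ratio_from_text:.1%} of average sales")
```

For the figures in the text, the standard error is about 11% of average sales, which is the comparison behind the judgement that the fit is quite reasonable.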
Hypothesis test of the slope

The second item used to evaluate the usefulness of a model is a hypothesis test of the slope
of the regression line. The hypotheses in this case are:
H0 : β1 = 0
H1 : β1 ≠ 0
The null hypothesis states that there is no linear relationship between the variables.
The p-value for this test is given in the Microsoft Excel® output example. In this case, it is
0.0038 or 0.38%. Since this value is smaller than a significance level of α = 0.05, we can
reject the null hypothesis and conclude that the slope is not 0. In other words, there seems to
be a significant relationship between sales and advertising.
It should be noted that the p-value reported in the Microsoft Excel® output is always for a
two-tailed test. This value would need to be halved if a one-tailed test is required (provided
the estimated slope has the sign stated in the alternative hypothesis).
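Excel® reports the p-value directly, but it can help to see the test statistic behind it. The sketch below assumes the standard t statistic for the slope, t = b1 / s(b1) with s(b1) = standard error of estimate / √Σ(x − x̄)², and n − 2 degrees of freedom. Because the Python standard library has no t-distribution, the sketch compares t with a tabulated critical value instead of computing a p-value. The data and the function name are made up for illustration.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, t), where t = b1 / s_b1 is the statistic for H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))   # standard error of estimate
    s_b1 = s_e / math.sqrt(s_xx)     # standard error of the slope
    return b1, b1 / s_b1


# Made-up illustrative data with n = 5 points (an assumption, not the course's data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, t = slope_t_statistic(x, y)

# Two-tailed 5% critical value of t with n - 2 = 3 degrees of freedom (from t tables).
T_CRITICAL = 3.182
reject_h0 = abs(t) > T_CRITICAL
print(f"b1 = {b1:.2f}, t = {t:.3f}, reject H0: {reject_h0}")
```

For this toy data t is about 2.12, which is below 3.182, so the null hypothesis would not be rejected. A p-value as small as the 0.0038 in the course example corresponds to a t statistic well beyond the critical value.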
Coefficient of determination (R2)

The value of the coefficient of determination, or R2, is between 0 and 1. It can be
interpreted as the proportion or percentage of the variation in the dependent variable that
can be explained by the independent variable.
The value for the coefficient of determination can be found in the top left-hand section of the
Microsoft Excel® output example, where it is labelled 'R Square'.
In this example, the R2 value is 0.957 or 95.7%. This means that advertising levels can
explain 95.7% of the variation in sales. In general, the higher the R2 value, the more
confidence we can have in our model.
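As a rough illustration of "proportion of variation explained", the sketch below computes R2 as 1 − SSE / SST, where SST is the total sum of squared deviations of y from its mean. The data and the function name are made up for illustration and are not taken from the Excel® output example.

```python
def r_squared(x, y):
    """Coefficient of determination, R2 = 1 - SSE / SST, for a fitted straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                        # total variation
    return 1 - sse / sst


# Made-up illustrative data (an assumption, not the course's data set).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(f"R2 = {r2:.2f}")
```

Here R2 is 0.60, so the line explains 60% of the variation in y for the toy data, a much weaker fit than the 95.7% in the advertising example.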
3. Prediction
Another important use of regression is the prediction of y when the x value is known. The
predicted value of y is denoted by ŷ and is calculated as:
ŷ = b0 + b1x
By substituting a particular value for the independent variable into the sample regression
equation, we can calculate a value for the dependent variable. This estimate is a point estimate.
For the example of sales and advertising, by substituting any given value for the independent
variable, advertising, the predicted value of sales can be obtained. If we substitute the
advertising value US$30,000 (x = 30, since advertising is measured in thousands of dollars) in
the equation, the predicted sales will be:
ŷ = -49.4 + 34.5 * 30
= 985.6 (approx. 986)
Therefore, we can say that for an advertising expenditure of US$30,000, the estimated sales
are US$986,000.
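The point estimate can be reproduced with a few lines of Python. The coefficients -49.4 and 34.5 are the ones quoted from the regression output. The function name is hypothetical, and both variables are assumed to be measured in thousands of US dollars, as in the worked example.

```python
# Coefficients quoted from the regression output in the text.
B0 = -49.4   # intercept (US$ '000)
B1 = 34.5    # slope: change in sales (US$ '000) per US$ '000 of advertising


def predict_sales(advertising_thousands):
    """Point estimate of sales (US$ '000) for a given advertising spend (US$ '000)."""
    return B0 + B1 * advertising_thousands


predicted = predict_sales(30)  # advertising of US$30,000
print(f"Predicted sales: about US${predicted:.1f} thousand")
```

The same caution about the intercept applies here: the prediction is only trustworthy for advertising levels within the range of the data used to fit the line.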
4. Summary
 The standard error of estimate, a hypothesis test of the slope and the coefficient of
determination R2 can be used to evaluate the fit of the regression model.
 The standard error of estimate indicates the level of accuracy of predictions that can be
made from the regression equation.
 The value of the coefficient of determination or R2 value is between 0 and 1.
 The regression model can be used for prediction.
Topic: Multiple Regression
Multiple Regression
Table of Contents
1. The Multiple Regression Model
2. Interpreting the Coefficients
3. Assessing the Model
3.1 Test for Significance of Regression
3.2 Test for Individual Regression Coefficients (t-tests)
3.3 R2 and Adjusted R2
4. Summary
Introduction
In regression analysis, there may often be several independent variables that contain
information about the variable we are trying to predict or understand.
The multiple regression model allows us to consider the relationship of a particular variable with
a set of independent variables.
Learning Objectives
At the end of this topic, you will be able to:
 explain the concept of multiple regression analysis
 assess the utility of the multiple regression model.
1. The Multiple Regression Model

Often the dependent variable y depends on more than one independent variable. It is then
necessary to relate y to all the independent variables in one equation. For instance, the sales of
a product may be influenced not only by the marketing budget, but also by its price, its quality,
the economy, or the degree of competition. Multiple regression is an analytical method that
relates a dependent variable to more than one independent variable.
The objective is to get an equation of the form:
ŷ = b0 + b1x1 + b2x2 + … + bkxk
where k is the number of independent variables x1, x2, …, xk that affect y.
Most of the concepts in multiple regression are similar to those in simple linear regression. For
instance, the equation that best fits the data is taken to be the one that minimises the sum of
the squares of the errors (SSE).
Consider the hypothetical example in the table below, which relates to the sales of a
particular company. Sales can be treated as the dependent variable, whereas the price of the
product and the advertising expenditure are the independent variables.
To determine the relationship among the variables price, advertisement, and sales for a
particular company, we collected the data for the three variables for the past ten years as
follows:
Table 1: Sales of a Company
Sales (US$ '000)    Price (US$)    Advertisement (US$ '000)
88                  117            33
90                  111            37
100                 120            34
62                  93             16
72                  97             28
94                  112            38
74                  88             33
73                  96             32
76                  99             35
93                  116            38
We assume that there is a linear relationship between the three variables and can use multiple
linear regression to understand the relationship. Here, price and advertisement are independent
variables that can be used to predict the dependent variable sales.
The regression output for this particular example using Microsoft Excel® is shown below.
2. Interpreting the Coefficients

The interpretation of the regression coefficients in the multiple regression model is similar to
the case for simple linear regression although a little extra care is required when interpreting
the coefficients.
For example, suppose our sample regression equation is in the form:
ŷ = b0 + b1x1 + b2x2
The coefficient for the intercept, b0, can be interpreted in the same way as in simple linear
regression and, in this case, represents the value of ŷ when x1 and x2 are both zero.
The coefficient of x1, which is b1, represents the increase in ŷ when x1 increases by one unit,
assuming that the other independent variable x2 is held constant.
Likewise, the coefficient of x2, which is b2, represents the increase in ŷ when x2 increases by
one unit, assuming that the other independent variable x1 is held constant.

In our example, we can write the regression equation using the Excel output as given below:
ŷ = -20.74 + 0.76x1 + 0.71x2
where x1 is the price (US$), x2 is the advertising expenditure (US$ '000) and ŷ is the predicted
sales (US$ '000).
The coefficient b0 in this example is -20.74, which is the value of ŷ when the two
independent variables x1 and x2 are both zero. Again, this value has no practical meaning,
since the price, at least, cannot be zero. The slope coefficient b1 in this example is 0.76. We
can interpret this as follows: for each additional dollar of price, sales will on average increase
by US$760, with advertising held constant. Similarly, the slope coefficient b2 in this example
is 0.71: for each additional thousand dollars of advertising, sales will on average increase by
US$710, with price held constant.
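The coefficients quoted from the Excel output can be checked against the data in Table 1. The sketch below is one possible hand calculation for the two-predictor case: it solves the least squares normal equations using deviations from the means. The function name and the solution method are illustrative choices, not the procedure Excel® itself uses.

```python
def fit_two_predictors(y, x1, x2):
    """Least squares estimates (b0, b1, b2) for y_hat = b0 + b1*x1 + b2*x2."""
    n = len(y)
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Sums of squares and cross-products of deviations from the means.
    s11 = sum((a - x1_bar) ** 2 for a in x1)
    s22 = sum((b - x2_bar) ** 2 for b in x2)
    s12 = sum((a - x1_bar) * (b - x2_bar) for a, b in zip(x1, x2))
    s1y = sum((a - x1_bar) * (c - y_bar) for a, c in zip(x1, y))
    s2y = sum((b - x2_bar) * (c - y_bar) for b, c in zip(x2, y))
    # Solve the two normal equations for the slopes (Cramer's rule).
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = y_bar - b1 * x1_bar - b2 * x2_bar
    return b0, b1, b2


# Data from Table 1: sales (US$ '000), price (US$), advertisement (US$ '000).
sales = [88, 90, 100, 62, 72, 94, 74, 73, 76, 93]
price = [117, 111, 120, 93, 97, 112, 88, 96, 99, 116]
advertisement = [33, 37, 34, 16, 28, 38, 33, 32, 35, 38]

b0, b1, b2 = fit_two_predictors(sales, price, advertisement)
print(f"y_hat = {b0:.2f} + {b1:.2f} x1 + {b2:.2f} x2")
```

Rounded to two decimal places, this reproduces the intercept of -20.74 and the slopes of 0.76 and 0.71 discussed above.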
3. Assessing the Model
3.1 Test for Significance of Regression
To test the utility of the regression model we can specify the following hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero
This test is known as the F-Test and the p-value for this test can be found in the ANOVA section
of the Microsoft Excel® output.
There are two possible outcomes. The null hypothesis for this test is either:
1. Not rejected – there is insufficient evidence that any of the independent variables is
linearly related to the dependent variable, and therefore the model has limited usefulness.
2. Rejected – at least one of the coefficients of the independent variables is not equal to
zero, and the model does have some usefulness.
From the Excel output table, under the ANOVA section, it is evident that we can reject the null
hypothesis since significance F is 0.00 which is less than 0.05. This means at least one of the
coefficients of the independent variables is not equal to zero and we can use the model.
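For readers who want to see where the F-test comes from, the sketch below computes the F statistic for the Table 1 data. It assumes the usual ANOVA definition, F = (SSR / k) / (SSE / (n − k − 1)), where SSR = SST − SSE is the variation explained by the regression. Because the Python standard library has no F-distribution, the statistic is compared with a tabulated 5% critical value instead of a p-value. The function name is hypothetical.

```python
def regression_f_statistic(y, x1, x2):
    """F = (SSR / k) / (SSE / (n - k - 1)) for a model with k = 2 predictors."""
    n, k = len(y), 2
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - x1_bar) ** 2 for a in x1)
    s22 = sum((b - x2_bar) ** 2 for b in x2)
    s12 = sum((a - x1_bar) * (b - x2_bar) for a, b in zip(x1, x2))
    s1y = sum((a - x1_bar) * (c - y_bar) for a, c in zip(x1, y))
    s2y = sum((b - x2_bar) * (c - y_bar) for b, c in zip(x2, y))
    # Least squares slopes from the normal equations.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    sst = sum((c - y_bar) ** 2 for c in y)  # total variation in y
    ssr = b1 * s1y + b2 * s2y               # variation explained by the regression
    sse = sst - ssr                         # unexplained (error) variation
    return (ssr / k) / (sse / (n - k - 1))


# Data from Table 1: sales (US$ '000), price (US$), advertisement (US$ '000).
sales = [88, 90, 100, 62, 72, 94, 74, 73, 76, 93]
price = [117, 111, 120, 93, 97, 112, 88, 96, 99, 116]
advertisement = [33, 37, 34, 16, 28, 38, 33, 32, 35, 38]

f_stat = regression_f_statistic(sales, price, advertisement)

# 5% critical value of F with (2, 7) degrees of freedom (from F tables).
F_CRITICAL = 4.74
print(f"F = {f_stat:.1f}, reject H0: {f_stat > F_CRITICAL}")
```

The F statistic is roughly 52, far above the critical value, which agrees with the very small significance F reported in the ANOVA section.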
3.2 Test for Individual Regression Coefficients (t-tests)
The table below the ANOVA section can be used to understand the relationship between each
independent variable and the dependent variable. To check whether the regression coefficient
of each independent variable is significant, we can use the t-test values provided in the last
table of the multiple regression analysis output. Based on the p-value, we can decide whether
a particular independent variable should be kept in or removed from the model. The p-value
indicates whether there is a linear relationship between the dependent variable and that
independent variable. The fifth column in the table shows the p-value for the intercept and for
each independent variable. Since we are only interested in the relationship between the
dependent and independent variables, the p-value for the intercept can be ignored.
The null and alternative hypotheses can be defined as:
H0: βi = 0
H1: βi ≠ 0
where βi is the slope coefficient of the ith independent variable.
In our example, both the p-values for the independent variables are less than 0.05. Therefore,
we can state that variables price and advertising are significant in the regression model which
can be used to predict sales.
3.3 R2 and Adjusted R2

As with simple linear regression, the coefficient of determination is denoted by R2. A weakness
of R2 is that it never decreases when another variable is added to the model, even if that
variable is totally irrelevant.
For example, suppose a regression has an R2 value of 0.75. Now suppose we add an irrelevant
independent variable to the regression model, just because we happen to have that data. When
we complete the regression computations, the R2 value may very well have increased from
0.75 to 0.80.
Indeed, it can be shown that adding a variable can only increase R2 (or leave it unchanged) and
can never decrease it.
In other words, if we assess how good a regression is solely by looking at how large the R2 value
is, we could be making a mistake. It would be helpful to have an alternative measure for this
purpose.
Such a measure is the adjusted R2.
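R2 is defined as: R2 = 1 – SSE / SST
and adjusted R2 is defined as: Adjusted R2 = 1 – MSE / MST
where MSE = SSE / (n – k – 1) and MST = SST / (n – 1), with n the number of observations and
k the number of independent variables.
Thus, the adjusted R2 pays attention to the degrees of freedom.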
Consequently, when an irrelevant independent variable is added, the Adjusted R2 will most likely
decrease. Therefore, it is customary to look at the value of Adjusted R2 in addition to the R2 value
when assessing multiple regression results. The adjusted R2 adjusts for the number of predictor
terms in the model.
In our example, we can see that the R square value is 0.94 and the adjusted R square is 0.92.
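The relationship between the two measures can be sketched in Python. The code uses an algebraically equivalent form of the definition, Adjusted R2 = 1 − (1 − R2)(n − 1) / (n − k − 1), which follows from substituting MSE = SSE / (n − k − 1) and MST = SST / (n − 1). The function name is hypothetical, and the 0.75 and 0.76 values in the second part are made-up numbers chosen only to show the penalty for an extra variable.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1).

    n is the number of observations and k the number of independent variables.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Figures from the example: R2 = 0.94 with n = 10 years of data and k = 2 predictors.
adj = adjusted_r_squared(0.94, n=10, k=2)
print(f"Adjusted R2 = {adj:.2f}")

# Hypothetical illustration: an extra variable lifts R2 only slightly (0.75 to 0.76),
# so the adjusted R2 falls even though R2 itself has risen.
before = adjusted_r_squared(0.75, n=10, k=1)
after = adjusted_r_squared(0.76, n=10, k=2)
print(f"before: {before:.3f}, after: {after:.3f}")
```

With the rounded R2 of 0.94 this gives an adjusted R2 of about 0.92, matching the Excel® output. In the hypothetical case the adjusted R2 drops from about 0.719 to about 0.691, which is the warning sign that the added variable is not pulling its weight.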
4. Summary
 Multiple regression is an analysis that relates a dependent variable to more than one
independent variable.
 In assessing a multiple regression model:
o the F-test can be used to test the overall utility of the model
o separate t-tests can be used to check the significance of each individual variable.