Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
Contents

1 Introduction

2 Probability
   2.1 Sample spaces
   2.2 Events
   2.3 Axioms of probability
   2.4 Conditional probability and independence
   2.5 Probability rules
   2.6 Random variables

3 Discrete Distributions
   3.1 Introduction
   3.2 Binomial distribution
   3.3 Geometric distribution
   3.4 Negative binomial distribution
   3.5 Hypergeometric distribution
   3.6 Poisson distribution

4 Continuous Distributions
   4.1 Introduction
   4.2 Exponential distribution
   4.3 Gamma distribution
   4.4 Normal distribution

5 Reliability and Lifetime Distributions
   5.1 Weibull distribution
   5.2 Reliability functions
   5.3 Maximum likelihood estimation
   5.4 Quantile-quantile plots

6 Statistical Inference
   6.1 Populations and samples
   6.2 Parameters and statistics
   6.3 Point estimators and standard errors
   6.4 Sampling distribution of Ȳ
   6.5 The Central Limit Theorem
   6.6 The t distribution
   6.7 Normal quantile-quantile plots

7 One-Sample Inference
   7.1 Confidence interval for a population mean μ
   7.2 Confidence interval for a population variance σ²
   7.3 Confidence interval for a population proportion p
   7.4

8 Two-Sample Inference
   8.1 Confidence interval for the difference of two population means μ1 − μ2
   8.2 Confidence interval for the ratio of two population variances σ2²/σ1²
   8.3

9
   9.1 Introduction
   9.2
   9.3
CHAPTER 1
Introduction
Definition: Statistics is the science of data: how to interpret data, analyze data, and
design studies to collect data.
Statistics is used in all disciplines, not just in engineering.
"Statisticians get to play in everyone else's back yard." (John Tukey)
Examples:
1. In a reliability (time-to-event) study, engineers are interested in describing the time
until failure for a jet engine fan blade.
2. In a genetics study involving patients with Alzheimer's disease (AD), researchers
wish to identify genes that are differentially expressed (when compared to non-AD
patients).
3. In an agricultural experiment, researchers want to know which of four fertilizers
(which vary in their nitrogen contents) produces the highest corn yield.
4. In a clinical trial, physicians want to determine which of two drugs is more effective
for treating HIV in the early stages of the disease.
5. In a public health study involving at-risk teenagers, epidemiologists want to know
whether smoking is more common in a particular demographic class.
6. A food scientist is interested in determining how different feeding schedules (for
pigs) could affect the spread of salmonella during the slaughtering process.
7. A pharmacist is concerned that administering caffeine to premature babies will
increase the incidence of necrotizing enterocolitis.
8. A research dietician wants to determine if academic achievement is related to body
mass index (BMI) among African American students in the fourth grade.
What we do: Statisticians use their skills in mathematics and computing to formulate
statistical models and analyze data for a specific problem at hand. These models are
then used to estimate important quantities of interest, to test the validity of proposed
conjectures, and to predict future behavior. Being able to identify and model sources of
variability is an important part of this process.

Definition: A deterministic model is one that makes no attempt to explain variability.
For example, in chemistry, the ideal gas law states that

PV = nRT,

where P = pressure of a gas, V = volume, n = the amount of substance of gas (number
of moles), R = the universal gas constant, and T = temperature. In circuit analysis,
Ohm's law states that

V = IR,

where V = voltage, I = current, and R = resistance.
In both of these models, the relationship among the variables is completely determined without ambiguity.
In real life, this is rarely true for the obvious reason: there is natural variation that
arises in the measurement process.

• For example, a common electrical engineering experiment involves setting up a
simple circuit with a known resistance R. For a given current I, different students
will then measure the voltage V.
• With a sample of n = 20 students, conducting the experiment in succession,
we might very well get 20 different measured voltages!
• A deterministic model is too simplistic; it does not acknowledge the inherent
variability that arises in the measurement process.
Usefulness: Statistical models are not deterministic. They incorporate variability.
They can also be used to predict future outcomes.
Here are SAT/HS GPA data on n = 50 freshmen and their final MATH 141 scores:

[Figure: scatterplots of final MATH 141 score (50-100) against SAT MATH (400-800) and HS GPA (2.0-4.0).]
Note: A statistical model for Y might look like something like this:

Y = β0 + β1x1 + β2x2 + ε,

where ε is a term that accounts for not only measurement error (e.g., incorrect student
information, data entry errors, grading errors, etc.) but also

• all of the other variables not accounted for (e.g., major, difficulty of schedule, study
habits, natural ability, etc.) and
• the error induced by assuming a linear relationship between Y and x1 and x2
when, in fact, the relationship may not be linear.

Discussion:

• Is this sample of students representative of some larger population? After all, we
would like our model/predictions to be useful on a larger scale (and not simply for
these 50 students). This is the idea behind statistical inference. We would like
to use sample information to make statements about a larger (relevant) population.
• How should we estimate β0, β1, and β2 in the model above? If we can do this,
then we can produce predictions of Y on a student-by-student basis (e.g., for future
students, etc.). This may be of interest to academic advisers who are trying to
model the success of their incoming students. We can also characterize numerical
uncertainty with our predictions.

Probability is the mathematics of uncertainty and forms the basis for all of
statistics. Therefore, we start here.
CHAPTER 2
Probability
2.1 Sample spaces
Terminology: The set of all possible outcomes for a given random experiment is called
the sample space, denoted by S.
The number of outcomes in S is denoted by nS .
Example 2.1. In each of the following random experiments, we write out a corresponding sample space.
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.
The size of the set of all possible outcomes is nS = 1000.
(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):
S = {0, 1}.
The size of the set of all possible outcomes is nS = 2.
(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then

S = {ab, ac, ad, bc, bd, cd}.

The size of the set of all possible outcomes is nS = 6. If the positions are different (e.g.,
project leader, assistant project leader, etc.), then

S = {ab, ba, ac, ca, ad, da, bc, cb, bd, db, cd, dc}.

In this case, the size of the set of all possible outcomes is nS = 12.
Terminology: Suppose that S is a sample space for a random experiment. We would
like to assign probability to an event A. This will quantify how likely the event is. The
probability that the event A occurs is denoted by P (A).
Terminology: Suppose that a sample space S contains nS < ∞ outcomes, each of which
is equally likely. If the event A contains nA outcomes, then

P(A) = nA/nS.

This is called an equiprobability model. Its main requirement is that all outcomes in
S are equally likely.

Important: If the outcomes in S are not equally likely, then this result is not
applicable.
Example 2.1 (continued). In the random experiments from Example 2.1, we use the
previous result to assign probabilities to events (if applicable).
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.
There are nS = 1000 outcomes possible. Let the event
A = {000, 005, 010, 015, ..., 990, 995}
= {winning number is a multiple of 5}.
There are nA = 200 outcomes in A. Therefore,

P(A) = 200/1000 = 0.20.
(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then

S = {ab, ac, ad, bc, bd, cd}.

There are nS = 6 outcomes possible. If A is the event that applicant d is selected for one
of the two positions, then

A = {ad, bd, cd} = {applicant d is chosen}.

There are nA = 3 outcomes in A. If each applicant has the same chance of being selected
(an assumption), then each of the nS = 6 outcomes in S is equally likely. Therefore,

P(A) = 3/6 = 0.50.
Again, this calculation is valid only if the outcomes in S are equally likely.
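Aside: equiprobability calculations like these are easy to check by enumeration in R. Here is a minimal sketch (the object names lottery and pairs are illustrative, not from the notes):

> lottery = 0:999                        ## sample space for part (a), nS = 1000
> mean(lottery %% 5 == 0)                ## proportion of outcomes in A
[1] 0.2
> pairs = combn(c("a","b","c","d"), 2)   ## unordered pairs for part (c), nS = 6
> mean(apply(pairs, 2, function(s) "d" %in% s))
[1] 0.5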
Interpretation: What does P (A) measure? There are two main interpretations:
P (A) measures the likelihood that A will occur on any given experiment.
If the experiment is performed many times, then P (A) can be interpreted as the
percentage of times that A will occur over the long run. This is called the relative
frequency interpretation.
Example 2.2. Suppose a baseball team's winning percentage is 0.571. We can interpret
this as the probability that the team will win a particular game. We can also interpret
this as the long-run percentage of games won (over the course of a season, say). I used
R to simulate this team's winning percentage over the course of a 162-game season.

Figure 2.1: Plot of a baseball team's winning percentage over 162 games. A horizontal line
at 0.571 has been added.
Curiosity: Why did I pick 0.571? This was the winning percentage of the Oakland A's
during the 2002 season after 119 games (68-51). If you have seen Moneyball, you know
that the A's then went on a 20-game winning streak ("The Streak"). To get an idea of
how amazing this run was, let's simulate 20 game outcomes assuming that 0.571 is the
correct winning probability for each game:

> games = rbinom(20,1,0.571)
> games
[1] 1 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 1
> sum(games)
[1] 13

In this simulation, the A's won 13 games out of 20.
Now, let's simulate the process of playing 20 games 1000 times. Let's keep track of the
number of times (out of 1000) that the team would win 20 games in a row:
> games = rbinom(1000,20,0.571)
> length(games[games>19])
[1] 0
In 1000 simulated 20-game stretches, the team never won 20 games in a row.
Let's simulate 10,000 20-game stretches:
> games = rbinom(10000,20,0.571)
> length(games[games>19])
[1] 0
In 10,000 simulated 20-game stretches, the team never won 20 games in a row.
Let's simulate 1,000,000 20-game stretches:
> games = rbinom(1000000,20,0.571)
> length(games[games>19])
[1] 13
In 1,000,000 simulated 20-game stretches, the team won 20 games in a row 13 times.
Using the relative frequency interpretation of probability, we could say that

P(winning 20 games in a row) ≈ 13/1,000,000 = 0.000013.
2.2 Events

Example 2.3. Suppose a couple has two male children, and each child either has a certain
genetic disease or does not. A sample space for the pair of outcomes is

S = {++, +−, −+, −−},

where + means the male offspring has the disease and − means the male does not have
the disease. To compute the probabilities, we will assume that each outcome
in S is equally likely. Define the events:

A = {first child has disease} = {++, +−}
B = {second child has disease} = {++, −+}.
The union and intersection of A and B are, respectively,

A ∪ B = {either child has disease} = {++, +−, −+}
A ∩ B = {both children have disease} = {++}.

The probability that either male child will have the disease is

P(A ∪ B) = nA∪B/nS = 3/4 = 0.75.

The probability that both male children will have the disease is

P(A ∩ B) = nA∩B/nS = 1/4 = 0.25.
2.3 Axioms of probability
A probability function P assigns numbers to events in S so that

(1) P(A) ≥ 0 for every event A,
(2) P(S) = 1, and
(3) if A1, A2, ... are pairwise mutually exclusive events, then

P(A1 ∪ A2 ∪ A3 ∪ ⋯) = P(A1) + P(A2) + P(A3) + ⋯;

that is, the probability of a union of mutually exclusive events equals the sum of the
individual event probabilities.
2.4 Conditional probability and independence
Note: In some situations, we may have prior knowledge about the likelihood of other
events related to the event of interest. We can then incorporate this information into a
probability calculation.
Terminology: Let A and B be events in a sample space S with P(B) > 0. The
conditional probability of A, given that B has occurred, is

P(A|B) = P(A ∩ B)/P(B).

Similarly,

P(B|A) = P(A ∩ B)/P(A).
Example 2.4. In a company, 36 percent of the employees have a degree from a SEC
university, 22 percent of those employees with a degree from the SEC are engineers, and
30 percent of the employees are engineers. An employee is selected at random.
(a) Compute the probability that the employee is an engineer and is from the SEC.
(b) Compute the conditional probability that the employee is from the SEC, given
that s/he is an engineer.
Solution. Define the events A = {employee is an engineer} and B = {employee has a
degree from a SEC university}. We are given P(B) = 0.36, P(A|B) = 0.22, and
P(A) = 0.30. In part (a), we want P(A ∩ B). From the definition of conditional
probability,

P(A|B) = P(A ∩ B)/P(B) ⇒ 0.22 = P(A ∩ B)/0.36.

Therefore,

P(A ∩ B) = 0.22(0.36) = 0.0792.

In part (b), we want P(B|A). From the definition of conditional probability:

P(B|A) = P(A ∩ B)/P(A) = 0.0792/0.30 = 0.264.
Important: Note that, in this example, the conditional probability P (B|A) and the
unconditional probability P (B) are not equal.
In other words, knowledge that A has occurred has changed the likelihood that
B occurs.
In other situations, it might be that the occurrence (or non-occurrence) of a companion
event has no effect on the probability of the event of interest. This leads us
to the definition of independence.
Terminology: Events A and B are said to be independent if

P(A ∩ B) = P(A)P(B).

When A and B are independent (and have positive probability),

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A)

and

P(B|A) = P(B ∩ A)/P(A) = P(B)P(A)/P(A) = P(B).
Example 2.5. In an engineering system, two components are placed in series; that is,
the system is functional as long as both components are. Each component is functional
with probability 0.95. Define the events

A1 = {component 1 is functional}
A2 = {component 2 is functional}

so that P(A1) = P(A2) = 0.95. Because we need both components to be functional, the
probability that the system is functional is given by P(A1 ∩ A2).

• If the components operate independently, then A1 and A2 are independent events
and the system reliability is

P(A1 ∩ A2) = P(A1)P(A2) = 0.95(0.95) = 0.9025.

• If the components do not operate independently (e.g., failure of one component
wears on the other), we cannot compute P(A1 ∩ A2) without additional knowledge.
Extension: The notion of independence extends to any finite collection of events
A1, A2, ..., An. Mutual independence means that the probability of the intersection of
any sub-collection of A1, A2, ..., An equals the product of the probabilities of the events
in the sub-collection. For example, if A1, A2, A3, and A4 are mutually independent, then

P(A1 ∩ A2) = P(A1)P(A2)
P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3)
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).
2.5
Probability rules
Bayes' Rule: For events A and B in S with P(A) > 0 and 0 < P(B) < 1,

P(B|A) = P(A|B)P(B)/P(A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)],

where Bᶜ denotes the complement of B. The expansion in the denominator,
P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ), is called the Law of Total Probability.
Example 2.6. The probability that train 1 is on time is 0.95. The probability that train
2 is on time is 0.93. The probability that both are on time is 0.90. Define the events

A1 = {train 1 is on time}
A2 = {train 2 is on time}.

We are given that P(A1) = 0.95, P(A2) = 0.93, and P(A1 ∩ A2) = 0.90.

(a) What is the probability that train 1 is not on time?

P(A1ᶜ) = 1 − P(A1) = 1 − 0.95 = 0.05.

(b) What is the probability that at least one train is on time?

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) = 0.95 + 0.93 − 0.90 = 0.98.
(c) What is the probability that train 1 is on time given that train 2 is on time?

P(A1|A2) = P(A1 ∩ A2)/P(A2) = 0.90/0.93 ≈ 0.968.

(d) What is the probability that train 2 is on time given that train 1 is not on time?

P(A2|A1ᶜ) = P(A1ᶜ ∩ A2)/P(A1ᶜ)
          = [P(A2) − P(A1 ∩ A2)]/[1 − P(A1)]
          = (0.93 − 0.90)/(1 − 0.95) = 0.60.
Example 2.7. An insurance company classifies new policy-holders as accident-prone or
not. Suppose 30 percent of new policy-holders are accident-prone, an accident-prone
policy-holder has an accident with probability 0.4, and a policy-holder who is not
accident-prone has an accident with probability 0.2. Let A = {policy-holder has an
accident} and B = {policy-holder is accident-prone}, so that P(B) = 0.3, P(A|B) = 0.4,
and P(A|Bᶜ) = 0.2.

(a) What is the probability that a new policy-holder will have an accident?
Solution: By the Law of Total Probability,

P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ) = 0.4(0.3) + 0.2(0.7) = 0.26.

(b) Suppose that the policy-holder does have an accident. What is the probability that
s/he was accident-prone?

Solution: We want P(B|A). By Bayes' Rule,

P(B|A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)]
       = 0.4(0.3)/[0.4(0.3) + 0.2(0.7)] ≈ 0.46.
2.6
Random variables
Example 2.8. Classify the following random variables as discrete or continuous and
specify the support of each random variable.

W = pH of an aqueous solution
X = length of time between accidents at a factory
CHAPTER 3
Discrete Distributions
3.1
Introduction
Recall: A discrete random variable Y is described by its probability mass function
(pmf) pY(y) = P(Y = y), which satisfies

Σall y pY(y) = 1.
Example 3.1. A mail-order computer business has six telephone lines. Let Y denote
the number of lines in use at a specic time. Suppose that the probability mass function
(pmf) of Y is given by
y        0     1     2     3     4     5     6
pY(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04
Figure 3.1 (left) displays pY (y), the probability mass function (pmf) of Y .
The height of the bar above y is equal to pY (y) = P (Y = y).
If y is not equal to 0, 1, 2, 3, 4, 5, 6, then pY (y) = 0.
Figure 3.1 (right) displays the cumulative distribution function (cdf) of Y .
FY(y) = P(Y ≤ y).
[Figure 3.1: PMF (left) and CDF (right) of Y in Example 3.1.]
(a) What is the probability that exactly two lines are in use?
pY (2) = P (Y = 2) = 0.20.
(b) What is the probability that at most two lines are in use?
P(Y ≤ 2) = P(Y = 0) + P(Y = 1) + P(Y = 2)
         = pY(0) + pY(1) + pY(2)
         = 0.10 + 0.15 + 0.20 = 0.45.

Note: This is also equal to FY(2) = 0.45 (see graph above).
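Aside: pmf calculations like these are easy to do in R by storing the pmf as a vector; a quick sketch (not part of the original example):

> y = 0:6
> p = c(0.10,0.15,0.20,0.25,0.20,0.06,0.04)
> sum(p[y <= 2])   ## P(Y <= 2)
[1] 0.45
> cumsum(p)        ## cdf values F_Y(0), ..., F_Y(6)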
Terminology: Let Y be a discrete random variable with pmf pY(y). The expected
value of Y is given by

μ = E(Y) = Σall y y pY(y).

The expected value for a discrete random variable Y is a weighted average of the possible
values of Y. Each value y is weighted by its probability pY(y). In statistical applications,
μ = E(Y) is called the population mean.
Example 3.1 (continued). In Example 3.1, we examined the distribution of Y, the
number of lines in use at a specified time. The pmf of Y is

y        0     1     2     3     4     5     6
pY(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04

The expected value of Y is

μ = E(Y) = Σall y y pY(y)
  = 0(0.10) + 1(0.15) + 2(0.20) + 3(0.25) + 4(0.20) + 5(0.06) + 6(0.04) = 2.64.
Result: Let Y be a discrete random variable with pmf pY(y), and let g be a real-valued
function. Then g(Y) is a random variable and

E[g(Y)] = Σall y g(y) pY(y).

Properties: Suppose that g, g1, g2, ..., gk are real-valued functions, and let c be any
real constant. Expectations satisfy the following (linearity) properties:

(a) E(c) = c
(b) E[cg(Y)] = cE[g(Y)]
(c) E[g1(Y) + g2(Y) + ⋯ + gk(Y)] = E[g1(Y)] + E[g2(Y)] + ⋯ + E[gk(Y)].
Example 3.2. Let Y denote the number of gallons of a toxic chemical that is produced
per hour at a plant, with pmf

y       0    1    2    3
pY(y)  0.2  0.3  0.3  0.2

(a) Compute the expected number of gallons produced during a one-hour period.
Solution: The expected value of Y is

μ = E(Y) = Σall y y pY(y) = 0(0.2) + 1(0.3) + 2(0.3) + 3(0.2) = 1.5.

Therefore, we would expect 1.5 gallons of the toxic chemical to be produced per hour
(on average).
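In R, this weighted average is a one-liner; a sketch (assuming the pmf is stored as below):

> y = 0:3
> p = c(0.2,0.3,0.3,0.2)
> sum(y*p)     ## E(Y)
[1] 1.5
> sum(y^2*p)   ## E(Y^2), which we will need shortly
[1] 3.3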
PAGE 22
0.6
CDF
0.0
0.00
0.05
0.2
0.10
0.4
0.15
PMF
0.20
0.25
0.8
0.30
1.0
CHAPTER 3
Therefore,
E[g(Y )] = E(3 + 12Y + 2Y 2 ) = 3 + 12E(Y ) + 2E(Y 2 )
= 3 + 12(1.5) + 2(3.3) = 27.6.
The expected hourly cost is $2, 760.00.
Terminology: Let Y be a discrete random variable with pmf pY(y) and expected value
E(Y) = μ. The variance of Y is given by

σ² = var(Y) = E[(Y − μ)²] = Σall y (y − μ)² pY(y).
The standard deviation of Y is the positive square root of the variance:

σ = √var(Y).
Computing Formula: Let Y be a random variable with mean E(Y) = μ. An
alternative computing formula for the variance is

var(Y) = E[(Y − μ)²] = E(Y²) − [E(Y)]².

This formula is easy to remember and can make calculations easier.
Example 3.2 (continued). In Example 3.2, we examined the pmf for Y, the number of
gallons of a toxic chemical that is produced per hour. We computed E(Y) = 1.5 and
E(Y²) = 3.3. The variance of Y is

σ² = var(Y) = E(Y²) − [E(Y)]² = 3.3 − (1.5)² = 1.05 (gallons)².

The standard deviation of Y is

σ = √1.05 ≈ 1.02 gallons.
3.2
Binomial distribution
Examples:

• When circuit boards used in the manufacture of Blu-ray players are tested, the
long-run percentage of defective boards is 5 percent.
  - circuit board = trial
  - defective board is observed = success
  - p = P(success) = P(defective board) = 0.05.

• Ninety-eight percent of all air traffic radar signals are correctly interpreted the first
time they are transmitted.
  - radar signal = trial
  - signal is correctly interpreted = success
  - p = P(success) = P(correct interpretation) = 0.98.

• Albino rats used to study the hormonal regulation of a metabolic pathway are
injected with a drug that inhibits body synthesis of protein. The probability that
a rat will die from the drug before the study is complete is 0.20.
  - rat = trial
  - dies before study is over = success
  - p = P(success) = P(dies early) = 0.20.
PMF: Suppose that n Bernoulli trials are performed, each with success probability p,
and let Y denote the number of successes. Then Y has a binomial distribution, written
Y ∼ b(n, p), with pmf

pY(y) = (n choose y) p^y (1 − p)^(n−y),  y = 0, 1, 2, ..., n,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ∼ b(n, p), then

E(Y) = np
var(Y) = np(1 − p).
Example 3.3. Suppose four plots of land are treated and each plot responds to the
treatment with probability p = 0.4. If the Bernoulli trial assumptions hold (independent
plots, same response probability for each plot), then

Y = the number of plots which respond ∼ b(n = 4, p = 0.4).

(a) What is the probability that exactly two plots respond?

P(Y = 2) = pY(2) = (4 choose 2)(0.4)²(1 − 0.4)^(4−2) = 6(0.4)²(0.6)² = 0.3456.
Figure 3.3: PMF (left) and CDF (right) of Y ∼ b(n = 4, p = 0.4) in Example 3.3.
(b) What is the probability that at least one plot responds?

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − (4 choose 0)(0.4)⁰(1 − 0.4)⁴ = 1 − 1(1)(0.6)⁴ = 0.8704.
(c) What are E(Y) and var(Y)?

E(Y) = np = 4(0.4) = 1.6
var(Y) = np(1 − p) = 4(0.4)(0.6) = 0.96.
Example 3.4. An electronics manufacturer claims that 10 percent of its power supply
units need servicing during the warranty period. Technicians at a testing laboratory
purchase 30 units and simulate usage during the warranty period. We interpret
power supply unit = trial
supply unit needs servicing during warranty period = success
p = P (success) = P (supply unit needs servicing) = 0.1.
Figure 3.4: PMF (left) and CDF (right) of Y ∼ b(n = 30, p = 0.1) in Example 3.4.
If the Bernoulli trial assumptions hold (independent units, same probability of needing
service for each unit), then

Y = the number of units needing service ∼ b(n = 30, p = 0.1).

BINOMIAL R CODE: Suppose that Y ∼ b(n, p).

pY(y) = P(Y = y):  dbinom(y,n,p)
FY(y) = P(Y ≤ y):  pbinom(y,n,p)
(a) What is the probability that exactly five of the 30 power supply units require
servicing during the warranty period?

pY(5) = P(Y = 5) = (30 choose 5)(0.1)⁵(1 − 0.1)^(30−5)
      = dbinom(5,30,0.1) = 0.1023048.
(b) What is the probability at most five of the 30 power supply units require service?

FY(5) = P(Y ≤ 5) = Σ(y=0 to 5) (30 choose y)(0.1)^y(1 − 0.1)^(30−y)
      = pbinom(5,30,0.1) = 0.9268099.
(c) What is the probability at least five of the 30 power supply units require service?

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − Σ(y=0 to 4) (30 choose y)(0.1)^y(1 − 0.1)^(30−y)
         = 1-pbinom(4,30,0.1) = 0.1754949.
(d) What is P(2 ≤ Y ≤ 8)?

P(2 ≤ Y ≤ 8) = Σ(y=2 to 8) (30 choose y)(0.1)^y(1 − 0.1)^(30−y)
             = pbinom(8,30,0.1)-pbinom(1,30,0.1) ≈ 0.814.
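As a sanity check (a simulation sketch in the same spirit as the Moneyball example, not from the original notes), we can approximate the answer to part (d) by simulating many binomial outcomes:

> y = rbinom(100000,30,0.1)   ## 100,000 simulated values of Y
> mean(y >= 2 & y <= 8)       ## should be close to 0.814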
3.3 Geometric distribution
Note: The geometric distribution also arises in experiments involving Bernoulli trials:
1. Each trial results in a success or a failure.
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.
PMF: Let Y denote the number of trials needed to observe the first success. Then Y has
a geometric distribution, written Y ∼ geom(p), with pmf

pY(y) = (1 − p)^(y−1) p,  y = 1, 2, 3, ...,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ∼ geom(p), then

E(Y) = 1/p
var(Y) = (1 − p)/p².
Example 3.5. Biology students are checking the eye color of fruit flies. For each fly, the
probability of observing white eyes is p = 0.25. We interpret

• fruit fly = trial
• fly has white eyes = success
• p = P(success) = P(white eyes) = 0.25.

If the Bernoulli trial assumptions hold (independent flies, same probability of white eyes
for each fly), then

Y = the number of flies checked until the first white-eyed fly ∼ geom(p = 0.25).
(a) What is the probability the first white-eyed fly is observed on the fifth fly checked?

pY(5) = P(Y = 5) = (1 − 0.25)^(5−1)(0.25) = (0.75)⁴(0.25) ≈ 0.079.
Figure 3.5: PMF (left) and CDF (right) of Y ∼ geom(p = 0.25) in Example 3.5.
(b) What is the probability the first white-eyed fly is observed before the fourth fly is
examined? Note: For this to occur, we must observe the first white-eyed fly (success)
on either the first, second, or third fly.

FY(3) = P(Y ≤ 3) = P(Y = 1) + P(Y = 2) + P(Y = 3)
      = (1 − 0.25)^(1−1)(0.25) + (1 − 0.25)^(2−1)(0.25) + (1 − 0.25)^(3−1)(0.25)
      = 0.25 + 0.1875 + 0.140625 ≈ 0.578.
GEOMETRIC R CODE: Suppose that Y ∼ geom(p).

pY(y) = P(Y = y):  dgeom(y-1,p)
FY(y) = P(Y ≤ y):  pgeom(y-1,p)
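For instance, the answers in Example 3.5 can be verified with these functions (note the shift by 1, because R parameterizes the geometric distribution by the number of failures before the first success):

> dgeom(5-1,0.25)   ## part (a): P(Y = 5)
[1] 0.07910156
> pgeom(3-1,0.25)   ## part (b): P(Y <= 3)
[1] 0.578125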
3.4 Negative binomial distribution
Note: The negative binomial distribution also arises in experiments involving Bernoulli
trials:
1. Each trial results in a success or a failure.
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.
PMF: Let Y denote the number of trials needed to observe the rth success. Then Y has
a negative binomial distribution, written Y ∼ nib(r, p), with pmf

pY(y) = (y−1 choose r−1) p^r (1 − p)^(y−r),  y = r, r + 1, r + 2, ...,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ∼ nib(r, p), then

E(Y) = r/p
var(Y) = r(1 − p)/p².
Example 3.6. At an automotive paint plant, 15 percent of all batches sent to the lab
for chemical analysis do not conform to specifications. In this situation, we interpret

• batch = trial
Figure 3.6: PMF (left) and CDF (right) of Y ∼ nib(r = 3, p = 0.15) in Example 3.6.
• batch does not conform = success
• p = P(success) = P(not conforming) = 0.15.

If the Bernoulli trial assumptions hold (independent batches, same probability of
nonconforming for each batch), then

Y = the number of batches sent until the third nonconforming batch ∼ nib(r = 3, p = 0.15).
(a) What is the probability the third nonconforming batch is observed on the tenth batch
sent to the lab?

pY(10) = P(Y = 10) = (10−1 choose 3−1)(0.15)³(1 − 0.15)^(10−3)
       = (9 choose 2)(0.15)³(0.85)⁷ ≈ 0.039.
(b) What is the probability no more than two nonconforming batches will be observed
among the first 30 batches sent to the lab? Note: This means the third nonconforming
batch must be observed on the 31st batch tested, the 32nd, the 33rd, etc.

P(Y ≥ 31) = 1 − P(Y ≤ 30) = 1 − Σ(y=3 to 30) (y−1 choose 3−1)(0.15)³(0.85)^(y−3) ≈ 0.151.
NEGATIVE BINOMIAL R CODE: Suppose that Y ∼ nib(r, p).

pY(y) = P(Y = y):  dnbinom(y-r,r,p)
FY(y) = P(Y ≤ y):  pnbinom(y-r,r,p)
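For instance, the answers in Example 3.6 can be verified as follows (R parameterizes the negative binomial by the number of failures before the rth success, hence the y-r shift):

> dnbinom(10-3,3,0.15)     ## part (a): P(Y = 10), approximately 0.039
> 1-pnbinom(30-3,3,0.15)   ## part (b): P(Y >= 31), approximately 0.151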
3.5 Hypergeometric distribution
Setting: Consider a population of N objects and suppose that each object belongs to
one of two dichotomous classes: Class 1 and Class 2. For example, the objects (classes)
might be people (infected/not), parts (conforming/not), plots of land (respond to
treatment/not), etc. In the population of interest, we have

N = total number of objects
r = number of objects in Class 1
N − r = number of objects in Class 2.

Envision taking a sample of n objects from the population (objects are selected at random
and without replacement). Define

Y = the number of objects in Class 1 (out of the n selected).

We say that Y has a hypergeometric distribution. Notation: Y ∼ hyper(N, n, r).
PMF: If Y ∼ hyper(N, n, r), then

pY(y) = (r choose y)(N−r choose n−y)/(N choose n),  for y ≤ r and n − y ≤ N − r,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ∼ hyper(N, n, r), then

E(Y) = n(r/N)
var(Y) = n(r/N)((N − r)/N)((N − n)/(N − 1)).
Example 3.7. A supplier ships parts to a company in lots of 100 parts. The company
has an acceptance sampling plan which adopts the following acceptance rule:

"...sample 5 parts at random and without replacement. If there are no
defectives in the sample, accept the entire lot; otherwise, reject the entire lot."

The population size is N = 100. The sample size is n = 5. Define the random variable

Y = the number of defective parts in the sample ∼ hyper(N = 100, n = 5, r),

where r is the number of defective parts in the lot.

(a) If r = 10, what is the probability that the lot is accepted?

pY(0) = P(Y = 0) = (10 choose 0)(90 choose 5)/(100 choose 5)
      = 1(43,949,268)/75,287,520 ≈ 0.584.
(b) If r = 10, what is the probability that at least 3 of the 5 parts sampled are defective?

P(Y ≥ 3) = 1 − P(Y ≤ 2) = 1 − [P(Y = 0) + P(Y = 1) + P(Y = 2)]
         = 1 − [(10 choose 0)(90 choose 5) + (10 choose 1)(90 choose 4) + (10 choose 2)(90 choose 3)]/(100 choose 5)
         ≈ 1 − (0.584 + 0.339 + 0.070) = 0.007.
Figure 3.7: PMF (left) and CDF (right) of Y ∼ hyper(N = 100, n = 5, r = 10) in
Example 3.7.
HYPERGEOMETRIC R CODE: Suppose that Y ∼ hyper(N, n, r).

pY(y) = P(Y = y):  dhyper(y,r,N-r,n)
FY(y) = P(Y ≤ y):  phyper(y,r,N-r,n)
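For instance, the answers in Example 3.7 can be verified as follows:

> dhyper(0,10,90,5)     ## part (a): P(Y = 0), approximately 0.584
> 1-phyper(2,10,90,5)   ## part (b): P(Y >= 3), approximately 0.007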
3.6 Poisson distribution
PMF: The Poisson distribution is commonly used to model counts, e.g., the number of
events observed over time or space. If Y ∼ Poisson(λ), then the pmf of Y is

pY(y) = λ^y e^(−λ)/y!,  y = 0, 1, 2, ...,

and pY(y) = 0 otherwise.
MEAN/VARIANCE: If Y ∼ Poisson(λ), then

E(Y) = λ
var(Y) = λ.
Example 3.8. Let Y denote the number of times per month that a detectable amount
of radioactive gas is recorded at a nuclear power plant. Suppose that Y follows a Poisson
distribution with mean λ = 2.5 times per month.

(a) What is the probability that there are exactly three times a detectable amount of
gas is recorded in a given month?

P(Y = 3) = pY(3) = (2.5)³e^(−2.5)/3! = 15.625e^(−2.5)/6 ≈ 0.214.
Figure 3.8: PMF (left) and CDF (right) of Y ∼ Poisson(λ = 2.5) in Example 3.8.
(b) What is the probability that there are no more than four times a detectable
amount of gas is recorded in a given month?

P(Y ≤ 4) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) + P(Y = 4)
         = (2.5)⁰e^(−2.5)/0! + (2.5)¹e^(−2.5)/1! + (2.5)²e^(−2.5)/2! + (2.5)³e^(−2.5)/3! + (2.5)⁴e^(−2.5)/4!
         ≈ 0.891.
POISSON R CODE: Suppose that Y ∼ Poisson(λ).

pY(y) = P(Y = y):  dpois(y,λ)
FY(y) = P(Y ≤ y):  ppois(y,λ)
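For instance, the answers in Example 3.8 can be verified as follows:

> dpois(3,2.5)   ## part (a): P(Y = 3), approximately 0.214
> ppois(4,2.5)   ## part (b): P(Y <= 4), approximately 0.891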
CHAPTER 4
Continuous Distributions
4.1 Introduction
Recall: If Y is a continuous random variable with probability density function (pdf)
fY(y) and cumulative distribution function (cdf) FY(y), then

P(a ≤ Y ≤ b) = ∫(a to b) fY(y) dy = FY(b) − FY(a).

Result: If a is a specific value, then P(Y = a) = 0. In other words, in continuous
probability models, specific points are assigned zero probability. An immediate
consequence of this is that if Y is continuous,

P(a ≤ Y ≤ b) = P(a ≤ Y < b) = P(a < Y ≤ b) = P(a < Y < b)

and each is equal to

∫(a to b) fY(y) dy.
This is not true if Y has a discrete distribution, because positive probability is assigned
to specific values of Y.

Remark: Evaluating a pdf at a specific value a, that is, computing fY(a), does not give
you a probability. That is,

fY(a) ≠ a probability (of any type).

Compare this to calculating FY(a), which gives the cumulative probability P(Y ≤ a).
Example 4.1. Suppose that Y has the pdf

fY(y) = 3y², 0 < y < 1,

and fY(y) = 0 otherwise.

(a) Find the cumulative distribution function (cdf) of Y.

Solution. Before you do any calculations, first note that

• if y ≤ 0, then FY(y) = 0.
• if y ≥ 1, then FY(y) = 1.
For 0 < y < 1,

FY(y) = ∫(−∞ to y) fY(t) dt = ∫(0 to y) 3t² dt = t³ |(0 to y) = y³.

Summarizing,

FY(y) = 0,  y ≤ 0
      = y³, 0 < y < 1
      = 1,  y ≥ 1.

[Figure 4.1: PDF (left) and CDF (right) of Y in Example 4.1.]
Terminology: Let Y be a continuous random variable with pdf fY(y). The expected
value of Y is given by

μ = E(Y) = ∫(−∞ to ∞) y fY(y) dy.

The limits of the integral in this definition, while technically correct, will always be the
lower and upper limits corresponding to the nonzero part of the pdf.

Result: Let Y be a continuous random variable with pdf fY(y). Suppose that g is a
real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = ∫(−∞ to ∞) g(y) fY(y) dy.
Terminology: Let Y be a continuous random variable with pdf fY(y) and expected
value E(Y) = μ. The variance of Y is given by

σ² = var(Y) = E[(Y − μ)²] = ∫(−∞ to ∞) (y − μ)² fY(y) dy.

The standard deviation of Y is

σ = √var(Y).
Example 4.2. Suppose that Y, the diameter of a machined part (measured in mm), has
pdf

fY(y) = 20e^(−20(y−12.5)), y > 12.5,

and fY(y) = 0 otherwise. The cdf of Y is

FY(y) = 0, y ≤ 12.5
      = 1 − e^(−20(y−12.5)), y > 12.5.

[Figure 4.2: PDF (left) and CDF (right) of Y in Example 4.2.]

The pdf (left) and cdf (right) of Y are graphed in Figure 4.2. The expected value of Y
is given by

μ = E(Y) = ∫(12.5 to ∞) y fY(y) dy = ∫(12.5 to ∞) 20y e^(−20(y−12.5)) dy = 12.55.
The variance of Y is

σ² = var(Y) = ∫(12.5 to ∞) (y − μ)² fY(y) dy
            = ∫(12.5 to ∞) 20(y − 12.55)² e^(−20(y−12.5)) dy = 0.0025.
Exercise: In Example 4.2, what proportion of diameters will exceed 12.65 mm?
Terminology: Suppose Y is a continuous random variable and let 0 < p < 1. The pth
quantile of the distribution of Y, denoted by φp, solves

FY(φp) = P(Y ≤ φp) = ∫(−∞ to φp) fY(y) dy = p.

The median of Y is the p = 0.5 quantile. That is, the median φ0.5 solves

FY(φ0.5) = P(Y ≤ φ0.5) = 0.5.

In Example 4.2, we set

1 − e^(−20(φ0.5 − 12.5)) = 0.5

and solve for φ0.5. We obtain φ0.5 ≈ 12.535. Therefore, 50 percent of the diameters will
be less than 12.535 mm (of course, 50 percent will be greater than this value too).
4.2
Exponential distribution
PDF: A random variable Y has an exponential distribution with parameter λ > 0 if
its pdf is

fY(y) = λe^(−λy), y > 0,

and fY(y) = 0 otherwise.
Notation: Y ∼ exponential(λ). The exponential distribution is used to model the
distribution of positive quantities, e.g., lifetimes, waiting times, etc.

[Figure 4.3: Exponential pdfs with λ = 2, λ = 1, and λ = 1/2.]

MEAN/VARIANCE: If Y ∼ exponential(λ), then

E(Y) = 1/λ
var(Y) = 1/λ².
Figure 4.4: PDF (left) and CDF (right) of Y ∼ exponential(λ = 0.4) in Example 4.3.
CDF: If Y ∼ exponential(λ), then the cdf of Y exists in closed form and is given by

FY(y) = 0, y ≤ 0
      = 1 − e^(−λy), y > 0.
Example 4.3. Experience with fans used in diesel engines has suggested that the
exponential distribution provides a good model for time until failure (i.e., lifetime).
Suppose that the lifetime of a fan, denoted by Y (measured in 10,000s of hours), follows
an exponential distribution with λ = 0.4.

(a) What is the probability that a fan lasts longer than 30,000 hours?

Solution. Using the pdf of Y, we calculate

P(Y > 3) = ∫(3 to ∞) 0.4e^(−0.4y) dy = −e^(−0.4y) |(3 to ∞) = e^(−1.2) ≈ 0.301.
(b) What is the probability that a fan lasts between 20,000 and 50,000 hours?

Solution. Using the pdf of Y,

P(2 < Y < 5) = ∫(2 to 5) 0.4e^(−0.4y) dy = −e^(−0.4y) |(2 to 5)
             = e^(−0.4(2)) − e^(−0.4(5)) = e^(−0.8) − e^(−2) ≈ 0.314.

Using the cdf of Y,

P(2 < Y < 5) = FY(5) − FY(2) = [1 − e^(−0.4(5))] − [1 − e^(−0.4(2))]
             = e^(−0.8) − e^(−2) ≈ 0.314.
MEMORYLESS PROPERTY: Suppose that Y ∼ exponential(λ), and let r and s be
positive constants. Then

P(Y > r + s | Y > r) = P(Y > s).

If Y measures time (e.g., time to failure, etc.), then the memoryless property says that
the distribution of additional lifetime (s time units beyond time r) is the same as the
original distribution of the lifetime. In other words, the fact that Y has made it to
time r has been "forgotten." For example, in Example 4.3,

P(Y > 5 | Y > 2) = P(Y > 3) ≈ 0.301.
Example 4.4. Suppose the waiting time Y (in hours) until the first customer arrives
follows an exponential(λ = 12) distribution. To find the 0.9 quantile, we set

FY(φ0.9) = 1 − e^(−12 φ0.9) = 0.9

and solve for φ0.9. Doing so gives φ0.9 ≈ 0.192. This means that 90 percent of all
first-customer waiting times will be less than 0.192 hours (only 10 percent will exceed
this value).
EXPONENTIAL R CODE: Suppose that Y ∼ exponential(λ).

FY(y) = P(Y ≤ y):  pexp(y,λ)
φp:  qexp(p,λ)
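For instance, the calculations in Example 4.3 can be reproduced as follows:

> 1-pexp(3,0.4)             ## part (a): P(Y > 3), approximately 0.301
> pexp(5,0.4)-pexp(2,0.4)   ## part (b): P(2 < Y < 5), approximately 0.314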
4.3 Gamma distribution

Terminology: The gamma function is defined as

Γ(α) = ∫(0 to ∞) u^(α−1) e^(−u) du,

for all α > 0. The gamma function satisfies the recursive relationship

Γ(α) = (α − 1)Γ(α − 1),

for α > 1. Therefore, if α is a positive integer, then

Γ(α) = (α − 1)!.
PDF: A random variable Y has a gamma distribution with parameters α > 0 and
λ > 0 if its pdf is

fY(y) = [λ^α/Γ(α)] y^(α−1) e^(−λy), y > 0,

and fY(y) = 0 otherwise.

Notation: Y ∼ gamma(α, λ).
• By changing the values of α and λ, the gamma pdf can assume many shapes; see
examples in Figure 4.5.
• This makes the gamma distribution popular for modeling positive random variables
(it is more flexible than the exponential).
• Note that when α = 1, the gamma pdf reduces to the exponential(λ) pdf.
MEAN/VARIANCE: If Y ∼ gamma(α, λ), then

E(Y) = α/λ
var(Y) = α/λ².
[Figure 4.5: Gamma pdfs with (α = 1.5, λ = 0.6), (α = 2.0, λ = 0.5), and (α = 2.5, λ = 0.4).]
CDF: The cdf of a gamma random variable does not exist in closed form. Therefore,
probabilities involving gamma random variables and gamma quantiles must be computed
numerically (e.g., using R, etc.).

GAMMA R CODE: Suppose that Y ∼ gamma(α, λ).

FY(y) = P(Y ≤ y):  pgamma(y,α,λ)
φp:  qgamma(p,α,λ)
Example 4.5. When a certain transistor is subjected to an accelerated life test, the
lifetime Y (in weeks) is modeled by a gamma distribution with α = 4 and λ = 1/6.
Figure 4.6: PDF (left) and CDF (right) of Y ∼ gamma(α = 4, λ = 1/6) in Example 4.5.
(a) Find the probability that a transistor will last at least 50 weeks.

P(Y ≥ 50) = 1 − P(Y < 50) = 1 − FY(50)
          = 1-pgamma(50,4,1/6)
          = 0.03377340.
(b) Find the probability that a transistor will last between 12 and 24 weeks.

P(12 ≤ Y ≤ 24) = FY(24) − FY(12)
               = pgamma(24,4,1/6)-pgamma(12,4,1/6)
               = 0.4236533.
(c) Twenty percent of the transistor lifetimes will be below which time? Note: I am
asking for the 0.20 quantile (20th percentile) of the lifetime distribution.

> qgamma(0.2,4,1/6)
[1] 13.78072

Therefore, 20 percent of the transistors will fail before 13.78 weeks.
4.4 Normal distribution

PDF: A random variable Y has a normal distribution with mean μ and variance σ²
if its pdf is

fY(y) = [1/(σ√(2π))] e^(−(y−μ)²/(2σ²)), −∞ < y < ∞.

Notation: Y ∼ N(μ, σ²).
Remark: The normal distribution serves as a very good model for a wide range of
measurements; e.g., reaction times, fill amounts, part dimensions, weights/heights,
measures of intelligence/test scores, economic indicators, etc.

CDF: The cdf of a normal random variable does not exist in closed form. Probabilities
involving normal random variables and normal quantiles can be computed numerically
(e.g., using R, etc.). There are also antiquated methods of calculating normal
probabilities/quantiles using probability tables (which we will avoid like the plague).
[Figure 4.7: Normal pdfs for several combinations of μ and σ.]

NORMAL R CODE: Suppose that Y ∼ N(μ, σ²).

FY(y) = P(Y ≤ y):  pnorm(y,μ,σ)
φp:  qnorm(p,μ,σ)
Example 4.6. The time it takes for a driver to react to the brake lights on a decelerating
vehicle is critical in helping to avoid rear-end collisions. For a population of drivers (e.g.,
drivers in SC, etc.), suppose that

Y = the reaction time to brake during in-traffic driving (measured in seconds)

follows a normal distribution with mean μ = 1.5 and variance σ² = 0.16.
Figure 4.8: PDF (left) and CDF (right) of Y ∼ N(μ = 1.5, σ² = 0.16) in Example 4.6.
(a) What is the probability that reaction time is less than 1 second?
P (Y < 1) = FY (1)
= pnorm(1,1.5,sqrt(0.16))
= 0.1056498.
(b) What is the probability that reaction time is between 1.1 and 2.5 seconds?

P(1.1 < Y < 2.5) = FY(2.5) − FY(1.1)
                 = pnorm(2.5,1.5,sqrt(0.16))-pnorm(1.1,1.5,sqrt(0.16))
                 = 0.835135.
(c) Five percent of all reaction times will exceed which time? Note: I am asking for
the 0.95 quantile (95th percentile) of the reaction time distribution.
> qnorm(0.95,1.5,sqrt(0.16))
[1] 2.157941
Therefore, the slowest 5 percent of the population will have reaction times larger than
2.16 seconds.
Result: If Y ∼ N(μ, σ²), then

Z = (Y − μ)/σ ∼ N(0, 1).

The result says that Z follows a standard normal distribution; i.e., Z ∼ N(0, 1). In this
context, Z is called the standardized value of Y. For example,

• if the standardized value of y is z = 1.5, this means that y is 1.5 standard deviations
above the mean μ.
• if the standardized value of y is z = −1.5, this means that y is 1.5 standard
deviations below the mean μ.
CHAPTER 5
Reliability and Lifetime Distributions
Terminology: Reliability analysis deals with failure time (i.e., lifetime, time-to-event)
data. For example,
T = time from start of product service until failure
T = time until a warranty claim
T = number of hours in use/cycles until failure.
We call T a lifetime random variable if it measures the time to an event; e.g.,
failure, death, eradication of some infection/condition, etc. Engineers are often involved
with reliability studies, because reliability is strongly related to product quality.
Note: There are many well-known lifetime distributions, including

• exponential
• Weibull
• others: gamma, lognormal, inverse Gaussian, Gompertz-Makeham, Birnbaum-Saunders,
extreme value, log-logistic, etc.
5.1 Weibull distribution

PDF: A random variable T has a Weibull distribution with parameters β > 0 and
η > 0 if its pdf is

fT(t) = (β/η)(t/η)^(β−1) e^(−(t/η)^β), t > 0,

and fT(t) = 0 otherwise.

Notation: T ∼ Weibull(β, η). We call

β = shape parameter
η = scale parameter.
[Figure 5.1: Weibull pdfs with (β = 2, η = 5), (β = 2, η = 10), and (β = 3, η = 10).]

• As you can see in Figure 5.1, by changing the values of β and η, the Weibull pdf
can assume many shapes. Because of this flexibility (and for other reasons), the
Weibull distribution is very popular among engineers in reliability applications.
• When β = 1, the Weibull pdf reduces to the exponential(λ = 1/η) pdf.
MEAN/VARIANCE: If T ∼ Weibull(β, η), then

E(T) = η Γ(1 + 1/β)
var(T) = η² {Γ(1 + 2/β) − [Γ(1 + 1/β)]²}.
Figure 5.2: PDF (left) and CDF (right) of T ∼ Weibull(β = 2, η = 10) in Example 5.1.
CDF: Suppose that T ∼ Weibull(β, η). The cdf of T exists in closed form and is given
by

FT(t) = 0, t ≤ 0
      = 1 − e^(−(t/η)^β), t > 0.
Example 5.1. The lifetime of a rechargeable battery under constant usage conditions,
denoted by T (measured in hours), follows a Weibull distribution with parameters β = 2
and η = 10.

(a) What is the mean time to failure?

E(T) = 10 Γ(1 + 1/2) = 10 Γ(3/2) ≈ 8.862 hours.

(b) What is the probability that a battery is still functional at time t = 20?

P(T ≥ 20) = 1 − P(T < 20) = 1 − FT(20) = 1 − [1 − e^(−(20/10)²)] ≈ 0.018.
(c) What is the probability that a battery is still functional at time t = 20 given that
the battery is functional at time t = 10?

Solution. This is a conditional probability. We are given that the battery has survived
to at least 10 hours.

P(T ≥ 20 | T ≥ 10) = P(T ≥ 20 and T ≥ 10)/P(T ≥ 10) = P(T ≥ 20)/P(T ≥ 10)
                   = [1 − FT(20)]/[1 − FT(10)]
                   = e^(−(20/10)²)/e^(−(10/10)²) ≈ 0.050.

Therefore, the Weibull distribution does not satisfy the memoryless property.
(d) What is the 99th percentile of this lifetime distribution? We set

FT(φ0.99) = 1 − e^(−(φ0.99/10)²) = 0.99

and solve. Solving for φ0.99 gives φ0.99 ≈ 21.460 hours. Only one percent of the battery
lifetimes will exceed this value.
WEIBULL R CODE: Suppose that T ∼ Weibull(β, η).

FT(t) = P(T ≤ t):  pweibull(t,β,η)
φp:  qweibull(p,β,η)
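For instance, the answers in Example 5.1 can be verified as follows:

> 1-pweibull(20,2,10)   ## part (b): P(T >= 20), approximately 0.018
> qweibull(0.99,2,10)   ## part (d): the 0.99 quantile, approximately 21.460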
5.2 Reliability functions
Description: We now describe some different, but equivalent, ways of defining the
distribution of a (continuous) lifetime random variable T.

• The cumulative distribution function (cdf)

  FT(t) = P(T ≤ t).

  This can be interpreted as the proportion of units that have failed by time t.

• The survivor function

  ST(t) = P(T > t) = 1 − FT(t).

  This can be interpreted as the proportion of units that have not failed by time t;
  e.g., the unit is still functioning, a warranty claim has not been made, etc.

• The probability density function (pdf)

  fT(t) = (d/dt) FT(t) = −(d/dt) ST(t).

  Also, recall that

  FT(t) = ∫(0 to t) fT(u) du  and  ST(t) = ∫(t to ∞) fT(u) du.
• The hazard function

  hT(t) = lim(ε→0) P(t ≤ T < t + ε | T ≥ t)/ε.

  This can be interpreted as the instantaneous rate of failure at time t, among units
  that are still functioning at time t.
Distributions with increasing hazard functions are seen in units where some kind
of aging or wear out takes place. The population gets weaker over time.
Distributions with decreasing hazard functions correspond to the population getting
stronger over time. For example, certain types of units (e.g., electronic devices,
etc.) may display a decreasing hazard function, at least in the early stages.
In some populations, the hazard function decreases initially, stays constant for a
period of time, and then increases. This corresponds to a population whose units
get stronger initially (defective individuals die out early), exhibit random failures
for a period of time (constant hazard), and then eventually the population weakens
(e.g., due to old age, etc.). These hazard functions are bathtub-shaped.
Result: The hazard function can be calculated if we know the pdf fT(t) and the survivor
function ST(t). To see why, note that

hT(t) = lim(ε→0) (1/ε) P(t ≤ T < t + ε | T ≥ t)
      = lim(ε→0) (1/ε) P(t ≤ T < t + ε)/P(T ≥ t)
      = [1/P(T ≥ t)] lim(ε→0) [FT(t + ε) − FT(t)]/ε
      = fT(t)/ST(t),

because the last limit is the derivative (d/dt)FT(t), which equals fT(t).
We can therefore describe the distribution of T by using either fT (t), FT (t), ST (t), or
hT (t). If we know one of these functions, we can retrieve the other three.
Example 5.2. In this example, we find the hazard function for T ∼ Weibull(β, η).
Recall that the pdf of T is

fT(t) = (β/η)(t/η)^(β−1) e^(−(t/η)^β), t > 0,

and fT(t) = 0 otherwise. The cdf of T is

FT(t) = 0, t ≤ 0
      = 1 − e^(−(t/η)^β), t > 0.
Therefore, the survivor function is

ST(t) = 1, t ≤ 0
      = e^(−(t/η)^β), t > 0,

and the hazard function is

hT(t) = fT(t)/ST(t) = [(β/η)(t/η)^(β−1) e^(−(t/η)^β)]/e^(−(t/η)^β) = (β/η)(t/η)^(β−1).

Figure 5.3: Weibull hazard functions with η = 1. Upper left: β = 3. Upper right:
β = 1.5. Lower left: β = 1. Lower right: β = 0.5.
Interpretation: Plots of Weibull hazard functions are given in Figure 5.3. It is easy to
show that for a Weibull distribution:

• hT(t) is increasing if β > 1 (wear out; population gets weaker)
• hT(t) is constant if β = 1 (random failures; exponential distribution)
• hT(t) is decreasing if β < 1 (infant mortality; population gets stronger).
5.3 Maximum likelihood estimation
Example 5.3. The data below are times, denoted by T (measured in months), to the
first failure for 20 electric carts used for internal delivery and transportation in a large
manufacturing facility.

 3.9   4.2   5.4   6.5   7.0   8.8   9.2  11.4  14.3  15.1
18.0  19.0  19.0  23.9  24.8  26.0  34.2
Idea: The likelihood function is formed by multiplying the pdf contributions of the
observed data and viewing the product as a function of β and η:

L(β, η) = ∏(i=1 to 20) fT(ti) = ∏(i=1 to 20) (β/η)(ti/η)^(β−1) e^(−(ti/η)^β)
        = (β/η)^20 [∏(i=1 to 20) (ti/η)]^(β−1) e^(−Σ(i=1 to 20) (ti/η)^β),

where t1, t2, ..., t20 are the 20 times above. The values of β and η that most closely
agree with the data are the values of β and η that maximize L(β, η).
Definition: Let β̂ and η̂ denote the values of β and η, respectively, that maximize
L(β, η). We call β̂ and η̂ the maximum likelihood estimates of β and η.
CHAPTER 5
scale
1.9889187
16.9359807
(0.3504024)
(2.0081518)
For the cart data, the maximum likelihood estimates of β and η are

β̂ ≈ 1.99 and η̂ ≈ 16.94.

The β̂ ≈ 1.99 estimate suggests that there is wear out taking place among the carts;
that is, the population of carts gets weaker as time passes (see the estimated hazard
function in Figure 5.4).
Note: I have plotted the estimated pdf, the estimated cdf, the estimated survivor
function, and the estimated hazard function in Figure 5.4. We use the term "estimated"
because these functions are constructed using the estimates β̂ ≈ 1.99 and η̂ ≈ 16.94.
(a) Using the estimated Weibull(β̂ ≈ 1.99, η̂ ≈ 16.94) distribution as a model for future
cart lifetimes, find the probability that a cart will survive past 20 months.

P(T > 20) = 1 − P(T ≤ 20) = 1 − FT(20) = 1 − [1 − e^(−(20/16.94)^1.99)] ≈ 0.249.

(b) Use the estimated distribution to find the 90th percentile of the cart lifetimes.

FT(φ0.90) = 1 − e^(−(φ0.90/16.94)^1.99) = 0.90.

Solving for φ0.90 gives φ0.90 ≈ 25.75 months. Only ten percent of the cart lifetimes will
exceed this value.
PAGE 64
0.0
0.00
0.2
0.01
0.4
0.02
CDF
0.6
0.03
0.8
0.04
1.0
0.05
CHAPTER 5
10
20
30
40
50
10
20
40
50
30
40
50
0.20
HAZARD
0.0
0.00
0.05
0.2
0.10
0.15
0.6
0.4
SURVIVOR
0.25
0.8
0.30
1.0
0.35
30
10
20
30
40
50
10
20
t
Figure 5.4: Cart data in Example 5.3. Estimated Weibull functions with b = 1.99 and
b 16.94. Upper left: PDF. Upper right: CDF. Lower left: Survivor function. Lower
right: Hazard function.
PAGE 65
Note: Here is the R code for answering the questions in parts (a) and (b):
> 1-pweibull(20,1.99,16.94) ## Part (a)
[1] 0.248679
> qweibull(0.9,1.99,16.94) ## Part (b)
[1] 25.75914
5.4 Quantile-quantile plots
Terminology: A quantile-quantile plot (qq plot) is a graphical display that can help
assess the appropriateness of a model (distribution). Here is how the plot is constructed:

• On the vertical axis, we plot the observed data, ordered from low to high.
• On the horizontal axis, we plot the (ordered) theoretical quantiles from the
distribution (model) assumed for the observed data.

Important: When you interpret qq plots, you are looking for general agreement.
The observed data will never line up perfectly with the model's quantiles (due to natural
variability). In other words, don't be too picky when interpreting these plots, especially
with small sample sizes (like n = 20).

Cart data: I constructed the Weibull qq plot for the cart lifetime data in Example 5.3;
see Figure 5.5.
Figure 5.5: Cart data: Weibull qq plot. The observed data are plotted versus the
theoretical quantiles from a Weibull distribution with β̂ ≈ 1.99 and η̂ ≈ 16.94.
• There is general agreement between the observed data and the quantiles from the
Weibull distribution. This suggests that the Weibull model is reasonable for the
cart data.
• The straight line is formed from the 25th and 75th percentiles of the observed data
and the assumed model (here, a Weibull model with β̂ ≈ 1.99 and η̂ ≈ 16.94).
• The bands about the line can be used to get an idea of the variability allowed in
the plot. If all of the data fall within the bands, then there is no reason to suspect
the model. They also help detect outlier observations (i.e., observations that are
grossly inconsistent with the assumed model).
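Although qq plots are usually drawn with a packaged function, the construction described above is easy to do by hand. A minimal sketch, assuming the failure times are stored in cart.data as before (the plotting positions (i − 0.5)/n are one common convention):

> n = length(cart.data)
> pp = (1:n - 0.5)/n                 ## plotting positions
> theo = qweibull(pp, 1.99, 16.94)   ## estimated Weibull quantiles
> plot(theo, sort(cart.data),
+      xlab="Weibull quantiles", ylab="Cart data")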
CHAPTER 6
Statistical Inference
6.1 Populations and samples
Overview: We now focus on statistical inference. This deals with making (probabilistic)
statements about a population of individuals based on information that is contained
in a sample taken from the population.

Example 6.1. Suppose we are studying the performance of lithium batteries used in
a certain calculator. We would like to learn about the lifetime of these batteries so
we can place a limited warranty on them in the future. Because this type of battery
has not been used in this calculator before, no one (except the Oracle) can tell us the
distribution of Y, the battery's lifetime. In fact, not only is the distribution not known,
but all parameters which index this distribution aren't known either.

Terminology: A population refers to the entire group of individuals (e.g., parts,
people, batteries, etc.) about which we would like to make a statement (e.g., proportion
defective, median IQ score, mean lifetime, etc.).

• It is generally accepted that the entire population cannot be measured. It is too
large and/or it would be too time consuming to do so.
• To draw inferences (make statements) about a population, we therefore observe a
sample of individuals from the population.
• We will assume that the sample of individuals constitutes a random sample.
Mathematically, this means that all observations are independent and follow the
same probability distribution. Informally, this means that each sample (of the same
size) has the same chance of being selected.
• Taking a random sample of individuals is our best hope of obtaining individuals
that are representative of the entire population.
Example 6.1 (continued). Here are the battery lifetime data (measured in hours) for a
random sample of n = 50 batteries:

4285  2066  2584  1009   318  1429   981  1402  1137   414
 564   604    14  4152   737   852  1560  1786   520   396
1278   209   349   478  3032  1461   701  1406   261    83
 205   602  3770   726  3894  2662   497    35  2778  1379
3920  1379    99   510   582   308  3367    99   373   454
In Figure 6.1, we display a histogram and boxplot of the battery lifetime data. We see
that the (empirical) distribution of the battery lifetimes is skewed towards the high side.
Which continuous probability distribution seems to display the same type of pattern
that we see in the histogram?

• An exponential(λ) model seems reasonable here (based on the histogram shape).
What is λ?
• In this example, λ is called a (population) parameter. It describes the distribution
which is used to model the entire population of batteries.
• In general, (population) parameters which index probability distributions (like the
exponential) are unknown.
Figure 6.1: Histogram (left) and boxplot (right) of the battery lifetime data (measured
in hours) in Example 6.1.
All the probability distributions we have discussed so far are meant to describe
population-level behavior.
6.2 Parameters and statistics
Connection: All of the probability distributions that we talked about in Chapters 3-5
were indexed by population (model) parameters.
For example,

• the N(μ, σ²) distribution is indexed by two parameters, the population mean μ and
the population variance σ².
• the Poisson(λ) distribution is indexed by one parameter, the population mean λ.
• the Weibull(β, η) distribution is indexed by two parameters, the shape parameter
β and the scale parameter η.
• the b(n, p) distribution is indexed by one parameter, the population proportion of
successes p.

Terminology: Suppose that Y1, Y2, ..., Yn is a random sample. The sample mean is

Ȳ = (1/n) Σ(i=1 to n) Yi.

The sample variance is

S² = [1/(n − 1)] Σ(i=1 to n) (Yi − Ȳ)².

The sample standard deviation is the positive square root of the sample variance;
i.e.,

S = √S² = √{[1/(n − 1)] Σ(i=1 to n) (Yi − Ȳ)²}.
Important: Unlike their population analogues (which are unknown), these quantities
can be computed from the sample Y1 , Y2 , ..., Yn .
Terminology: A statistic is a numerical quantity that can be calculated from a sample
of data. Some very common examples are:
Ȳ = sample mean
S² = sample variance
p̂ = sample proportion.
For example, with the battery lifetime data (a random sample of n = 50 lifetimes),

ȳ = 1274.14 hours
s² = 1505156 (hours)²
s ≈ 1226.85 hours.

> mean(battery) ## sample mean
[1] 1274.14
> var(battery) ## sample variance
[1] 1505156
> sd(battery) ## sample standard deviation
[1] 1226.848

• ȳ = 1274.14 is an estimate of the population mean μ.
• s² = 1505156 is an estimate of the population variance σ².
• s = 1226.848 is an estimate of the population standard deviation σ.
Summary: The table below succinctly summarizes the differences between a population
and a sample (a parameter and a statistic):

Group of individuals     Numerical quantity     Status
Population               Parameter              Unknown
Sample (Observed)        Statistic              Calculated from the data
6.3 Point estimators and standard errors
Remark: To keep our discussion as general as possible (as the material in this subsection
can be applied to many situations), we will let θ denote a population parameter.

• For example, θ could denote a population mean, a population variance, a population
proportion, a Weibull or gamma model parameter, etc. It could also denote a
parameter in a regression context (Chapters 10-12).

Terminology: A point estimator θ̂ is a statistic that is used to estimate a population
parameter θ. Common examples of point estimators are:

Ȳ = a point estimator for μ
S² = a point estimator for σ²
S = a point estimator for σ.
Terminology: The standard error of a point estimator θ̂ is

se(θ̂) = √var(θ̂).

In other words, the standard error is equal to the standard deviation of the sampling
distribution of θ̂. An estimator's standard error measures the amount of variability in
the point estimator θ̂. Therefore,

smaller se(θ̂) ⇒ θ̂ more precise.
Illustration: In Example 5.3 (last chapter), we fit a Weibull model to the cart data:
> fitdistr(cart.data,densfun="weibull")
      shape        scale
  1.9889187    16.9359807
 (0.3504024)  (2.0081518)
• The values β̂ ≈ 1.99 and η̂ ≈ 16.94 are point estimates of β and η, respectively.
• The values underneath each (i.e., 0.35 and 2.01) are the associated standard errors.
• Point estimates and standard errors are used to construct confidence intervals
(which we will start in the next chapter).
6.4 Sampling distribution of Ȳ
Result: If Y1, Y2, ..., Yn is a random sample from a population with mean μ and
variance σ², then the sample mean Ȳ has E(Ȳ) = μ, var(Ȳ) = σ²/n, and

se(Ȳ) = √var(Ȳ) = √(σ²/n) = σ/√n.
Example 6.2. In Example 4.6, we examined the distribution of

Y = time (in seconds) to react to brake lights during in-traffic driving.

We assumed that Y ∼ N(μ = 1.5, σ² = 0.16). We call this the population distribution,
because it describes the distribution of values of Y for all individuals in the population
(here, in-traffic drivers).

(a) Suppose that we take a random sample of n = 5 drivers from the population with
times Y1, Y2, ..., Y5. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 5, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ∼ N(μ, σ²/n) ⇒ Ȳ ∼ N(1.5, 0.032).
[Figure 6.2: The N(1.5, 0.16) population distribution and the sampling distributions of
Ȳ for n = 5 and n = 25.]
This distribution describes the values of Ȳ we would expect to see in repeated sampling,
that is, if we repeatedly sampled n = 5 individuals from this population of in-traffic
drivers and calculated the sample mean Ȳ each time.

(b) Suppose that we take a random sample of n = 25 drivers from the population with
times Y1, Y2, ..., Y25. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 25, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ∼ N(μ, σ²/n) ⇒ Ȳ ∼ N(1.5, 0.0064).

The sampling distribution of Ȳ when n = 5 and when n = 25 is shown in Figure 6.2.
6.5 The Central Limit Theorem
Recall: If Y1, Y2, ..., Yn is a random sample from a N(μ, σ²) population distribution,
then

Ȳ ∼ N(μ, σ²/n).

The Central Limit Theorem (Result 2) says that even if the population distribution
is not normal (Gaussian), the sampling distribution of the sample mean Ȳ will be
approximately normal (Gaussian) when the sample size is sufficiently large.
Example 6.3. The time to death for rats injected with a toxic substance, denoted by
Y (measured in days), follows an exponential distribution with λ = 1/5. That is,

Y ∼ exponential(λ = 1/5).

This is the population distribution. It describes the time to death for all rats in the
population.

• In Figure 6.3, I have shown the exponential(λ = 1/5) population distribution (solid
curve). I have also depicted the theoretical sampling distributions of Ȳ when n = 5
and when n = 25.
• Notice how the sampling distribution of Ȳ begins to (albeit distantly) resemble a
normal distribution when n = 5. When n = 25, the sampling distribution of Ȳ
looks very much to be normal (Gaussian).
Figure 6.3: Rat death times. Population distribution: Y ∼ exponential(λ = 1/5). Also
depicted are the sampling distributions of Ȳ when n = 5 and n = 25.
This is a consequence of the CLT. The larger the sample size n, the better a normal
(Gaussian) distribution represents the sampling distribution of Ȳ.
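The CLT is also easy to see by simulation. Here is a minimal sketch for the rat-death population in Example 6.3 (object names are illustrative, not from the notes):

> ybar = replicate(10000, mean(rexp(25, 1/5)))   ## 10,000 sample means, n = 25
> hist(ybar)   ## approximately normal, centered near E(Y) = 5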
Example 6.4. When a batch of a chemical product is prepared, the amount of an
impurity in the batch (measured in grams) is a random variable Y with

μ = 4.0 g
σ² = (1.5 g)².

Suppose that n = 50 batches are prepared (independently). What is the probability that
the sample mean impurity amount Ȳ will be greater than 4.2 grams?
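By the CLT, Ȳ is approximately N(4.0, (1.5)²/50), so the probability can be computed directly. A quick sketch in R (this reconstructs the calculation; it is not the notes' original worked solution):

> 1-pnorm(4.2, 4.0, 1.5/sqrt(50))   ## P(Ybar > 4.2), approximately 0.173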
6.6 The t distribution
Recall: If Y1, Y2, ..., Yn is a random sample from a N(μ, σ²) distribution, then

Z = (Ȳ − μ)/(σ/√n) ∼ N(0, 1).

If we replace the population standard deviation σ with the sample standard deviation S,
we get a new sampling distribution:

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1),

a t distribution with n − 1 degrees of freedom.
Figure 6.4: Probability density functions of N(0, 1), t(2), and t(10).
Remark: The t pdf formula is complicated and is unnecessary for our purposes. R will
compute probabilities and quantiles from the t distribution.

t R CODE: Suppose that T ∼ t(ν).

FT(t) = P(T ≤ t):  pt(t,ν)
φp:  qt(p,ν)
Example 6.5. Hollow pipes are to be used in an electrical wiring project. In testing
1-inch pipes, the data below were collected by a design engineer. The data are
measurements of Y, the outside diameter of this type of pipe (measured in inches). These
n = 25 pipes were randomly selected and measured, all in the same location.
1.296 1.320 1.311 1.298 1.315
1.305 1.278 1.294 1.311 1.290
1.284 1.287 1.289 1.292 1.301
1.298 1.287 1.302 1.304 1.301
1.313 1.315 1.306 1.289 1.291
The manufacturers of this pipe claim that the population distribution is normal (Gaussian) and that the mean outside diameter is μ = 1.29 inches. Under this assumption
(which may or may not be true), calculate the value of

t = (ȳ − μ)/(s/√n).
Solution. We use R to find the sample mean ȳ and the sample standard deviation s:
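(A sketch, assuming the 25 diameters are stored in a vector named pipes; the name is illustrative.)

> mean(pipes)  ## sample mean, approximately 1.299
> sd(pipes)    ## sample standard deviation, approximately 0.011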
Figure 6.5: t(24) probability density function. An × at t = 4.096 has been added.
We compute

t = (1.299 − 1.29)/(0.011/√25) ≈ 4.096.
Analysis. If the manufacturers' claim is true (that is, if μ = 1.29 inches), then

t = (Ȳ − 1.29)/(S/√n)

should come from a t(24) distribution. The t(24) pdf is displayed above in Figure 6.5. I
placed an × at the value t = 4.096.
Discussion: Does t = 4.096 seem like a value you would expect to see from this distribution? Recall that t was computed under the assumption that μ = 1.29 inches (the
manufacturers' claim). This value of t looks to be more consistent with a value of μ that
is larger than 1.29 inches.
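A quick R computation shows just how far out in the tail this value is:

> 1 - pt(4.096, 24)  ## P(T >= 4.096) when T ~ t(24); approximately 0.0002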
6.7 Normal quantile-quantile plots
Recall: Result 3 says that if Y1, Y2, ..., Yn is a random sample from a N(μ, σ²) distribution, then

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1).

Question: What happens if the population distribution is not normal?
Answer: The t distribution result still approximately holds, even if the underlying
population distribution is not perfectly normal. The approximation is best when
the sample size is larger
the population distribution is more symmetric (not highly skewed).
Because normality (for the population distribution) is not absolutely critical for the t
sampling distribution, we say that this sampling distribution is robust to the normality
assumption.
Note: Robustness is a nice property. Here, it assures us that the underlying assumption of normality is not an absolute requirement for Result 3 to hold. Other sampling
distribution results (coming up) are not always robust to normality departures.
Terminology: Just as we used Weibull qq plots to assess the Weibull model assumption
in the last chapter, we can use a normal quantile-quantile (qq) plot to assess the
normal distribution assumption. The plot is constructed as follows:
On the vertical axis, we plot the observed data, ordered from low to high.
On the horizontal axis, we plot the (ordered) theoretical quantiles from the distribution (model) assumed for the observed data (here, normal).
Figure 6.6: Pipe diameter data in Example 6.5. Normal qq plot. The observed data are
plotted versus the theoretical quantiles from a standard normal distribution. The line
added passes through the first and third quartiles.
In the qqPlot function, the default is to take the assumed model to be the N(0, 1)
distribution (i.e., a normal distribution with mean 0 and variance 1).

Therefore, we are really comparing the standardized values of the observed data
(here, the pipe diameter data) to the N(0, 1) quantiles.

Linearity in the plot supports the normal assumption. Departures from linearity
refute it.
Pipe diameter data: Figure 6.6 shows the normal qq plot for the pipe diameter data
in Example 6.5. The plot does not set off any serious alarms.
CHAPTER 7
One-Sample Inference
Preview: In this chapter, we discuss one-sample inference procedures for three population parameters:

A population mean μ (Section 7.1)
A population variance σ² (Section 7.2)
A population proportion p (Section 7.3).
Remember that these are population-level quantities, so they are unknown. Our goal is
to use sample information to estimate these quantities.
Relevance: To begin our discussion, suppose that we would like to estimate a population mean μ. To do so, suppose we have a random sample Y1, Y2, ..., Yn from a
population distribution (e.g., normal, exponential, Weibull, Poisson, etc.). Regardless of
what the population distribution is, we know that Ȳ is an unbiased estimator for μ,
that is,

E(Ȳ) = μ.
However, reporting Ȳ alone does not acknowledge that there is variability attached to
this estimator. For example, in Example 6.5 (pp 81), with the n = 25 measured pipes,
reporting

ȳ ≈ 1.299 in

as an estimate of the population mean μ does not account for the fact that

the 25 pipes measured were drawn randomly from a population of all pipes, and
different samples would give different sets of pipes (and different values of ȳ).

In other words, using Ȳ alone ignores important information; namely, how variable
the population of pipes is.
Remedy: To address this problem, we therefore pursue the topic of interval estimation (also known as confidence intervals). The main difference between a point
estimate (like ȳ ≈ 1.299) and an interval estimate is that

a point estimate is a one-shot guess at the value of the parameter; this ignores
the variability in the estimate.
an interval estimate (i.e., confidence interval) is an interval of values. It is
formed by taking the point estimate and then adjusting it downwards and upwards to account for the point estimate's variability. The end result is an interval
estimate.
7.1 Confidence interval for a population mean μ

Recall: We start our discussion by revisiting Result 3 in the last chapter (pp 79). Recall
that if Y1, Y2, ..., Yn is a random sample from a N(μ, σ²) distribution, then the quantity

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1),

the t distribution with n − 1 degrees of freedom.
Figure 7.1: A t pdf with n − 1 degrees of freedom. The upper α/2 and lower α/2 areas are
shaded. The associated quantiles, represented in the figure by dark circles, are denoted
by t_{n−1,α/2} (upper) and −t_{n−1,α/2} (lower), respectively.
We call

(Ȳ − t_{n−1,α/2} S/√n, Ȳ + t_{n−1,α/2} S/√n)

a 100(1 − α) percent confidence interval for the population mean μ. This is written
more succinctly as

Ȳ ± t_{n−1,α/2} S/√n.
Discussion: Before we do an example, let's discuss relevant issues about this confidence
interval.

Note the form of the interval:

point estimate ± quantile × standard error,

where the point estimate is Ȳ, the quantile is t_{n−1,α/2}, and the standard error is S/√n.
Many confidence intervals we will study follow this same general form.
Here is how we interpret this interval: We say

"We are 100(1 − α) percent confident that the population mean μ is in
this interval."

Unfortunately, the word "confident" does not mean "probability."

The term "confidence" means that if we were able to sample from the population over and over again, each time computing a 100(1 − α) percent confidence
interval for μ, then 100(1 − α) percent of the intervals we would compute would
contain the population mean μ.

In other words, "confidence" refers to long term behavior of many intervals; not
probability for a single interval. Because of this, we call 100(1 − α) the confidence
level. Typical confidence levels are

90 percent (α = 0.10)
95 percent (α = 0.05)
99 percent (α = 0.01).
The length of the 100(1 − α) percent confidence interval

Ȳ ± t_{n−1,α/2} S/√n

is equal to

2 t_{n−1,α/2} S/√n.
Therefore, other things being equal,

the larger the sample size n, the smaller the interval length.
the smaller the population variance σ², the smaller the interval length. Recall
that S² is an unbiased estimator for σ².
the larger the confidence level 100(1 − α), the larger the interval length.

Remark: Clearly, shorter confidence intervals are preferred. They are more informative. Lower confidence levels will produce shorter intervals; however, you pay a
price. You have less confidence that your interval contains μ.
Example 7.1. Acute exposure to cadmium produces respiratory distress and kidney and
liver damage (and possibly death). For this reason, the level of airborne cadmium dust
and cadmium oxide fume in the air, denoted by Y (measured in milligrams of cadmium
per m³ of air), is closely monitored. A random sample of n = 35 measurements from a
large factory is given below.
0.044 0.030 0.052 0.044 0.046 0.020 0.066
0.052 0.049 0.030 0.040 0.045 0.039 0.039
0.039 0.057 0.050 0.056 0.061 0.042 0.055
0.037 0.062 0.062 0.070 0.061 0.061 0.058
0.053 0.060 0.047 0.051 0.054 0.042 0.051
Based on past experience, engineers assume a normal population distribution (for the
population of all cadmium measurements). Based on the data above, find a 99 percent
confidence interval for μ, the population mean level of airborne cadmium.

Solution. The interval is

Ȳ ± t_{n−1,α/2} S/√n.
We can use R to calculate the sample mean ȳ and the sample standard deviation s:
> mean(cadmium) ## sample mean
[1] 0.04928571
> sd(cadmium) ## sample standard deviation
[1] 0.0110894
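The 99 percent interval can then be computed by hand; with α = 0.01, qt(0.995,34) is the needed t quantile:

> mean(cadmium) + c(-1,1)*qt(0.995,34)*sd(cadmium)/sqrt(35)
[1] 0.04417147 0.05439996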
Note: It is possible to implement the t interval procedure entirely in R using the t.test
function:
> t.test(cadmium,conf.level=0.99)$conf.int
[1] 0.04417147 0.05439996
For the confidence interval for the population mean μ to be meaningful, the random
sample assumption must be satisfied. However, recall from the last chapter that the t
sampling distribution result

t = (Ȳ − μ)/(S/√n) ∼ t(n − 1)

does still hold approximately even if the underlying population distribution is not perfectly normal. Therefore, the confidence interval (which was derived from this sampling
distribution) is also robust to normality departures.
This means that even if the population distribution is mildly non-normal, the
confidence interval formula

Ȳ ± t_{n−1,α/2} S/√n

can still be used to estimate the population mean μ.

However, if there is strong evidence that the population distribution is grossly
non-normal, then you should exercise caution in using this confidence interval,
especially when the sample size n is small.
Figure 7.2: Normal qq plot for the cadmium data in Example 7.1. The observed data
are plotted versus the theoretical quantiles from a normal distribution. The line added
passes through the first and third theoretical quartiles.
Recall that you can use qq plots to check the normality assumption.

Cadmium data: The qq plot for the cadmium data in Figure 7.2 does not reveal any
serious departures from the normality assumption. We can feel comfortable reporting

(0.044, 0.054) mg/m³

as a 99 percent confidence interval for the population mean cadmium level μ.
Remark: As we have just seen, statistical inference procedures are derived from specific
assumptions. Going forward, it is important to know what these assumptions are, how
critical they are, and how to check them.
7.2 Confidence interval for a population variance σ²
Relevance: In many situations, we are concerned not with the mean of a population,
but with the variance σ² instead. If the population variance σ² is excessively large, this
could point to a potential problem with a manufacturing process, for example, where
there is too much variation in the measurements produced. Elsewhere,

in a laboratory setting, engineers might wish to estimate the variance σ² attached
to a measurement system (e.g., scale, caliper, etc.).
in field trials, agronomists are often interested in comparing the variability levels
for different cultivars or genetically-altered varieties.
in clinical trials, physicians are often concerned if there are substantial differences
in the variation levels of patient responses at different clinic sites.
Result: If Y1, Y2, ..., Yn is a random sample from a N(μ, σ²) distribution, then the quantity

Q = (n − 1)S²/σ² ∼ χ²(n − 1),

a χ² distribution with n − 1 degrees of freedom.
[Figure 7.3: χ² pdfs with ν = 5, 10, and 20 degrees of freedom.]

χ² R CODE: Suppose that Q ∼ χ²(ν). Then

F_Q(q) = P(Q ≤ q) is computed by pchisq(q,ν)
the pth quantile is computed by qchisq(p,ν).
Figure 7.4: A χ² pdf with n − 1 degrees of freedom. The upper α/2 and lower α/2
areas are shaded. The associated quantiles, represented in the figure by dark circles, are
denoted by χ²_{n−1,1−α/2} (upper) and χ²_{n−1,α/2} (lower), respectively.
Note that

1 − α = P(χ²_{n−1,α/2} < (n − 1)S²/σ² < χ²_{n−1,1−α/2})
      = P((n − 1)S²/χ²_{n−1,α/2} > σ² > (n − 1)S²/χ²_{n−1,1−α/2})
      = P((n − 1)S²/χ²_{n−1,1−α/2} < σ² < (n − 1)S²/χ²_{n−1,α/2}).

This argument shows that

((n − 1)S²/χ²_{n−1,1−α/2}, (n − 1)S²/χ²_{n−1,α/2})

is a 100(1 − α) percent confidence interval for the population variance σ².
Note: A 100(1 − α) percent confidence interval for the population standard deviation σ
arises from simply taking the square root of the endpoints of the χ² interval.
That is,

(√[(n − 1)S²/χ²_{n−1,1−α/2}], √[(n − 1)S²/χ²_{n−1,α/2}])

is a 100(1 − α) percent confidence interval for σ.
Example 7.2. Industrial engineers at IKEA observed a random sample of n = 36 rivet-head screws used in the Billy Bookcase system. The observed diameters of the top of the
screws (measured in cm) are given below:
1.206 1.190 1.200 1.195 1.201 1.200
1.198 1.196 1.195 1.202 1.203 1.210
1.206 1.193 1.207 1.201 1.199 1.200
1.199 1.204 1.194 1.203 1.194 1.199
1.203 1.200 1.197 1.208 1.199 1.205
1.199 1.204 1.202 1.196 1.211 1.204
The IKEA manufacturing specifications dictate that the population standard deviation
of the diameters for these screws should be no larger than σ = 0.003 cm. Otherwise, there is too
much variability in the screws (which could lead to difficulty in construction and hence
customer dissatisfaction). Based on the data above, find a 95 percent confidence interval
for the population standard deviation σ.
Solution. We first calculate a 95 percent confidence interval for the population variance
σ² using

((n − 1)S²/χ²_{n−1,1−α/2}, (n − 1)S²/χ²_{n−1,α/2}).
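Note: The var.interval function called below is not part of base R. A sketch of a definition consistent with how it is used here (it mirrors the ratio.var.interval function written later in these notes):

var.interval = function(data,conf.level=0.95){
  df = length(data)-1                      ## n - 1 degrees of freedom
  chi.lower = qchisq((1-conf.level)/2,df)  ## lower chi-square quantile
  chi.upper = qchisq((1+conf.level)/2,df)  ## upper chi-square quantile
  s2 = var(data)                           ## sample variance
  c(df*s2/chi.upper,df*s2/chi.lower)       ## (lower, upper) endpoints for sigma^2
}

> var.interval(diameters)  ## 95 percent interval for sigma^2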
A 95 percent confidence interval for the population standard deviation σ (which is what
we originally wanted) is calculated here:
> sd.interval = sqrt(var.interval(diameters))
> sd.interval
[1] 0.003931399 0.006322751
Interpretation: We are 95 percent confident that the population standard deviation σ
for the screw diameters is between 0.0039 and 0.0063 cm. This interval suggests that
the population standard deviation is larger than 0.003 cm, which indicates that there is
excessive variability in the diameters of the screws.
Assumptions: The confidence interval

((n − 1)S²/χ²_{n−1,1−α/2}, (n − 1)S²/χ²_{n−1,α/2})

for the population variance σ² was created based on the assumption that Y1, Y2, ..., Yn
is a random sample from a normal population; this is what guarantees the sampling
distribution

(n − 1)S²/σ² ∼ χ²(n − 1).
Figure 7.5: Normal qq plot for IKEA screw diameter data in Example 7.2. The observed
data are plotted versus the theoretical quantiles from a normal distribution. The line
added passes through the first and third theoretical quartiles.
In the presence of non-normality, these confidence intervals may give results which
are misleading (and hence potentially dangerous).

Therefore, it is very important to check the normality assumption when you construct a confidence interval for a population variance σ² (or for a population standard deviation σ).
Screw diameter data: Fortunately, the qq plot for the IKEA screw diameter data in
Figure 7.5 shows that there is no cause for concern. The normality assumption for these
data is not in doubt.
7.3 Confidence interval for a population proportion p
Situation: We now switch gears and focus on a new population-level parameter: the
population proportion p. This parameter is relevant when the characteristic we measure on each individual is binary (i.e., only 2 outcomes possible). Here are some examples:
p = proportion of defective circuit boards
p = proportion of customers who are satisfied
p = proportion of payments received on time
p = proportion of HIV positives in SC.
To start our discussion, we need to recall the Bernoulli trial assumptions for each
individual in the sample:
1. each individual results in a success or a failure,
2. the individuals are independent, and
3. the probability of success p is the same for every individual.
In our examples above,

success = circuit board defective
success = customer satisfied
success = payment received on time
success = HIV positive individual.
Recall: If the individual success/failure statuses in the sample adhere to the Bernoulli
trial assumptions, then
Y = the number of successes out of n sampled individuals
follows a binomial distribution, that is, Y ∼ b(n, p). The statistical problem at hand is
to use the information in Y to estimate p.
Point estimator: A natural point estimator of p is

p̂ = Y/n,

the sample proportion. This statistic is simply the proportion of successes in the
sample (out of n individuals).
Properties: Mathematical arguments can be used to show the following results:

E(p̂) = p and se(p̂) = √[p(1 − p)/n].

The first result says that the sample proportion p̂ is an unbiased estimator of the
population proportion p. The second (standard error) result quantifies the precision of p̂
as an estimator of p.
Result: Knowing the sampling distribution of p̂ is critical if we are going to formalize
statistical inference procedures for p. In this situation, we appeal to an approximate
result (conferred by the Central Limit Theorem) which says that

p̂ ∼ AN(p, p(1 − p)/n),

when the sample size n is large. Standardizing p̂, we get

Z = (p̂ − p)/√[p(1 − p)/n] ∼ AN(0, 1).
Figure 7.6: The N(0, 1) pdf. The upper α/2 and lower α/2 areas are shaded. The
associated quantiles, represented in the figure by dark circles, are denoted by z_{α/2} (upper)
and −z_{α/2} (lower), respectively.
Therefore, for large n,

1 − α ≈ P(−z_{α/2} < (p̂ − p)/√[p̂(1 − p̂)/n] < z_{α/2})
      = P(−z_{α/2}√[p̂(1 − p̂)/n] < p̂ − p < z_{α/2}√[p̂(1 − p̂)/n])
      = P(z_{α/2}√[p̂(1 − p̂)/n] > p − p̂ > −z_{α/2}√[p̂(1 − p̂)/n])
      = P(p̂ + z_{α/2}√[p̂(1 − p̂)/n] > p > p̂ − z_{α/2}√[p̂(1 − p̂)/n])
      = P(p̂ − z_{α/2}√[p̂(1 − p̂)/n] < p < p̂ + z_{α/2}√[p̂(1 − p̂)/n]).
We call

(p̂ − z_{α/2}√[p̂(1 − p̂)/n], p̂ + z_{α/2}√[p̂(1 − p̂)/n])

an approximate 100(1 − α) percent confidence interval for the population proportion p,
written more succinctly as

p̂ ± z_{α/2}√[p̂(1 − p̂)/n].

Note the form of the interval: the point estimate is p̂, the quantile is z_{α/2}, and the
standard error is √[p̂(1 − p̂)/n].
Example 7.3. One source of water pollution is gasoline leakage from underground
storage tanks. In Pennsylvania, a random sample of n = 74 gasoline stations is selected
from the state and the tanks are inspected; 10 stations are found to have at least one
leaking tank. Calculate a 95 percent confidence interval for p, the population proportion
of gasoline stations with at least one leaking tank.
Solution. In this situation, we interpret
gasoline station = individual trial
at least one leaking tank = success
p = population proportion of stations with at least one leaking tank.
For 95 percent condence, we need z0.05/2 = z0.025 1.96.
The sample proportion is

p̂ = 10/74 ≈ 0.135,

and the 95 percent confidence interval is

0.135 ± 1.96√[0.135(1 − 0.135)/74] ⟹ (0.057, 0.213).
Interpretation: We are 95 percent confident that the population proportion of stations
in Pennsylvania with at least one leaking tank is between 0.057 and 0.213.
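For reference, the interval can be verified with a quick R computation (a sketch):

> phat = 10/74
> phat + c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/74)  ## approximately (0.057, 0.213)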
CLT approximation check: We have

np̂ = 74(10/74) = 10 and n(1 − p̂) = 74(1 − 10/74) = 64.

Both of these are larger than 5 ⟹ we can feel comfortable in using this confidence
interval formula.
7.4 Sample size determination
Of course, collecting data almost always costs money. Therefore, one must be cognizant
not only of the statistical issues associated with sample size determination, but also
of the practical issues like cost, time spent in data collection, personnel training, etc.
7.4.1 Inference for a population mean μ

Setting: Suppose we would like to write a 100(1 − α) percent confidence interval for a
population mean μ. If the population standard deviation σ is known, the interval is

Ȳ ± z_{α/2} σ/√n.

Denote the margin of error by

B = z_{α/2} σ/√n.

This is the quantity that is added/subtracted to the point estimate Ȳ to form the confidence interval for the population mean μ.
Formula: In the setting described above, it is possible to determine the sample size n
necessary once we specify these three pieces of information:

the value of σ² (e.g., an educated guess at its value; e.g., from historical data, etc.)
the confidence level, 100(1 − α)
the margin of error, B.

This is true because

B = z_{α/2} σ/√n ⟺ n = (z_{α/2} σ/B)².

This is the necessary sample size to guarantee a prescribed level of confidence 100(1 − α)
and margin of error B.
Example 7.4. In a biomedical experiment, we would like to estimate the population
mean remaining life of healthy rats that are given a certain dose of a toxic substance.
Suppose that we would like to write a 95 percent confidence interval for μ with a margin
of error equal to B = 2 days. From past studies, remaining rat lifetimes have been
approximated by a normal distribution with standard deviation = 8 days. How many
rats should we use for the experiment?
Solution. With z_{0.05/2} = z_{0.025} ≈ 1.96, B = 2, and σ = 8, the desired sample size to
estimate μ is

n = (z_{α/2} σ/B)² = (1.96 × 8/2)² ≈ 61.46.

We round up and use n = 62 rats.

Note: If we instead wanted a 90 percent confidence interval (z_{0.05} ≈ 1.645) with margin
of error B = 3 days, the required sample size would drop to

n = (z_{α/2} σ/B)² = (1.645 × 8/3)² ≈ 19.24,

that is, n = 20 rats.
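These calculations are easy to reproduce in R (a sketch; ceiling rounds up to the next whole rat):

> ceiling((qnorm(0.975)*8/2)^2)  ## 95 percent, B = 2: n = 62
> ceiling((qnorm(0.95)*8/3)^2)   ## 90 percent, B = 3: n = 20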
7.4.2 Inference for a population proportion p
Setting: Suppose we would like to write a 100(1 − α) percent confidence interval for p,
a population proportion. We know that

p̂ ± z_{α/2}√[p̂(1 − p̂)/n]

is an approximate 100(1 − α) percent confidence interval for p. What sample size n should
we use?
Note: To determine the necessary sample size n, we first need to specify two pieces of
information:

the confidence level 100(1 − α)
the margin of error:

B = z_{α/2}√[p̂(1 − p̂)/n].

A small problem arises. Note that the margin of error B depends on p̂. Unfortunately,
p̂ can only be calculated once we know the sample size n. We overcome this problem by
replacing p̂ with p₀, an a priori guess at its value. The last expression becomes

B = z_{α/2}√[p₀(1 − p₀)/n].
Solving this equation for n, we get

n = (z_{α/2}/B)² p₀(1 − p₀).

This is the desired sample size n to write a 100(1 − α) percent confidence interval for the
population proportion p with a prescribed margin of error (roughly) equal to B. I say
"roughly," because there may be additional uncertainty arising from our use of p₀ (our
best guess).
Remark: If there is no sensible guess for p available, use p₀ = 0.5. In this situation, the
resulting value for n will be as large as possible. Put another way, using p₀ = 0.5 gives
the most conservative solution (i.e., the largest sample size, n). This is true because

n = n(p₀) = (z_{α/2}/B)² p₀(1 − p₀),

viewed as a function of p₀, is maximized when p₀ = 0.5.
For example, with 95 percent confidence (z_{0.025} ≈ 1.96) and margin of error B = 0.02,
the required sample sizes for guesses p₀ = 0.05, 0.10, and 0.50 are

n = (1.96/0.02)² × 0.05(1 − 0.05) ≈ 457
n = (1.96/0.02)² × 0.10(1 − 0.10) ≈ 865
n = (1.96/0.02)² × 0.50(1 − 0.50) ≈ 2401.
As we can see, the guessed value of p₀ has a substantial impact on the final sample
size calculation. Furthermore, it may not be practical to sample 2,401 parts, as would
be required by the most conservative approach.
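A sketch of this computation in R:

> p0 = c(0.05,0.10,0.50)
> ceiling((qnorm(0.975)/0.02)^2*p0*(1-p0))  ## 457 865 2401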
CHAPTER 8
Two-Sample Inference
Preview: In this chapter, we discuss two-sample inference procedures for the following
population parameters:

The difference of two population means μ1 − μ2 (Section 8.1)
The ratio of two population variances σ2²/σ1² (Section 8.2)
The difference of two population proportions p1 − p2 (Section 8.3).

Remember that these are population-level quantities (now involving two different populations), so they are unknown. Our goal is to use sample information (now with two
samples) to estimate these quantities.
Usefulness: In practice, it is very common to compare the same characteristic (e.g.,
mean, variance, proportion, etc.) on two different populations. For example, we may
wish to compare

the population mean starting salaries of male and female engineers (compare μ1
and μ2). Is there evidence that males have a larger mean starting salary?
the population variance of sound levels from two indoor swimming pool designs
(compare σ1² and σ2²). Are the sound-level acoustics of a new design more variable
than the standard design?
the population proportion of defectives produced from two different suppliers
(compare p1 and p2). Are there differences between the two suppliers?
Note: Our methods in the last chapter are applicable only for a single population
(i.e., a population mean μ, a population variance σ², and a population proportion p).
We therefore extend these methods to two populations. We start with comparing two
population means.
8.1 Confidence interval for the difference of two population means μ1 − μ2

Setting: Suppose we have two independent random samples, Sample 1: Y11, Y12, ..., Y1n1
and Sample 2: Y21, Y22, ..., Y2n2. Define the summary statistics

Ȳ1+ = (1/n1) Σj Y1j = sample mean for sample 1
Ȳ2+ = (1/n2) Σj Y2j = sample mean for sample 2
S1² = [1/(n1 − 1)] Σj (Y1j − Ȳ1+)² = sample variance for sample 1
S2² = [1/(n2 − 1)] Σj (Y2j − Ȳ2+)² = sample variance for sample 2.
Goal: Our goal is to construct a 100(1 − α) percent confidence interval for the difference
of two population means μ1 − μ2.

Important: How we construct this interval depends on our assumptions on the population variances σ1² and σ2². In particular, we consider two cases:

σ1² = σ2²; that is, the two population variances are equal
σ1² ≠ σ2²; that is, the two population variances are not equal.
8.1.1 Equal variance case: σ1² = σ2²

Result: When σ1² = σ2²,

t = [(Ȳ1+ − Ȳ2+) − (μ1 − μ2)] / √[Sp²(1/n1 + 1/n2)] ∼ t(n1 + n2 − 2),
where

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2)

is the pooled sample variance.
Figure 8.2: A t pdf with n1 + n2 − 2 degrees of freedom. The upper α/2 and lower α/2
areas are shaded. The associated quantiles, represented in the figure by dark circles, are
denoted by t_{n1+n2−2,α/2} (upper) and −t_{n1+n2−2,α/2} (lower), respectively.
Therefore,

P(−t_{n1+n2−2,α/2} < [(Ȳ1+ − Ȳ2+) − (μ1 − μ2)] / √[Sp²(1/n1 + 1/n2)] < t_{n1+n2−2,α/2}) = 1 − α.

This probability equation is seen by examining Figure 8.2.
Rearranging the event inside the probability statement, we obtain

(Ȳ1+ − Ȳ2+) ± t_{n1+n2−2,α/2} √[Sp²(1/n1 + 1/n2)].

This is a 100(1 − α) percent confidence interval for the difference of two population means μ1 − μ2.
We see that the interval again has the same form:

point estimate ± quantile × standard error,

where the point estimate is Ȳ1+ − Ȳ2+, the quantile is t_{n1+n2−2,α/2}, and the standard
error is √[Sp²(1/n1 + 1/n2)].
Example 8.1. In the vicinity of a nuclear power plant, environmental engineers from
the EPA would like to determine if there is a difference between the population mean
weight in fish (of the same species) from two locations. Independent samples are taken
from each location and the following weights (in ounces) are observed:
Location 1: 21.9 18.5 12.3 16.7 21.0 15.1 18.2 23.0
Location 2: 21.0 19.6 14.4 16.9 23.4 14.6 10.4 16.5 36.8 26.6
Construct a 90 percent confidence interval for the population mean weight difference
μ1 − μ2, where the mean weight μ1 (μ2) corresponds to location 1 (2).
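The interval can be computed directly with t.test (a sketch; the vector names location.1 and location.2 are illustrative):

> t.test(location.1,location.2,var.equal=TRUE,conf.level=0.90)$conf.int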
[Figure 8.3: Boxplots of the fish weight data at the two locations.]

The 90 percent confidence interval is computed from

(Ȳ1+ − Ȳ2+) ± t_{n1+n2−2,α/2} √[Sp²(1/n1 + 1/n2)].
Figure 8.4: Quantile-quantile plots for the fish weight data in Example 8.1.
8.1.2 Unequal variance case: σ1² ≠ σ2²

When σ1² ≠ σ2², we can no longer use

t = [(Ȳ1+ − Ȳ2+) − (μ1 − μ2)] / √[Sp²(1/n1 + 1/n2)]

as a basis for inference, because the pooled sample variance Sp² does not estimate anything meaningful. Constructing a confidence interval for μ1 − μ2 becomes more difficult
theoretically. However, we can write an approximate confidence interval.
Formula: An approximate 100(1 − α) percent confidence interval for the difference of
two population means μ1 − μ2 is given by

(Ȳ1+ − Ȳ2+) ± t_{ν,α/2} √(S1²/n1 + S2²/n2),

where the degrees of freedom ν are estimated by Satterthwaite's formula

ν = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ].
Example 8.2. You are part of a recycling project that is examining how much paper is
being discarded (not recycled) by employees at two large plants. These data are obtained
on the amount of white paper thrown out per year by employees (data are in hundreds
of pounds). Samples of employees at each plant were randomly selected.
Plant 1: 3.01 2.58 3.04 1.75 2.87 2.57 2.51 2.93 2.85 3.09
         1.43 3.36 3.18 2.74 2.25 1.95 3.68 2.29 1.86 2.63
         2.83 2.04 2.23 1.92 3.02
Plant 2: 3.99 2.08 3.66 1.53 4.27 4.31 2.62 4.52 3.80 5.30
         3.41 0.82 3.03 1.95 6.45 1.86 1.87 3.98 2.74 4.81
Figure 8.5: Boxplots of discarded white paper amounts (in 100s lb) in Example 8.2.
Figure 8.6: Quantile-quantile plots for the white paper data in Example 8.2.
The qq plots support normality for each population, so we use the interval in this subsection
(unequal population variances). Advice: If you are unsure about which interval to use,
go with the unequal variance interval. The penalty for using it when σ1² = σ2² is much
smaller than the penalty for using the equal variance interval when σ1² ≠ σ2².
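In R, the unequal variance (Welch) interval is the default in t.test (a sketch; the vector names plant.1 and plant.2 are illustrative):

> t.test(plant.1,plant.2,conf.level=0.95)$conf.int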
8.1.3 Matched pairs designs

Example 8.3. Ergonomics experts hired by a large company designed a study to determine whether more varied work conditions would have any impact on arm movement.
The data in Table 8.1 were obtained on a random sample of n = 26 employees. Each
observation is the amount of time, expressed as a percentage of the total time observed,
during which arm elevation was below 30 degrees. This percentage is a surrogate
for the percentage of time spent on repetitive tasks. The two measurements from each
employee were obtained 18 months apart. During this 18-month period, work conditions
were changed by the ergonomics team, and subjects were allowed to engage in a wider
variety of work tasks.
Individual  Before  After     Individual  Before  After
 1           81.3    78.9      14          74.9    58.3
 2           87.2    91.4      15          75.8    62.5
 3           86.1    78.3      16          72.6    70.2
 4           82.2    78.3      17          80.8    58.7
 5           90.8    84.4      18          66.5    66.6
 6           86.9    67.4      19          72.2    60.7
 7           96.5    92.8      20          56.5    65.0
 8           73.0    69.9      21          82.4    73.7
 9           84.2    63.8      22          88.8    80.4
10           74.5    69.7      23          80.0    78.8
11           72.0    68.4      24          91.1    81.8
12           73.8    71.8      25          97.5    91.6
13           74.2    58.3      26          70.0    74.2
Table 8.1: Ergonomics data. Percentage of time arm elevation was less than 30 degrees.
Question: Does the population mean time (during which elevation is below 30 degrees)
decrease after the ergonomics team changes the working conditions?
Terminology: A matched-pairs design is an experimental design where one obtains
a pair of measurements on each individual (e.g., employee, material, machine, etc.):
one measurement corresponds to Treatment 1
the other measurement corresponds to Treatment 2
Clearly, the two samples are no longer independent. Each individual contributes a
response to both samples.
This type of design removes variation among the individuals. This allows you to compare
the two treatments (e.g., before/after working environment) under more homogeneous
conditions where only variation within individuals is present (that is, the variation arising
from the difference in the two treatments).
Table 8.2: Ergonomics example. Sources of variation in the two independent sample and
matched pairs designs.

Design                     Sources of Variation
Two independent samples    among employees, within employees
Matched pairs              within employees
Advantage: When you remove extra variability, this enables you to compare the two
experimental conditions (treatments) more precisely. This gives you a better chance of
identifying a difference between the treatment means if one really exists.

In a design with two independent samples, the extra variation among individuals
may prevent us from being able to identify this difference!
Individual  Before  After  Difference     Individual  Before  After  Difference
 1           81.3    78.9      2.4         14          74.9    58.3     16.6
 2           87.2    91.4     −4.2         15          75.8    62.5     13.3
 3           86.1    78.3      7.8         16          72.6    70.2      2.4
 4           82.2    78.3      3.9         17          80.8    58.7     22.1
 5           90.8    84.4      6.4         18          66.5    66.6     −0.1
 6           86.9    67.4     19.5         19          72.2    60.7     11.5
 7           96.5    92.8      3.7         20          56.5    65.0     −8.5
 8           73.0    69.9      3.1         21          82.4    73.7      8.7
 9           84.2    63.8     20.4         22          88.8    80.4      8.4
10           74.5    69.7      4.8         23          80.0    78.8      1.2
11           72.0    68.4      3.6         24          91.1    81.8      9.3
12           73.8    71.8      2.0         25          97.5    91.6      5.9
13           74.2    58.3     15.9         26          70.0    74.2     −4.2

Table 8.3: Ergonomics data. Percentage of time arm elevation was less than 30 degrees.
The data differences Dj = Y1j − Y2j have been added.
Result: The one-sample t interval applied to the differences,

D̄ ± t_{n−1,α/2} S_D/√n,

where D̄ and S_D are the sample mean and sample standard deviation of the differences,
respectively, is an interval estimate for

μ_D = μ1 − μ2.

The parameter μ_D = μ1 − μ2 describes the population mean difference for the two
treatment groups. If the two population means are the same, then μ_D = 0. Therefore,

If the confidence interval for μ_D includes 0, this does not suggest that the two
population means are different.
If the confidence interval for μ_D does not include 0, this suggests that the two
population means are different.
> t.test(diff,conf.level=0.95)$conf.int
[1] 3.636034 9.894735
Note that this is a one-sample confidence interval calculated using the data differences.
Interpretation: We are 95 percent confident that the population mean difference μ_D =
μ1 − μ2 is between 3.6 and 9.9 percent. This interval does not include 0 and contains
only positive values.

Therefore, we have evidence that the population mean percentage of time that arm
elevation is below 30 degrees is larger in the "before" condition than in the "after"
condition.

In other words, there is evidence that the change in work conditions implemented
by the ergonomics team (in the 18-month interim) did reduce this population mean
time.
Ergonomics data: A normal qq plot for the data differences is given in Figure 8.7.
There might be a mild departure from normality in the upper tail.

However, remember that one-sample t confidence intervals (for means) are generally
robust to these mild departures. Therefore, this slight departure likely does not
affect our conclusion.
Figure 8.7: Normal qq plot for the ergonomics data in Example 8.3. The observed data
differences are plotted versus the theoretical quantiles from a normal distribution. The
line added passes through the first and third theoretical quartiles.
8.2 Confidence interval for the ratio of two population variances σ2²/σ1²

Importance: Recall that when we wrote a confidence interval for μ1 − μ2, the difference
of the population means (with independent samples), we proposed two intervals:

one interval that assumed σ1² = σ2²
one interval that assumed σ1² ≠ σ2².

We now propose a confidence interval procedure that can be used to determine which
assumption is more appropriate.
PAGE 124
0.4
1=5, 2=10
1=5, 2=20
1=10, 2=20
0.0
0.2
0.6
0.8
CHAPTER 8
Setting: As in Section 8.1, suppose we have two independent random samples, Sample 1
from a N(μ1, σ1²) population and Sample 2 from a N(μ2, σ2²) population.

Goal: Our goal is to construct a 100(1 − α) percent confidence interval for the ratio of
population variances σ2²/σ1².

Result: Under the setting described above,

F = (S1²/σ1²)/(S2²/σ2²) ∼ F(n1 − 1, n2 − 1),

an F distribution with n1 − 1 (numerator) and n2 − 1 (denominator) degrees of freedom.
F R CODE: Suppose that F ∼ F(ν1, ν2). Then

P(F ≤ q) is computed by pf(q,ν1,ν2)
the pth quantile is computed by qf(p,ν1,ν2).

For example, with ν1 = ν2 = 10:

> qf(0.975,10,10)
[1] 3.716792
> qf(0.025,10,10)
[1] 0.2690492
Figure 8.9: An F pdf with n1 − 1 and n2 − 1 degrees of freedom. The upper α/2 and
lower α/2 areas are shaded. The associated quantiles, represented in the figure by dark
circles, are denoted by F_{n1−1,n2−1,1−α/2} (upper) and F_{n1−1,n2−1,α/2} (lower), respectively.
Example 8.4. Two automated filling processes are used in the production of automobile
paint. The target weight of each process is 128.0 fluid oz (1 gallon). There is little concern
about the process population mean fill amounts (no complaints about under/overfilling
on average). However, there is concern that the population variation levels between the
two processes are different. To test this claim, industrial engineers took independent
random samples of n1 = 24 and n2 = 24 gallons of paint and observed the fill amounts.
Process 1: 127.75 127.87 127.86 127.92 128.03 127.94 127.91 128.10
           128.01 128.11 127.79 127.93 127.89 127.96 127.80 127.94
           128.02 127.82 128.11 127.92 127.74 127.78 127.85 127.96
Process 2: 127.90 127.90 127.74 127.93 127.62 127.76 127.63 127.93
           127.86 127.73 127.82 127.84 128.06 127.88 127.85 127.60
           128.02 128.05 127.95 127.89 127.82 127.92 127.71 127.78
[Figure 8.10: Boxplots of the fill amounts (fluid ounces) for the two processes.]
Note: There is no internal function in R to calculate the confidence interval for the ratio
of two population variances σ2²/σ1², so I wrote one:
ratio.var.interval = function(data.1,data.2,conf.level=0.95){
  df.1 = length(data.1)-1                    ## n1 - 1 degrees of freedom
  df.2 = length(data.2)-1                    ## n2 - 1 degrees of freedom
  F.lower = qf((1-conf.level)/2,df.1,df.2)   ## lower F quantile
  F.upper = qf((1+conf.level)/2,df.1,df.2)   ## upper F quantile
  s2.1 = var(data.1)                         ## sample variance, sample 1
  s2.2 = var(data.2)                         ## sample variance, sample 2
  c((s2.2/s2.1)*F.lower,(s2.2/s2.1)*F.upper) ## (lower, upper) for sigma2^2/sigma1^2
}
Figure 8.11: Quantile-quantile plots for the paint fill volume data in Example 8.4.
> ratio.var.interval(process.1,process.2)
[1] 0.5885236 3.1448830
Interpretation: We are 95 percent confident that the ratio of the population variances
σ2²/σ1² is between 0.589 and 3.145. Because this interval includes 1, we do not have
evidence that the population variances σ1² and σ2² are different for the two processes.
Warning: Like the χ² interval for a single population variance σ², the two-sample F
interval for the ratio of two population variances σ2²/σ1² is not robust to normality
departures. This is true because the sampling distribution

F = (S1²/σ1²)/(S2²/σ2²) ∼ F(n1 − 1, n2 − 1)

depends critically on both populations being normal.
8.3 Confidence interval for the difference of two population proportions p1 − p2

Interest: We now extend our confidence interval procedure for a single population proportion p to two populations. Define

p1 = population proportion in Population 1
p2 = population proportion in Population 2.

For example, we might want to compare the proportion of

defective circuit boards for two different suppliers
satisfied customers before and after a product design change (e.g., Facebook, etc.)
on-time payments for two classes of customers
HIV positives for individuals in two demographic classes.
Point estimators: We assume that there are two independent random samples of individuals (one sample from each population to be compared). Define

Y1 = number of successes in Sample 1 ∼ b(n1, p1)
Y2 = number of successes in Sample 2 ∼ b(n2, p2).

The point estimators for p1 and p2 are the sample proportions, defined by

p̂1 = Y1/n1 and p̂2 = Y2/n2.
Result: When the sample sizes n1 and n2 are large,

Z = [(p̂1 − p̂2) − (p1 − p2)] / √[p1(1 − p1)/n1 + p2(1 − p2)/n2] ∼ AN(0, 1).
Proceeding as before, we obtain the approximate 100(1 − α) percent confidence interval for p1 − p2:

(p̂1 − p̂2) ± z_{α/2} √[p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2].

The form is again point estimate ± quantile × standard error, with point estimate p̂1 − p̂2,
quantile z_{α/2}, and standard error √[p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2].
Example 8.5. A large public health study was conducted to estimate the prevalence and
to identify risk factors of hepatitis B virus (HBV) infection among Irish prisoners. Two
independent samples of female (n1 = 82) and male (n2 = 555) prisoners were obtained
from five prisons in Ireland:

18 out of 82 female prisoners were HBV-positive
28 out of 555 male prisoners were HBV-positive.

Find a 95 percent confidence interval for p1 − p2, the difference in the population proportions for the two genders (Female = 1; Male = 2).
Analysis. There is no internal function in R to calculate the confidence interval for
the difference of two population proportions (at least not that I could find quickly), so I
wrote one:
proportion.diff.interval = function(y.1,n.1,y.2,n.2,conf.level=0.95){
  z.upper = qnorm((1+conf.level)/2)  ## z quantile
  var.1 = (y.1/n.1)*(1-y.1/n.1)/n.1  ## estimated variance of p1-hat
  var.2 = (y.2/n.2)*(1-y.2/n.2)/n.2  ## estimated variance of p2-hat
  se = sqrt(var.1+var.2)             ## standard error
  moe = z.upper*se                   ## margin of error
  c((y.1/n.1-y.2/n.2)-moe,(y.1/n.1-y.2/n.2)+moe)
}
> proportion.diff.interval(18,82,28,555)
[1] 0.07764115 0.26048234
Interpretation: We are 95 percent confident the difference of the population proportions
p1 − p2 is between 0.078 and 0.260. This interval does not contain 0 and contains only
positive values. This suggests that the population proportion of female prisoners who are
HBV positive is larger than the corresponding male population proportion. The sample
size conditions needed for the normal approximation are satisfied here.
CHAPTER 9
One-Way Analysis of Variance
9.1
Introduction
Recall: In the last chapter, we discussed confidence intervals for the difference of two
population means μ1 − μ2. Perhaps more importantly, we also saw that the design of
the experiment or study completely determined how the analysis should proceed.

When the two samples are independent, this is called a (two) independent-sample design.
When the two samples are obtained on the same individuals (so that the samples
are dependent), this is called a matched pairs design.

Confidence interval procedures for μ1 − μ2 depend on the design of the study.
Terminology: More generally, the purpose of an experiment is to investigate differences between or among two or more treatments. In a statistical framework, we do this
by comparing the population means of the responses to each treatment.

In order to detect treatment mean differences, we must try to control the effects
of error so that any variation we observe can be attributed to the effects of the
treatments rather than to structural differences among the individuals.

For example, in Example 8.2, there may be a systematic source of variation
arising from the ages of employees in the recycling project (e.g., younger employees
may be more inclined to recycle paper instead of discarding it).

Our two-independent sample design (one sample from Plant 1 and one sample from
Plant 2) did not consider this potential confounding effect. In other words, even
if age of the employee is a significant source of variability, our independent sample
analysis does not acknowledge it.
Example 9.1. Mortar mixes are usually classified on the basis of compressive strength
and their bonding properties and flexibility. In a building project, engineers wanted to
compare specifically the population mean strengths of four types of mortars:

1. ordinary cement mortar (OCM)
2. polymer impregnated mortar (PIM)
3. resin mortar (RM)
4. polymer cement mortar (PCM).

Random samples of specimens of each mortar type were taken; each specimen was subjected to a compression test to measure strength (MPa). Here are the strength measurements taken on different mortar specimens (36 in all).
OCM: 48.06 38.27 38.88 42.74 49.62
PIM: 52.79 64.87 53.27 51.24 55.87 61.76 67.15
RM:  48.95 62.41 52.11 60.45 58.07 52.16 61.71 61.06 57.63 56.80
PCM: 50.99 51.52 52.85 46.75 48.31
Figure 9.1: Boxplots of strength data (MPa) for four mortar types in Example 9.1.
Goal: We now develop a statistical inference procedure that allows us to test this
type of hypothesis in a one-way classification.
9.2
Overall F test
Notation: Suppose we have t treatment groups, with ni observations in the ith group
and N = n1 + n2 + ... + nt observations in all. For i = 1, 2, ..., t, define

Ȳi+ = (1/ni) Σj Yij
Si² = [1/(ni − 1)] Σj (Yij − Ȳi+)²
Ȳ++ = (1/N) Σi Σj Yij.

The statistics Ȳi+ and Si² denote the sample mean and the sample variance,
respectively, of the ith sample. The overall sample mean Ȳ++ is the sample
mean of all the data (aggregated across all t treatment groups).
Hypotheses: We test H0: μ1 = μ2 = ... = μt versus H1: the population means are not
all equal.

The null hypothesis H0 says that there is no treatment difference, that is, all
t population means are the same.
The alternative hypothesis H1 says that a difference among the t population
means exists somewhere. It does not specify how the means are different.
When performing a hypothesis test, we basically decide which hypothesis is more
supported by the data.
Within estimator: Define the residual sum of squares

SSres = Σi Σj (Yij − Ȳi+)² = Σi (ni − 1)Si².

The sample variance Si² estimates the population parameter σ² (which is assumed to
be common across all t populations) from within the ith sample.
The weighted average of these estimates

MSres = [(n1 − 1)/(N − t)] S1² + [(n2 − 1)/(N − t)] S2² + ... + [(nt − 1)/(N − t)] St² = SSres/(N − t)

is called the residual mean squares. It is an unbiased estimator of σ² regardless
of whether H0 or H1 is true.

The within estimator MSres is a generalization of the pooled sample variance
estimator Sp² we discussed in Section 8.1 with t = 2 populations.
Across estimator: We assume a common sample size n1 = n2 = ... = nt = n to
simplify notation (i.e., a balanced design).

Recall: From Result 1 in Chapter 6 (pp 75), we know that if a sample arises from a
normal population, the sample mean is also normally distributed. Therefore, the sample
mean of the ith sample satisfies

Ȳi+ ∼ N(μi, σ²/n).
The treatment (across) mean square is

MStrt = SStrt/(t − 1) = [n/(t − 1)] Σi (Ȳi+ − Ȳ++)²,

where SStrt = n Σi (Ȳi+ − Ȳ++)² is the treatment sum of squares (in the balanced case).
Summary: If you are terrified by the preceding derivation, that is fine. Just know the
following:

1. When H0 is true (i.e., the population means are equal), then

E(MStrt) = σ² and E(MSres) = σ².

These two facts suggest that when H0 is true,

F = MStrt/MSres ≈ 1.

2. When H1 is true (i.e., the population means are not all equal), MStrt tends to be
larger than MSres, so

F = MStrt/MSres > 1.

3. When H0 is true,

F = MStrt/MSres ∼ F(t − 1, N − t).
Figure 9.2: F(3, 32) pdf. This is the sampling distribution of F in Example 9.1 when H0
is true. An × at F = 16.848 has been added.
Analysis of Variance Table

             df      SS      MS  F value     Pr(>F)
mortar.type   3 1520.88  506.96   16.848  9.576e-07
Residuals    32  962.86   30.09
Conclusion: There is very strong evidence that at least one of the four mortar strength
population means is different. This value of F is not consistent with H0. It is much more
consistent with H1.
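Output like the ANOVA table above can be produced with a call of this form (a sketch; the vector names strength and mortar.type are illustrative):

> fit = aov(strength ~ mortar.type)
> summary(fit)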
Terminology: The general one-way ANOVA table has the following form:

Source      df      SS       MS                      F
Treatments  t − 1   SStrt    MStrt = SStrt/(t − 1)   F = MStrt/MSres
Residuals   N − t   SSres    MSres = SSres/(N − t)
Total       N − 1   SStotal
The sums of squares satisfy the partition

SStotal = Σi Σj (Yij − Ȳ++)² = Σi ni(Ȳi+ − Ȳ++)² + Σi Σj (Yij − Ȳi+)² = SStrt + SSres.
SStotal measures how observations vary about the overall mean, without regard to
treatment groups; that is, SStotal measures the total variation in all the data.
SStotal can be partitioned into two components:
SStrt measures how much of the total variation is due to the treatment groups.
SSres measures what is left over, which we attribute to inherent variation
among the individuals.
Degrees of freedom (df) add down.
Mean squares (MS) are formed by dividing sums of squares by the corresponding
degrees of freedom.
The ratio of the mean squares (MS) gives the F statistic.
Terminology: The probability value (p-value) for a hypothesis test measures how
much evidence we have against H0 . It is important to remember the following:
Mortar data: For the strength/mortar type data in Example 9.1 (from the R output),
we see that
p-value ≈ 0.0000009576.
This is obviously extremely small which suggests that we have an enormous amount
of evidence against H0 .
In this example, the p-value is calculated as the area to the right of F = 16.848 on
the F (3, 32) probability density function. See Figure 9.2 and it is easy to see why
this is so small.
The p-value is a probability. For the mortar data, the p-value is interpreted as
follows:
If H0 is true, the probability we should get a test statistic equal to or
larger than F = 16.848 is 0.0000009576.
Because this event (i.e., getting an F ≥ 16.848) is extremely unlikely, this suggests
strongly that H0 is not true.
In other words, this very small p-value (which comes from a very large F statistic)
is more consistent with H1 .
P-value Rules: Probability values are used in more general hypothesis test settings in
statistics (not just in one-way classication).
Q: How small does a p-value have to get before we reject H0 in favor of H1 ?
A: Unfortunately, there is no right answer to this question. What is commonly done is
the following.
First choose a significance level α that is small. This represents the probability
that we will reject a true H0, that is,

α = P(Reject H0 | H0 true).
Common values of α chosen beforehand are α = 0.10, α = 0.05 (the most common),
and α = 0.01.

The smaller α is chosen to be, the more evidence one requires to reject H0.
This is a true statement because of the following well-known decision rule:

p-value < α ⟹ reject H0.

Therefore, the value of α chosen by the experimenter (you!) determines how small
the p-value must get before H0 is ultimately rejected.

For the strength/mortar type data, there is no ambiguity in our decision. For other
situations (e.g., p-value = 0.063), the decision may not be as clear cut.
9.3 Follow-up analysis: Tukey pairwise confidence intervals
Goal: If the overall F test rejects H0, we would like to know which population means
are different. With t = 4 mortar types, there are six pairwise differences to estimate:

μ1 − μ2,  μ1 − μ3,  μ1 − μ4,  μ2 − μ3,  μ2 − μ4,  μ3 − μ4,

where

μ1 = population mean strength for mortar type OCM
μ2 = population mean strength for mortar type PIM
μ3 = population mean strength for mortar type RM
μ4 = population mean strength for mortar type PCM.
Bonferroni's inequality says that for events A1, A2, ..., AJ,

P(∩_{j=1}^{J} Aj) ≥ Σ_{j=1}^{J} P(Aj) − (J − 1).

To see how this inequality can be used in our current discussion, define the event

Aj = {jth confidence interval includes its population mean difference},

for j = 1, 2, ..., J. The event ∩_{j=1}^{J} Aj occurs when all J intervals simultaneously
include their respective population mean differences.
In this light, consider the following table, which contains a lower bound on how
small this probability can be (for different values of t and J). This table assumes
that each pairwise interval has been constructed at the nominal 1 − α = 0.95 level.

# of treatments t   # of intervals J   Lower bound
3                   3                  3(0.95) − 2 = 0.85
4                   6                  6(0.95) − 5 = 0.70
5                   10                 10(0.95) − 9 = 0.50
6                   15                 15(0.95) − 14 = 0.25
10                  45                 45(0.95) − 44 = −1.25 (!!)
Tukey's method: Simultaneous 100(1 − α) percent confidence intervals for all pairwise
differences μi − μi′ are given by

(Ȳi+ − Ȳi′+) ± q_{t,N−t,α} √[MSres (1/ni + 1/ni′)],

where q_{t,N−t,α} is the Tukey quantile that guarantees a family-wise confidence level
of 100(1 − α) percent.

Mortar data: We use R to construct the Tukey confidence intervals. The family-wise
confidence level is 95 percent:
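The output below comes from TukeyHSD (a sketch of the call, assuming the aov object is stored in fit, as in the earlier sketch):

> TukeyHSD(fit,conf.level=0.95)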
$mortar.type
             diff       lwr       upr     p adj
PCM-OCM   2.48000 -4.950955  9.910955 0.8026758
PIM-OCM  15.21575
RM-OCM   12.99875
PIM-PCM  12.73575
RM-PCM   10.51875
RM-PIM   -2.21700 -8.863448  4.429448 0.8029266
Note: In the R output, the columns labeled lwr and upr give, respectively, the lower
and upper limits of the pairwise confidence intervals.

PCM-OCM: We are (at least) 95 percent confident that the difference in the population
mean strengths for the PCM and OCM mortars is between −4.95 and 9.91 MPa.

This confidence interval includes 0, so we cannot conclude these two population means are different.
An equivalent finding is that the adjusted p-value for these two mortar
types, given in the p adj column, is large (0.803).
PIM-OCM: We are (at least) 95 percent confident that the difference in the population
mean strengths for the PIM and OCM mortars is between 8.17 and 22.27 MPa.

This confidence interval does not include 0 and contains only positive values.
This suggests that the population mean strength of the PIM mortar is greater
than the population mean strength of the OCM mortar.
An equivalent finding is that the adjusted p-value for these two mortar
types, given in the p adj column, is very small (<0.001).
Interpretations for the remaining 4 confidence intervals are written similarly.
The main point is this:

If a pairwise confidence interval (for two population means) includes 0, then
these population means are not declared to be different.
If a pairwise interval does not include 0, then the population means are
declared to be different.
The conclusions we make for all possible pairwise comparisons are at the
100(1 − α) percent confidence level.
Mortar data: The following pairs of population means are declared to be different:

PIM-OCM  RM-OCM  PIM-PCM  RM-PCM.

The following pairs of population means are declared to be not different:

PCM-OCM  RM-PIM.

We can therefore conclude:

The PIM and RM population mean strengths are larger than the OCM and PCM population mean strengths.
The PIM and RM population mean strengths are not different.
The OCM and PCM population mean strengths are not different.

Furthermore, we have an overall (family-wise) confidence level of 95 percent that all of our
conclusions are correct. Had we not used an adjusted analysis based on Tukey's method
(e.g., just calculate all unadjusted pairwise intervals), our overall confidence level would
have been much lower (as low as 70 percent).
CHAPTER 10
Linear Regression

10.1 Introduction
Setting: Suppose we model a response Y in terms of independent variables x1, x2, ..., xk as

Y = g(x1, x2, ..., xk) + ε, where g(x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk;

that is, g is a linear function of β0, β1, β2, ..., βk. We call this a linear regression model.

The response variable Y is random (but we do get to observe its value).
The independent variables x1, x2, ..., xk are fixed (and observed).
The regression parameters β0, β1, β2, ..., βk are unknown. These are to be estimated on the basis of the observed data.
The error term ε is random (and not observed).
Examples: Each of the following is a linear regression model:

Y = β0 + β1x + ε, with g(x) = β0 + β1x
Y = β0 + β1x + β2x² + ε, with g(x) = β0 + β1x + β2x²
Y = β0 + β1x1 + β2x2 + β3x1x2 + ε, with g(x1, x2) = β0 + β1x1 + β2x2 + β3x1x2.

The term "linear" does not refer to the shape of the regression function g. It refers to
how the regression parameters β0, β1, β2, ..., βk enter the g function.
Important: Regression models (linear or otherwise) are models for a population of
individuals. From a statistical inference standpoint, our goal is the same as in previous
chapters. We will use sample information to estimate the population parameters in the
model. We say that we are estimating or fitting the model with the observed data.
10.2 Simple linear regression

The simple linear regression model posits Y = β0 + β1x + ε with E(ε) = 0, so that
E(Y) = β0 + β1x. Therefore, we have the following interpretations for the population
regression parameters β0 and β1:

β0 quantifies the population mean of Y when x = 0.
β1 quantifies the population-level change in E(Y) brought about by a one-unit
change in x.
Example 10.1. As part of a waste removal project, a new compression machine for
processing sewage sludge is being studied. Engineers are interested in the following
variables:
Y = moisture content of the compressed sludge (measured as a percentage)
x = machine filtration rate (measured in kg-DS/m/hr).

Obs    x      Y      Obs    x      Y
 1   125.3   77.9    11   159.5   79.9
 2    98.2   76.8    12   145.8   79.0
 3   201.4   81.5    13    75.1   76.7
 4   147.3   79.8    14   151.4   78.2
 5   145.9   78.2    15   144.2   79.5
 6   124.7   78.3    16   125.0   78.1
 7   112.2   77.5    17   198.8   81.5
 8   120.2   77.0    18   132.5   77.0
 9   161.2   80.1    19   159.6   79.0
10   178.9   80.2    20   110.7   78.6
Table 10.1: Sewage data. Moisture (Y, measured as a percentage) and machine filtration
rate (x, measured in kg-DS/m/hr). There are n = 20 observations.
[Figure 10.1: Scatterplot of moisture percentage (Y) versus filtration rate (x) for the sewage data.]
Figure 10.1 displays the sample data in a scatterplot. This sample information suggests
the variables Y and x are linearly related, although there is a large amount of variation
that is unexplained.
This unexplained variability could arise from other independent variables (e.g.,
applied temperature, pressure, sludge mass, etc.) that also influence the moisture
percentage Y but are not present in the model.
It could also arise from measurement error or just random variation in the sludge
compression process.
Inference: What does the sample information suggest about the population? Do we
have evidence that Y and x are linearly related in the population?
10.3 Least squares estimation

Method: We estimate β0 and β1 using the method of least squares, that is, by minimizing

Q(β0, β1) = Σi [Yi − (β0 + β1xi)]².

Denote the least squares estimators by b0 and b1, respectively, that is, the values of β0
and β1 that minimize Q(β0, β1). A two-variable calculus minimization argument can be
used to find expressions for b0 and b1. Taking partial derivatives of Q(β0, β1) and setting
each equal to zero, we obtain

∂Q(β0, β1)/∂β0 = −2 Σi (Yi − β0 − β1xi) = 0
∂Q(β0, β1)/∂β1 = −2 Σi (Yi − β0 − β1xi)xi = 0.
[Figure 10.2: Scatterplot of the sewage data with the least squares line superimposed.]

From R, the least squares coefficient estimates are:

(Intercept)  filtration.rate
  72.95855          0.04103
The least squares estimates (to 3 dp) for the sewage data are
b0 = 72.959
b1 = 0.041.
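A sketch of the R fit that produces these estimates (the vector name moisture is illustrative; filtration.rate matches the output above). The fitted object fit is used again later with confint and predict:

> fit = lm(moisture ~ filtration.rate)
> coef(fit)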
10.4 Model assumptions and sampling distributions
Results: Under the model assumptions (in particular, that the errors ε are independent
and normally distributed with mean 0 and constant variance σ²), we can derive the
following results for the simple linear regression model

Y = β0 + β1x + ε.

Result 1:

Y ∼ N(β0 + β1x, σ²).

In other words, the response variable Y is normally distributed with mean β0 + β1x
and variance σ². Note that the population mean of Y depends on x. The population
variance of Y does not depend on x.
Result 2. The least squares estimators b0 and b1 are unbiased estimators of β0
and β1, respectively, that is,

E(b0) = β0 and E(b1) = β1.

Result 3. The least squares estimators b0 and b1 have normal sampling distributions; specifically,

b0 ∼ N(β0, c00σ²) and b1 ∼ N(β1, c11σ²),
where

c00 = 1/n + x̄²/SSxx and c11 = 1/SSxx.
These distributions are needed to write confidence intervals and perform hypothesis
tests for β0 and β1 (i.e., to perform statistical inference for the population).
10.5 Fitted values and residuals
Obs    x      Y     Ŷ = b0 + b1x   e = Y − Ŷ     Obs    x      Y     Ŷ = b0 + b1x   e = Y − Ŷ
 1   125.3   77.9      78.100       −0.200       11   159.5   79.9      79.503        0.397
 2    98.2   76.8      76.988       −0.188       12   145.8   79.0      78.941        0.059
 3   201.4   81.5      81.223        0.277       13    75.1   76.7      76.040        0.660
 4   147.3   79.8      79.003        0.797       14   151.4   78.2      79.171       −0.971
 5   145.9   78.2      78.945       −0.745       15   144.2   79.5      78.876        0.624
 6   124.7   78.3      78.075        0.225       16   125.0   78.1      78.088        0.012
 7   112.2   77.5      77.563       −0.062       17   198.8   81.5      81.116        0.384
 8   120.2   77.0      77.891       −0.891       18   132.5   77.0      78.396       −1.396
 9   161.2   80.1      79.573        0.527       19   159.6   79.0      79.508       −0.508
10   178.9   80.2      80.299       −0.099       20   110.7   78.6      77.501        1.099

Table 10.2: Sewage data. Fitted values and residuals from the least squares fit.
Note that

If an observation's Y value is above the least squares regression line, then Yi > Ŷi
and its residual ei is positive.
If an observation's Y value is below the least squares regression line, then Yi < Ŷi
and its residual ei is negative.
If an observation's Y value is on the least squares regression line, then Yi = Ŷi and
its residual ei is zero.
Interesting fact: In our simple linear regression model,

Σi ei = Σi (Yi − Ŷi) = 0.

That is, the residuals sum to zero (this can be verified for the sewage data in Example 10.1).

Terminology: The residual sum of squares is

SSres = Σi ei² = Σi (Yi − Ŷi)²,

and the mean squared residual

MSres = SSres/(n − 2)

is an unbiased estimator of the error variance σ². The estimator σ̂ = √MSres estimates
the error standard deviation σ.
10.6 Statistical inference for β1
Confidence interval: Under our regression model assumptions, the following sampling
distribution arises:

t = (b1 − β1)/√(MSres/SSxx) ∼ t(n − 2).

This result can be used to derive a 100(1 − α) percent confidence interval for β1,
which is given by

b1 ± t_{n−2,α/2} √(MSres/SSxx).

The value t_{n−2,α/2} is the upper α/2 quantile from the t(n − 2) distribution.
Note the form of the interval:

point estimate ± quantile × standard error,

where the point estimate is b1, the quantile is t_{n−2,α/2}, and the standard error is
√(MSres/SSxx).
Sewage data: We can use the confint function in R to calculate a 95 percent confidence
interval for β1:

> confint(fit,level=0.95)
                       2.5 %      97.5 %
(Intercept)     71.49309400 74.42399995
filtration.rate  0.03087207  0.05119547
Interpretation: We are 95 percent confident that the population parameter β1 is between 0.0309 and 0.0511. This means

for every one unit increase in the machine filtration rate x, we are 95 percent
confident that the population mean moisture percentage E(Y) will increase between 0.0309
and 0.0511 percent.

Note that this interval does not contain 0 and includes only positive values. There
is strong evidence that the moisture percentage Y is positively linearly related to machine
filtration rate x in the population. The confidence interval gives information about how
strong this relationship is.
Hypothesis test: Under our regression model assumptions, if we wanted to formally
test

H0: β1 = 0 versus H1: β1 ≠ 0,

we would use

t = b1/√(MSres/SSxx)

as a test statistic. From the R output,

                 Estimate  Std. Error  t value
(Intercept)     72.958547    0.697528  104.596
filtration.rate  0.041034    0.004837    8.484
Figure 10.3: Sewage data: t(18) pdf. This is the sampling distribution of t when H0:
β1 = 0 is true. An × at t = 8.484 has been added.
The test statistic is

t = b1/√(MSres/SSxx) = 0.041034/0.004837 ≈ 8.484.

Figure 10.3 shows that t = 8.484 is not an expected outcome from the t(18) distribution,
the sampling distribution of t when H0: β1 = 0 is true. We therefore have strong evidence
against H0; that is, evidence that Y and x are linearly related.
10.7 Confidence and prediction intervals for a given x = x0
In the confidence interval for E(Y|x0), we call Ŷ(x0) a point estimator.
In the prediction interval for Y(x0), we call Ŷ(x0) a point predictor.
The primary difference in the intervals arises in assessing the variability of Ŷ(x0).
Confidence interval: A 100(1 − α) percent confidence interval for the population mean
E(Y|x0) is given by

Ŷ(x0) ± t_{n−2,α/2} √(MSres [1/n + (x0 − x̄)²/SSxx]).

Prediction interval: A 100(1 − α) percent prediction interval for the new response
Y(x0) is given by

Ŷ(x0) ± t_{n−2,α/2} √(MSres [1 + 1/n + (x0 − x̄)²/SSxx]).
Comparison: The two intervals have the same form and are nearly identical.

The extra "1" in the prediction interval's standard error arises from the
additional uncertainty associated with predicting a new response from the
N(β0 + β1x0, σ²) distribution.
Therefore, at the same value of x0, a 100(1 − α) percent prediction interval for
Y(x0) will necessarily be wider than the corresponding 100(1 − α) percent
confidence interval for E(Y|x0).
Interval length: The length of both intervals depends on the value of $x_0$.

• The standard error in either interval is smallest when $x_0 = \bar{x}$ and gets larger the farther $x_0$ is from $\bar{x}$ in either direction.

• This implies that the precision with which we estimate $E(Y|x_0)$ or predict $Y(x_0)$ decreases the farther $x_0$ is from $\bar{x}$.

• This makes intuitive sense: we would expect to have the most confidence in our fitted model near the center of the observed data.
Warning: It is sometimes desired to estimate $E(Y|x_0)$ or predict $Y(x_0)$ for values of $x_0$ outside the range of $x$ values used in the study. This is called extrapolation and can be very dangerous.

• In order for our inferences to be valid, we must believe that the model holds for $x$ values outside the range where we have observed data.

• In some situations, this may be reasonable. In others, we may have no theoretical basis for making such a claim without data to support it.
Example 10.1 (continued). In our sewage example, suppose that we are interested in estimating $E(Y|x_0)$ and predicting a new $Y(x_0)$ when the filtration rate is $x_0 = 150$ kg-DS/m/hr.

• $E(Y|x_0)$ denotes the population mean moisture percentage when the machine filtration rate is $x_0 = 150$ kg-DS/m/hr.

• $Y(x_0)$ denotes the moisture percentage $Y$ for an individual sludge specimen when the filtration rate is $x_0 = 150$ kg-DS/m/hr.
R automates the calculation of confidence and prediction intervals, as seen below; the common point estimate/prediction is $\hat{Y}(150) = 72.958547 + 0.041034(150) \approx 79.11$.

> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="confidence")
    fit   lwr   upr
1 79.11 78.79 79.44
> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="prediction")
    fit   lwr   upr
1 79.11 77.68 80.55
Figure 10.4: Sewage data. Scatterplot of moisture (percentage) versus filtration rate, with 95 percent confidence bands for $E(Y|x_0)$ and 95 percent prediction bands for $Y(x_0)$ superimposed.
• A 95 percent confidence interval for $E(Y|x_0 = 150)$ is (78.79, 79.44). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the population mean moisture percentage is between 78.79 and 79.44 percent.

• A 95 percent prediction interval for $Y(x_0 = 150)$ is (77.68, 80.55). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the moisture percentage for a single specimen will be between 77.68 and 80.55 percent.

• Figure 10.4 shows 95 percent confidence bands for $E(Y|x_0)$ and 95 percent prediction bands for $Y(x_0)$. These are not simultaneous bands (i.e., these are not bands for the entire population regression function).
11 Multiple Linear Regression

11.1 Introduction
Example 11.1. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study from the LaTrobe Valley of Victoria, Australia, specimens of cheddar cheese were analyzed for their chemical composition and were subjected to taste tests. For each specimen, the taste $Y$ was obtained by combining the scores from several tasters.
Table 11.1: Cheese data. ACETIC, H2S, and LACTIC are independent variables. The response variable is TASTE.

Specimen  TASTE  ACETIC    H2S  LACTIC    Specimen  TASTE  ACETIC    H2S  LACTIC
    1      12.3   4.543  3.135    0.86        16     40.9   6.365  9.588    1.74
    2      20.9   5.159  5.043    1.53        17     15.9   4.787  3.912    1.16
    3      39.0   5.366  5.438    1.57        18      6.4   5.412  4.700    1.49
    4      47.9   5.759  7.496    1.81        19     18.0   5.247  6.174    1.63
    5       5.6   4.663  3.807    0.99        20     38.9   5.438  9.064    1.99
    6      25.9   5.697  7.601    1.09        21     14.0   4.564  4.949    1.15
    7      37.3   5.892  8.726    1.29        22     15.2   5.298  5.220    1.33
    8      21.9   6.078  7.966    1.78        23     32.0   5.455  9.242    1.44
    9      18.1   4.898  3.850    1.29        24     56.7   5.855  10.20    2.01
   10      21.0   5.242  4.174    1.58        25     16.8   5.366  3.664    1.31
   11      34.9   5.740  6.142    1.68        26     11.6   6.043  3.219    1.46
   12      57.2   6.446  7.908    1.90        27     26.5   6.458  6.962    1.72
   13       0.7   4.477  2.996    1.06        28      0.7   5.328  3.912    1.25
   14      25.9   5.236  4.942    1.30        29     13.4   5.802  6.685    1.08
   15      54.9   6.151  6.752    1.52        30      5.5   6.176  4.787    1.25
Data were collected on the following variables:

• $Y$ = taste score (TASTE)
• $x_1$ = concentration of acetic acid (ACETIC)
• $x_2$ = concentration of hydrogen sulfide (H2S)
• $x_3$ = concentration of lactic acid (LACTIC).
11.2 Multiple linear regression models
In multiple linear regression, we observe a response $Y$ and $k$ independent variables $x_1, x_2, ..., x_k$ on each of $n$ individuals:

Individual    Y      x1     x2    ...    xk
    1         Y1     x11    x12   ...    x1k
    2         Y2     x21    x22   ...    x2k
    .         .      .      .            .
    n         Yn     xn1    xn2   ...    xnk
Each of the $n$ individuals contributes a response $Y$ and a value of each of the independent variables. The value
$$x_{ij} = \text{measurement on the } j\text{th independent variable for the } i\text{th individual},$$
for $i = 1, 2, ..., n$ and $j = 1, 2, ..., k$. For the $n$ individuals, we write
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,$$
for $i = 1, 2, ..., n$.
Matrix representation: To estimate the population parameters $\beta_0, \beta_1, \beta_2, ..., \beta_k$, we again use least squares. In doing so, it is advantageous to express multiple linear regression models in terms of matrices and vectors. This streamlines notation and makes the presentation easier. Define
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$
With these definitions, the model for all $n$ individuals can be written compactly as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
For the cheese data in Example 11.1, with $n = 30$ and $k = 3$,
$$\mathbf{Y} = \begin{pmatrix} 12.3 \\ 20.9 \\ \vdots \\ 5.5 \end{pmatrix}_{30 \times 1}, \quad
\mathbf{X} = \begin{pmatrix} 1 & 4.543 & 3.135 & 0.86 \\ 1 & 5.159 & 5.043 & 1.53 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 6.176 & 4.787 & 1.25 \end{pmatrix}_{30 \times 4}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}_{4 \times 1}, \quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{30} \end{pmatrix}_{30 \times 1}.$$
Least squares: The notion of least squares is the same in multiple linear regression as it was in simple linear regression. Specifically, we want to find the values of $\beta_0, \beta_1, \beta_2, ..., \beta_k$ that minimize
$$Q(\beta_0, \beta_1, \beta_2, ..., \beta_k) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik})^2.$$
In matrix notation, this objective function is $Q(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$.
Solution: We want to find the value of $\boldsymbol{\beta}$ that minimizes $Q(\boldsymbol{\beta})$. Because $Q(\boldsymbol{\beta})$ is a scalar function of the $p = k + 1$ elements of $\boldsymbol{\beta}$, it is possible to use calculus to determine the values of the $p$ elements that minimize it. Formally, we can take $p$ partial derivatives, one with respect to each of $\beta_0, \beta_1, \beta_2, ..., \beta_k$, and set these equal to zero. Using the calculus of matrices, we can write the resulting system of $p$ equations (in $p$ unknowns) as follows:
$$\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}.$$
These are called the normal equations. Provided that $\mathbf{X}'\mathbf{X}$ is full rank, the (unique) solution is
$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}.$$
This is the least squares estimator of $\boldsymbol{\beta}$.
This is the least squares estimator of .
Technical note: For the least squares estimator
b = (X X)1 X Y
to be unique, we need X to be of full column rank; i.e., r(X) = p = k + 1. This
will occur when there are no linear dependencies among the columns of X. If r(X) < p,
then X X does not have a unique inverse, and the normal equations can not be solved
uniquely. Statistical software such as R will alert you when X X is not full rank.
Cheese data: We now use R to calculate the least squares estimate $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ for the cheese data in Example 11.1:

> fit = lm(taste~acetic+h2s+lactic)
> fit

Coefficients:
(Intercept)       acetic          h2s       lactic
    -28.877        0.328        3.912       19.670
That is,
$$\mathbf{b} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} -28.877 \\ 0.328 \\ 3.912 \\ 19.670 \end{pmatrix}.$$
Therefore, the estimated regression model based on the data is
$$\hat{Y} = -28.877 + 0.328x_1 + 3.912x_2 + 19.670x_3,$$
or, in other words,
$$\widehat{\text{TASTE}} = -28.877 + 0.328\,\text{ACETIC} + 3.912\,\text{H2S} + 19.670\,\text{LACTIC}.$$
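As a check, $\mathbf{b}$ can also be computed directly from the normal equations; a minimal sketch, assuming the vectors taste, acetic, h2s, and lactic hold the columns of Table 11.1:

> X = cbind(1,acetic,h2s,lactic)       # 30 x 4 design matrix
> b = solve(t(X)%*%X, t(X)%*%taste)    # solves X'Xb = X'Y
> b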
11.3 Estimating the error variance
The residual sum of squares is
$$SS_{res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,$$
and the mean squared residual is $MS_{res} = SS_{res}/(n-p)$, where $p = k + 1$. Its square root,
$$\hat{\sigma} = \sqrt{MS_{res}} = \sqrt{\frac{SS_{res}}{n-p}},$$
estimates $\sigma$ and is called the residual standard error. This result is analogous to the simple linear regression result in Chapter 10. The only difference is the divisor in $MS_{res}$.
11.4 The analysis of variance for linear regression
Identity: The following algebraic identity arises for a linear regression model fit (simple or multiple):
$$SS_{total} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = SS_{reg} + SS_{res}.$$
This information is used to produce an analysis of variance (ANOVA) table.
Table 11.2: Analysis of variance table for linear regression.

Source       df      SS         MS                         F
Regression   p-1     SSreg      MSreg = SSreg/(p-1)        F = MSreg/MSres
Residual     n-p     SSres      MSres = SSres/(n-p)
Total        n-1     SStotal
Notes:

• This table summarizes how the variability in the response data is partitioned.

• $SS_{total}$ is the total sum of squares. It measures the total variation in the response data.

• $SS_{reg}$ is the regression sum of squares. It measures the variation in the response data explained by the estimated regression model.

• $SS_{res}$ is the residual sum of squares. It measures the variation in the response data not explained by the estimated regression model.

• The degrees of freedom (df) add down. The degrees of freedom for $SS_{total}$ is the divisor in the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \frac{SS_{total}}{n-1}.$$

• The mean squared residual
$$MS_{res} = \frac{SS_{res}}{n-p}$$
is an unbiased estimator of $\sigma^2$. Mean squares (MS) are the sums of squares divided by their degrees of freedom.

• The $F$ statistic is formed by taking the ratio of $MS_{reg}$ and $MS_{res}$. More on this in a moment.
Cheese data: I used SAS to calculate the ANOVA table for the cheese data:

                    Analysis of Variance

                          Sum of       Mean
Source            DF     Squares     Square    F Value    Pr > F
Regression         3    4994.508   1664.836      16.22    <.0001
Residual          26    2668.378    102.629
Corrected Total   29    7662.886
Remark: The reason I used SAS here is that R does something different when displaying the analysis of variance (it breaks down the regression sum of squares further). More on this in a moment. You can get R to produce this type of ANOVA table, but it takes extra work and is not worth it.
Overall F test: In the multiple linear regression model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon,$$
the $F$ statistic in the ANOVA table can be used to test
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
versus
$$H_1: \text{at least one of the } \beta_j\text{'s is nonzero}.$$
In other words, $F$ tests whether at least one of the independent variables $x_1, x_2, ..., x_k$ is important in describing the response $Y$ in the population ($H_0$ says no; $H_1$ says yes). If $H_0$ is rejected, we do not know which one or how many of the $\beta_j$'s are nonzero; only that at least one is.

Sampling distribution: When $H_0$ is true, both $MS_{reg}$ and $MS_{res}$ are unbiased estimators of $\sigma^2$. Therefore, when $H_0$ is true,
$$F = \frac{MS_{reg}}{MS_{res}} \approx 1,$$
and values of $F$ much larger than 1 are evidence against $H_0$. When $H_0$ is true,
$$F = \frac{MS_{reg}}{MS_{res}} \sim F(p-1, n-p).$$
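The p-value reported in the ANOVA table can be reproduced directly; a minimal sketch:

> 1-pf(16.22,df1=3,df2=26)   # p-value for F = 16.22 against F(3,26)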
Figure 11.1: Cheese data: $F(3, 26)$ pdf. This is the sampling distribution of $F$ when $H_0$ is true. A mark at $F = 16.22$ has been added.
Cheese data: For the cheese data in Example 11.1, the $F$ statistic is used to test
$$H_0: \beta_1 = \beta_2 = \beta_3 = 0$$
versus
$$H_1: \text{at least one of the } \beta_j\text{'s is nonzero}.$$

Interpretation: Based on the $F$ statistic ($F = 16.22$) and the corresponding probability value (p-value < 0.0001), we have strong evidence to reject $H_0$. See also Figure 11.1. We conclude that at least one of the independent variables (ACETIC, H2S, LACTIC) is important in describing TASTE in the population.
Remark: In the next section, we learn how to investigate the population-level effects of each variable separately.
Coefficient of determination: The proportion of the total variation in the response data explained by the estimated regression model is
$$R^2 = \frac{SS_{reg}}{SS_{total}}.$$
For the cheese data, from the ANOVA table above,
$$R^2 = \frac{SS_{reg}}{SS_{total}} = \frac{4994.508}{7662.886} \approx 0.652.$$
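The same value is reported directly by R; a minimal sketch, using the fit object from earlier:

> summary(fit)$r.squared   # approximately 0.652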
Interpretation: About 65.2 percent of the variability in the TASTE data is explained by the linear regression model that includes ACETIC, H2S, and LACTIC. The remaining 34.8 percent of the variability in the taste data is explained by other sources.

Warning: It is important to understand what $R^2$ measures and what it does not. Its value is computed under the assumption that the regression model is correct and assesses how much of the variation in the response is attributed to that relationship.
• If $R^2$ is small, it may be that there is just a lot of random inherent variation in the data. Although the estimated regression model is reasonable, it can explain only so much of the overall variation.

• Alternatively, $R^2$ may be large (e.g., close to 1) but for an estimated model that is not appropriate for the data. A better model may exist.
Question: How does R display an analysis of variance table? For the cheese data in Example 11.1, R provides

> fit = lm(taste~acetic+h2s+lactic)
> anova(fit)
Analysis of Variance Table

          Df  Sum Sq Mean Sq F value    Pr(>F)
acetic     1 2314.14 2314.14 22.5484       ...
h2s        1 2147.11 2147.11 20.9209       ...
lactic     1  533.26  533.26  5.1959 0.0310870 *
Residuals 26 2668.38  102.63
The sums of squares in this table are sequential sums of squares:

• SS(ACETIC) is the sum of squares explained when compared to a model that includes only an intercept term.

• SS(H2S) is the sum of squares added when compared to a model that includes an intercept term and ACETIC.

• SS(LACTIC) is the sum of squares added when compared to a model that includes an intercept term, ACETIC, and H2S.

In other words, we can use the sequential sums of squares to assess the impact of adding the independent variables ACETIC, H2S, and LACTIC to the model in sequence. The p-values provided by R help you assess the statistical significance of each independent variable as you add them. Small p-values suggest statistical significance.
Interesting: If you change the order of the independent variables in the lm function, then you will get a different sequential sum of squares partition. For example,

> anova(lm(taste~h2s+lactic+acetic))
Analysis of Variance Table

          Df Sum Sq Mean Sq F value  Pr(>F)
h2s        1 4376.8  4376.8     ...     ...
lactic     1  617.1   617.1  6.0131 0.02123 *
acetic     1    0.6     0.6  0.0054 0.94193
Residuals 26 2668.4   102.6
This table suggests that ACETIC does not add significantly to a regression model that already includes H2S and LACTIC (p-value = 0.942). Note that the previous sequential sum of squares partition (shown above) does not enable us to see this.
11.5 Inference for individual regression parameters
We can also perform inference for each population parameter $\beta_j$ separately.

• This can help us assess the importance of using the independent variable $x_j$ in a model that includes the other independent variables.

• That is, inference regarding the population parameter $\beta_j$ is always conditional on the other variables being included in the model.
Confidence intervals: Under our linear regression model assumptions, a $100(1-\alpha)$ percent confidence interval for $\beta_j$, for $j = 0, 1, 2, ..., k$, is given by
$$b_j \pm t_{n-p,\alpha/2} \sqrt{MS_{res}\, c_{jj}},$$
where $b_j$ is the least squares estimate of $\beta_j$, $MS_{res}$ is our estimate of the error variance $\sigma^2$, and $c_{jj} = (\mathbf{X}'\mathbf{X})^{-1}_{jj}$ is the corresponding diagonal element of the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix. The value $t_{n-p,\alpha/2}$ is the upper $\alpha/2$ quantile from the $t(n-p)$ distribution. Note the familiar form of the interval:
$$\text{point estimate} \pm \text{quantile} \times \text{standard error},$$
with point estimate $b_j$, quantile $t_{n-p,\alpha/2}$, and standard error $\sqrt{MS_{res}\, c_{jj}}$.
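These intervals can be computed from the matrix quantities directly; a minimal sketch, assuming the X matrix, the b vector, and the taste vector from the earlier matrix computation:

> n = 30; p = 4
> MSres = sum((taste-X%*%b)^2)/(n-p)   # estimate of the error variance
> cjj = diag(solve(t(X)%*%X))          # diagonal of (X'X)^(-1)
> b - qt(0.975,n-p)*sqrt(MSres*cjj)    # lower limits
> b + qt(0.975,n-p)*sqrt(MSres*cjj)    # upper limits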
Cheese data: We can use the confint function in R to calculate confidence intervals for the population regression parameters:

> fit = lm(taste~acetic+h2s+lactic)
> confint(fit,level=0.95)
                  2.5 %    97.5 %
(Intercept)         ...       ...
acetic        -8.839009  9.495026
h2s            1.345693  6.477870
lactic         1.932318 37.407035
Interpretation: I will ignore the intercept confidence interval, which describes $E(Y)$ when $x_1 = x_2 = x_3 = 0$, a nonsensical quantity. Here is how you interpret the other confidence intervals:

• We are 95 percent confident that $\beta_1$ (the population parameter for ACETIC) is between $-8.84$ and $9.50$. This interval includes 0. Therefore, ACETIC does not significantly add to a model that includes H2S and LACTIC. This reaffirms what we saw in the sequential SS when ACETIC was added last.

• We are 95 percent confident that $\beta_2$ (the population parameter for H2S) is between 1.35 and 6.48. This interval does not include 0. Therefore, H2S does significantly add to a model that includes ACETIC and LACTIC.

• We are 95 percent confident that $\beta_3$ (the population parameter for LACTIC) is between 1.93 and 37.41. This interval does not include 0. Therefore, LACTIC does significantly add to a model that includes ACETIC and H2S.
11.6 Confidence and prediction intervals for a given $\mathbf{x}_0$
Goals: We would like to create $100(1-\alpha)$ percent intervals for the mean $E(Y|\mathbf{x}_0)$ and for a new value $Y(\mathbf{x}_0)$. As in simple linear regression, the former is called a confidence interval (because it is for a mean response) and the latter is called a prediction interval (because it is for a new random variable).

Cheese data: Suppose we are interested in estimating $E(Y|\mathbf{x}_0)$ and predicting a new $Y(\mathbf{x}_0)$ when ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, so that
$$\mathbf{x}_0 = \begin{pmatrix} 5.5 \\ 6.0 \\ 1.4 \end{pmatrix}.$$
We use R to compute the following:

> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="confidence")
  fit lwr upr
  ...
> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="prediction")
  fit lwr upr
  ...

In both cases, the point estimate/prediction is $\hat{Y}(\mathbf{x}_0) = -28.877 + 0.328(5.5) + 3.912(6.0) + 19.670(1.4) \approx 23.94$.
11.7 Model diagnostics (residual analysis)
Importance: We now discuss diagnostic techniques for linear regression (simple and multiple). The term "diagnostics" refers to the process of checking the model assumptions. This is an important exercise because if the model assumptions are violated, then our analysis and all subsequent interpretations could be compromised.

Recall: We first recall the model assumptions on the error terms in the linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,$$
for $i = 1, 2, ..., n$. Specifically, we have made the following assumptions:

• $E(\epsilon_i) = 0$, for $i = 1, 2, ..., n$
• $\text{var}(\epsilon_i) = \sigma^2$, for $i = 1, 2, ..., n$; that is, the variance is constant
• the random variables $\epsilon_i$ are independent
• the random variables $\epsilon_i$ are normally distributed.
Residuals: In checking our model assumptions, we first have to deal with an obvious problem; namely, the error terms $\epsilon_i$ in the model are never observed. However, after fitting the model, we can calculate the residuals
$$e_i = Y_i - \hat{Y}_i,$$
where the $i$th fitted value is
$$\hat{Y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}.$$
We can think of the residuals $e_1, e_2, ..., e_n$ as proxies for the error terms $\epsilon_1, \epsilon_2, ..., \epsilon_n$. Therefore, we can use the residuals to check our model assumptions instead.
Normality: To check the normality assumption for the errors in linear regression, we can examine a qq-plot of the residuals.
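A minimal sketch, using the cheese-data fit object from earlier:

> qqnorm(resid(fit))   # normal qq-plot of the residuals
> qqline(resid(fit))   # reference line added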
Figure 11.2: Cheese data. Normal qq-plot of the least squares residuals.
• Recall that if the plotted points follow a straight line (approximately), this supports the normality assumption.

• Substantial deviation from linearity is not consistent with the normality assumption.

• The plot in Figure 11.2 supports the normality assumption for the errors in the multiple linear regression model for the cheese data.
Terminology: A residual plot is a scatterplot of the residuals $e_i$ (on the vertical axis) versus the fitted values $\hat{Y}_i$ (on the horizontal axis). A residual plot can be very useful in detecting the following violations:

• misspecifying the true regression function; i.e., a violation of the $E(\epsilon_i) = 0$ assumption
• non-constant variance (heteroscedasticity); i.e., a violation of the $\text{var}(\epsilon_i) = \sigma^2$ assumption
• correlated observations over time; i.e., a violation of the assumption that the $\epsilon_i$'s are independent random variables.

Important: Mathematical arguments show that if all of the linear regression model assumptions hold, then the residuals and fitted values are independent.

• Therefore, if the residual plot appears random in appearance with no noticeable patterns (i.e., the plot looks like a random scatter of points), this suggests there are no model inadequacies.

• On the other hand, if there are structural (non-random) patterns in the residual plot, this suggests that the model is inadequate in some way. Furthermore, the residual plot often reveals what type of model violation is occurring. A residual plot takes one line of R, as sketched below.
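A minimal sketch, again using the cheese-data fit object:

> plot(fitted(fit),resid(fit),xlab="Fitted values",ylab="Residuals")
> abline(h=0)   # horizontal reference line at zero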
Cheese data: The residual plot in Figure 11.3 does not suggest any obvious model inadequacies. It looks completely random in appearance.

Note: We now look at two new regression examples. We use these examples to illustrate model violations that are commonly seen in practice. We also discuss remedies to handle these violations.
Figure 11.3: Cheese data. Residual plot for the multiple linear regression model fit. A horizontal line at zero has been added.
Figure 11.4: Electricity data. Left: Scatterplot of peak demand ($Y$, measured in kWh) versus monthly usage ($x$, measured in kWh) with estimated simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Problem: There is a clear problem with non-constant variance here. Note how the residual plot fans out like the bell of a trumpet. This violation might have been missed by looking at the scatterplot alone, but the residual plot highlights it.
Remedy: When faced with a non-constant variance violation in regression, a common remedy is to transform the response variable $Y$; common transformations include the square root and logarithmic transformations. For the electricity data, we fit the transformed model
$$W = \beta_0 + \beta_1 x + \epsilon,$$
where
$$W = \sqrt{\text{peak hour electricity demand}} \quad (\text{measured in } \sqrt{\text{kWh}}).$$
Fitting this model in R gives the estimates

  (Intercept)  monthly.usage
    0.5808309      0.0009529
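A minimal sketch of this fit, with peak.demand as a hypothetical name for the response vector:

> fit.2 = lm(sqrt(peak.demand)~monthly.usage)   # transformed response
> fit.2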
Discussion: First note that applying the transformation did help to reduce the non-constant variance problem considerably; see Figure 11.5. The noticeable fanning-out shape that we saw in the residual plot previously (i.e., based on the untransformed response $Y$) is now largely absent. Let's proceed with inference for $\beta_1$ to determine if the linear relationship is significant for the population:

> confint(fit.2,level=0.95)
                      2.5 %      97.5 %
(Intercept)    0.3208043932 0.840857384
monthly.usage      0.000756    0.001150
Figure 11.5: Electricity data. Left: Scatterplot of the square root of peak demand ($\sqrt{Y}$) versus monthly usage ($x$, measured in kWh) with estimated simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit with transformed response.
Interpretation: We are 95 percent confident that the population regression parameter $\beta_1$ (in the transformed model) is between 0.000756 and 0.001150.

• Note that this interval does not include 0 and includes only positive values.

• This suggests that peak demand (on the square root scale) and monthly usage are positively related in the population.

• Specifically, for every one-unit increase in $x$ (monthly usage, measured in kWh), we are 95 percent confident that the mean peak demand, on the square root scale, will increase between 0.000756 and 0.001150.
Figure 11.6: Windmill data. Left: Scatterplot of DC output $Y$ versus wind velocity ($x$, measured in mph) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Example 11.3. An engineer is investigating the use of a windmill to generate electricity. He has collected data on $Y$ = DC output and $x$ = wind velocity (measured in mph); see Figure 11.6.
Remark: I used R to calculate the coefficient of determination from the simple linear regression model fit. It is
$$R^2 \approx 0.875.$$
A novice data analyst (especially one who doesn't even bother to graph the data) might think that because this is "pretty large," the model we have fit is a good model. However, it is easy to see from Figure 11.6 that a simple linear regression model is not a good model for the data. Even though 0.875 is in fact "pretty large," its value refers specifically to a model that is inappropriate.
Remedy: Fit a multiple linear regression model with two independent variables: wind velocity $x$ and its square $x^2$. The model
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$
is called a quadratic regression model. It is straightforward to fit a quadratic regression model in R. We simply regress $Y$ on both $x$ and $x^2$.

> wind.velocity.sq = wind.velocity^2
> fit.2 = lm(DC.output ~ wind.velocity + wind.velocity.sq)
> fit.2

Coefficients:
     (Intercept)     wind.velocity  wind.velocity.sq
        -1.15590           0.72294          -0.03812
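As an aside, the same model can be fit without creating the squared variable by hand, using R's I() function:

> fit.2 = lm(DC.output ~ wind.velocity + I(wind.velocity^2))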
Figure 11.7: Windmill data. Left: Scatterplot of DC output $Y$ versus wind velocity ($x$, measured in mph) with least squares quadratic regression curve superimposed. Right: Residual plot for the quadratic regression model fit.
Confidence interval: To see if the quadratic effect between DC output and wind velocity is significant, we can write a confidence interval for $\beta_2$, the population parameter in the quadratic regression model that describes the quadratic effect.

> confint(fit.2,level=0.95)
                        2.5 %       97.5 %
(Intercept)       -1.51810023  -0.79369625
wind.velocity      0.59554751   0.85032429
wind.velocity.sq          ...          ...
Figure 11.8: Windmill data. QQ plots of the residuals. Left: Simple linear regression fit. Right: Quadratic regression fit.
Remark: I used R to calculate the coefficient of determination from the quadratic regression model fit. It is
$$R^2 \approx 0.968.$$
This means that about 96.8 percent of the variability in the DC output data is explained by the estimated model that includes both wind velocity and (wind velocity)$^2$. The remaining 3.2 percent is not explained by the estimated model. This is an improvement over the largely meaningless $R^2 \approx 0.875$ calculated from the simple linear regression.
New problem: Figure 11.8 shows the normal qq plots for the simple linear regression model fit (left) and the quadratic model fit (right).

• I say "new problem" because now it looks like the normality assumption (for the quadratic model) is violated. Interestingly, this was not a problem with the simple linear regression model.

• It appears that fitting the quadratic regression model fixed one problem (i.e., selecting a better regression function) but created another (a normality violation).
12 Factorial Experiments

12.1 Introduction
Importance: In engineering experiments, particularly those carried out in manufacturing settings, there are often several variables of interest, and the goal is to understand the effects of these variables on a continuous response $Y$ (e.g., yield, lifetime, fill weights, etc.). A factorial treatment structure is an efficient way of defining treatments in these types of experiments.

• One example of a factorial treatment structure uses $k$ factors, where each factor has two levels. This is called a $2^k$ factorial experiment.

• Factorial experiments are common in the early stages of experimental work. For this reason, they are also called factor screening experiments.
Example 12.1. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem in the final part, as it can lead to nonrecoverable failure. A test is run at the parts producer to determine the effect of $k = 4$ factors on cracks: pouring temperature (A), titanium content (B), heat treatment method (C), and amount of grain refiner used (D).

• Factor A has 2 levels: low temperature and high temperature
• Factor B has 2 levels: low content and high content
• Factor C has 2 levels: Method 1 and Method 2
• Factor D has 2 levels: low amount and high amount.

The response variable in the experiment is
$$Y = \text{length of largest crack (in mm) induced in a piece of sample material}.$$
Note: In this example, there are 4 factors, each with 2 levels. Thus, there are
$$2 \times 2 \times 2 \times 2 = 2^4 = 16$$
different treatment combinations. These are listed here:

a1 b1 c1 d1    a1 b2 c1 d1    a2 b1 c1 d1    a2 b2 c1 d1
a1 b1 c1 d2    a1 b2 c1 d2    a2 b1 c1 d2    a2 b2 c1 d2
a1 b1 c2 d1    a1 b2 c2 d1    a2 b1 c2 d1    a2 b2 c2 d1
a1 b1 c2 d2    a1 b2 c2 d2    a2 b1 c2 d2    a2 b2 c2 d2

For example, the treatment combination a1 b1 c1 d1 holds each factor at its low level, the treatment combination a1 b1 c2 d2 holds Factors A and B at their low levels and Factors C and D at their high levels, and so on.
Terminology: In a $2^k$ factorial experiment, one replicate of the experiment uses $2^k$ runs, one at each of the $2^k$ treatment combinations.

• Therefore, in Example 12.1, one replicate of the experiment would require 16 runs (one at each treatment combination listed above).

• Two replicates would require 32 runs, three replicates would require 48 runs, and so on.
Note: In a $2^k$ factorial experiment, the effects to be estimated can be counted with binomial coefficients: $\binom{k}{0} = 1$ overall mean, $\binom{k}{1} = k$ main effects, $\binom{k}{2} = k(k-1)/2$ two-way interactions, $\binom{k}{3}$ three-way interactions, and so on. Note that
$$\binom{k}{0} + \binom{k}{1} + \binom{k}{2} + \cdots + \binom{k}{k} = \sum_{j=0}^{k} \binom{k}{j} = 2^k,$$
and additionally that $\binom{k}{0}, \binom{k}{1}, ..., \binom{k}{k}$ are the entries in the $(k+1)$st row of Pascal's Triangle. Observe also that $2^k$ grows quickly in size as $k$ increases. For example, if there are $k = 10$ factors (A, B, C, D, E, F, G, H, I, and J, say), then performing just one replicate of the experiment would require $2^{10} = 1024$ runs! In real life, rarely would this type of experiment be possible.
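A quick numerical check of this identity in R:

> sum(choose(10,0:10))   # sums the binomial coefficients for k = 10
[1] 1024
> 2^10
[1] 1024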
12.2 $2^2$ factorial experiments
Remark: We first consider $2^k$ factorial experiments where $k = 2$; that is, there are only two factors, denoted by A and B. This is called a $2^2$ experiment. We illustrate with an agricultural example.

Example 12.2. Predicting corn yield prior to harvest is useful for making feed supply and marketing decisions. Corn must have an adequate amount of nitrogen (Factor A) and phosphorus (Factor B) for profitable production and also for environmental concerns.
The four treatment combinations and their sample mean yields (Y, bushels/plot) are:

Treatment               a1 b1   a1 b2   a2 b1   a2 b2
Sample mean yield (Y)     30      36      32      44

A one-way ANOVA fit, treating the four treatment combinations as the levels of a single factor, gives:

          Df Sum Sq Mean Sq F value  Pr(>F)
treatment  3    575  191.67  9.5833  0.0007
Residuals 16    320   20.00
Figure 12.1: Boxplots of corn yields (bushels/plot) for four treatment groups.
Uninteresting conclusion: The value $F = 9.5833$ is not what we would expect from an $F(3, 16)$ distribution, the distribution of $F$ when
$$H_0: \mu_{11} = \mu_{12} = \mu_{21} = \mu_{22}$$
is true (p-value $\approx$ 0.0007). Therefore, we conclude that at least one of the factorial treatment population means is different.

Remark: As we have discussed before in one-way analyses, the overall $F$ test provides very little information. However, with a factorial treatment structure, it is possible to explore the data further. We can target the (main) effects due to nitrogen (Factor A) and due to phosphorus (Factor B) individually. We can also determine if the two factors nitrogen and phosphorus interact.
Partition: Let us first recall the treatment sum of squares from the one-way ANOVA:
$$SS_{trt} = 575.$$
The way we learn more about specific effects is to partition $SS_{trt}$ into the following pieces: $SS_A$, $SS_B$, and $SS_{AB}$. By "partition," I mean that we will write
$$SS_{trt} = SS_A + SS_B + SS_{AB}.$$
In words,

• $SS_A$ is the sum of squares due to the main effect of A (nitrogen)
• $SS_B$ is the sum of squares due to the main effect of B (phosphorus)
• $SS_{AB}$ is the sum of squares due to the interaction effect of A and B (nitrogen and phosphorus).
Two-way analysis: We can use R to write this partition in a richer ANOVA table:

> fit = lm(yield ~ nitrogen + phosphorus + nitrogen*phosphorus)
> anova(fit)
Analysis of Variance Table

                    Df Sum Sq Mean Sq F value    Pr(>F)
nitrogen             1    125     125    6.25 0.0236742 *
phosphorus           1    405     405   20.25       ...
nitrogen:phosphorus  1     45      45    2.25 0.1530877
Residuals           16    320      20
Interpretation: The $F$ statistics, say $F_A$, $F_B$, and $F_{AB}$, can be used to determine if the respective effects are significant in the population. Small p-values (e.g., p-value < 0.05) indicate that the effect is significant. Effects with large p-values can be treated as not significant.
Figure 12.2: Corn yield data. $F(1, 16)$ pdf. A mark at $F_{AB} = 2.25$ has been added.
Analysis: When analyzing data from a $2^2$ factorial experiment, the first task is to determine if the interaction effect is significant in the population. For the corn yield data in Example 12.2, we see that
$$F_{AB} = 2.25 \quad (\text{p-value} \approx 0.153).$$

• This value of $F_{AB}$ is not all that unreasonable when compared to the $F(1, 16)$ distribution, the sampling distribution of $F_{AB}$ when nitrogen and phosphorus do not interact (i.e., when the population-level interaction effect is zero).

• In other words, we do not have substantial (population-level) evidence that nitrogen and phosphorus interact.
Figure 12.3: Interaction plot for nitrogen and phosphorus in Example 12.2.
Note: An interaction plot is a graphical display that can help us assess (visually) whether two factors interact. In this plot, the levels of Factor A are marked on the horizontal axis. The sample means of the treatments are plotted against the levels of A, and the points corresponding to the same level of Factor B are joined by straight lines.

• If Factors A and B do not interact in the population, the interaction plot should display parallel lines. That is, the effect of one factor on the response $Y$ stays constant across the levels of the other factor. This is essentially what it means to have no interaction.

• If the interaction plot displays a significant departure from parallelism (including an overwhelming case where the lines cross), then this is visual evidence of interaction. That is, the effect of one factor on the response $Y$ depends on the levels of the other factor.

• The $F$ test that uses $F_{AB}$ provides numerical evidence of interaction. The interaction plot provides visual evidence. An interaction plot like Figure 12.3 is easy to produce in R, as sketched below.
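A minimal sketch, assuming the factors nitrogen and phosphorus and the response vector yield from the corn data:

> interaction.plot(x.factor=nitrogen,trace.factor=phosphorus,response=yield)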
Conclusion: We have already used the interaction test statistic $F_{AB}$ to conclude that the interaction effect of nitrogen and phosphorus is not significant in the population. Although the interaction plot in Figure 12.3 is not parallel (remember, it is constructed from the sample data), the departure from parallelism is not statistically significant.
Strategy: Here is a general strategy for analyzing data from a $2^2$ factorial experiment:

1. Start by looking at whether the interaction effect is significant. This can be done by using an interaction plot and an $F$ test that uses $F_{AB}$.

2. If the interaction is significant, then a formal analysis of main effects is not all that meaningful because their interpretations depend on the interaction. In this situation, the best approach is to ignore the factorial treatment structure and redo the entire analysis as a one-way ANOVA with four treatments. Tukey pairwise confidence intervals (see the sketch after this list) can help you formulate an ordering among the 4 treatment population means.

3. If the interaction is not significant, I prefer to re-estimate the model without the interaction term and then examine the main effects. This can be done numerically by examining the sizes of $F_A$ and $F_B$, respectively. Confidence intervals for the differences $\mu_{A1} - \mu_{A2}$ and $\mu_{B1} - \mu_{B2}$ can be used to quantify the size of these effects in the population.

4. Check model assumptions!
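A minimal sketch of the Tukey intervals mentioned in step 2, assuming a hypothetical single factor treatment with levels a1b1, a1b2, a2b1, and a2b2:

> TukeyHSD(aov(yield~treatment),conf.level=0.95)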
Corn yield data: Because the nitrogen/phosphorus interaction is not significant, I redid the ANOVA leaving out the interaction term:

> fit.2 = lm(yield ~ nitrogen + phosphorus)
> anova(fit.2)
Analysis of Variance Table

           Df Sum Sq Mean Sq F value   Pr(>F)
nitrogen    1    125  125.00  5.8219 0.027403 *
phosphorus  1    405  405.00 18.86    0.0004
Residuals  17    365   21.47
Compare this to the ANOVA table that includes interaction (above). Note that the interaction sum of squares $SS_{AB} = 45$ from that ANOVA table has now been absorbed into the residual sum of squares in the no-interaction analysis. Furthermore,

• the main effect of nitrogen (Factor A) is significant in describing yield in the population ($F_A = 5.82$, p-value = 0.027).

• the main effect of phosphorus (Factor B) is significant in describing yield in the population ($F_B = 18.86$, p-value = 0.0004).
Confidence intervals: A 95 percent confidence interval for $\mu_{A1} - \mu_{A2}$, the difference in the population means for the two levels of nitrogen (Factor A), is
$$(\bar{Y}_{A1} - \bar{Y}_{A2}) \pm t_{17,0.025} \sqrt{MS_{res}\left(\frac{1}{10} + \frac{1}{10}\right)}.$$
A 95 percent confidence interval for $\mu_{B1} - \mu_{B2}$, the difference in the population means for the two levels of phosphorus (Factor B), is
$$(\bar{Y}_{B1} - \bar{Y}_{B2}) \pm t_{17,0.025} \sqrt{MS_{res}\left(\frac{1}{10} + \frac{1}{10}\right)}.$$
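A minimal sketch of the nitrogen interval, assuming 10 observations per nitrogen level, hypothetical level labels "a1" and "a2", and $MS_{res} = 21.47$ from the no-interaction fit:

> diff = mean(yield[nitrogen=="a1"]) - mean(yield[nitrogen=="a2"])
> me = qt(0.975,17)*sqrt(21.47*(1/10+1/10))   # margin of error
> c(diff-me,diff+me)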
The computed intervals are:

95% CI for $\mu_{A1} - \mu_{A2}$: $(-9.37, -0.62)$
95% CI for $\mu_{B1} - \mu_{B2}$: $(-13.37, -4.63)$

Figure 12.4: Left: Side-by-side boxplots of corn yields for nitrogen (Factor A). Right: Side-by-side boxplots of corn yields for phosphorus (Factor B).
Interpretation:

• We are 95 percent confident that the difference in the population mean yields (for low nitrogen/high nitrogen) is between $-9.37$ and $-0.62$ bushels per plot. Note that this interval does not include 0 and includes only negative values. This suggests that the population mean yield at the high level of nitrogen is larger than the population mean yield at the low level of nitrogen.

• We are 95 percent confident that the difference in the population mean yields (for low phosphorus/high phosphorus) is between $-13.37$ and $-4.63$ bushels per plot. Note that this interval does not include 0 and includes only negative values. This suggests that the population mean yield at the high level of phosphorus is larger than the population mean yield at the low level of phosphorus.

• Normal qq plots (not shown) for the data look fine.
12.3 An unreplicated $2^4$ experiment

Example 12.3. A chemical product is produced in a pressure vessel. A factorial experiment is carried out to study the factors thought to influence the filtration rate of this product. The four factors are temperature (A), pressure (B), concentration of formaldehyde (C), and stirring rate (D). Each factor is present at two levels (e.g., "low" and "high"). A $2^4$ experiment is performed with one replication; the data are shown below.
Run   Run label     Filtration rate (Y, gal/hr)
  1   a1 b1 c1 d1     45
  2   a2 b1 c1 d1     71
  3   a1 b2 c1 d1     48
  4   a2 b2 c1 d1     65
  5   a1 b1 c2 d1     68
  6   a2 b1 c2 d1     60
  7   a1 b2 c2 d1     80
  8   a2 b2 c2 d1     65
  9   a1 b1 c1 d2     43
 10   a2 b1 c1 d2    100
 11   a1 b2 c1 d2     45
 12   a2 b2 c1 d2    104
 13   a1 b1 c2 d2     75
 14   a2 b1 c2 d2     86
 15   a1 b2 c2 d2     70
 16   a2 b2 c2 d2     96
Note: In this experiment, there are $k = 4$ factors, so there are 16 effects to estimate:

• the 1 overall mean
• the 4 main effects: A, B, C, and D
• the 6 two-way interactions: AB, AC, AD, BC, BD, and CD
• the 4 three-way interactions: ABC, ABD, ACD, and BCD
• the 1 four-way interaction: ABCD.

In this $2^4$ experiment, we have 16 values of $Y$ and 16 effects to estimate. This leaves us with no observations (and therefore no degrees of freedom) to perform statistical tests. This is an obvious problem! We have no way to judge which main effects are significant, and we cannot learn about how these factors interact.
Terminology: A single replicate of a $2^k$ factorial experiment is called an unreplicated factorial. With only one replicate, there is no internal error estimate, so we cannot perform statistical tests to judge significance. What do we do?

• One approach is to assume that certain higher-order interactions are negligible and then combine their sums of squares to estimate the error.

• This is an appeal to the sparsity of effects principle, which states that most systems are dominated by some of the main effects and low-order interactions and that most high-order interactions are negligible.

• To learn about which effects may be negligible, we can fit the full ANOVA model and obtain the sum of squares (SS) for each effect (see below).

• Effects with large SS can be retained. Effects with small SS can be discarded. A smaller model with only the large effects can then be fit. This smaller model will have an error estimate formed by taking all of the effects with small SS and combining them together.
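A sketch of how the full-model sums of squares might be obtained, assuming the factors A, B, C, and D and the response filtration are available:

> fit = lm(filtration ~ A*B*C*D)   # all main effects and interactions
> anova(fit)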
Analysis of Variance Table

          Df  Sum Sq Mean Sq
A          1 1870.56 1870.56
B          1   39.06   39.06
C          1  390.06  390.06
D          1  855.56  855.56
A:B        1    0.06    0.06
A:C        1 1314.06 1314.06
B:C        1   22.56   22.56
A:D        1 1105.56 1105.56
B:D        1    0.56    0.56
C:D        1    5.06    5.06
A:B:C      1   14.06   14.06
A:B:D      1   68.06   68.06
A:C:D      1   10.56   10.56
B:C:D      1   27.56   27.56
A:B:C:D    1    7.56    7.56
Residuals  0    0.00

Warning message:
In anova.lm(fit) :
  ANOVA F-tests on an essentially perfect fit are unreliable

The effects with large sums of squares are A, C, D, A:C, and A:D; the remaining effects have comparatively small sums of squares and can be discarded.
Figure 12.5: Interaction plots in Example 12.3. Left: AC. Right: AD.
Analysis: Here is the R output summarizing the fit of the smaller model that includes only A, C, D, AC, and AD:

> # Fit smaller model
> fit = lm(filtration ~ A + C + D + A*C + A*D)
> anova(fit)
Analysis of Variance Table

          Df  Sum Sq Mean Sq F value   Pr(>F)
A          1 1870.56 1870.56  95.86       ...
C          1  390.06  390.06  19.990 0.001195 **
D          1  855.56  855.56  43.85       ...
A:C        1 1314.06 1314.06  67.34       ...
A:D        1 1105.56 1105.56  56.66       ...
Residuals 10  195.13   19.51
Note: Each of these five effects is strongly significant in the population. The AC (temperature/concentration of formaldehyde) and AD (temperature/stirring rate) interaction plots in Figure 12.5 each show strong interaction.
Regression analysis: In Example 12.3, even though there are no numerical values attached to the levels of temperature (Factor A), concentration of formaldehyde (Factor C), and stirring rate (Factor D), we can still use regression to estimate a population-level model. Specifically, we can introduce the following variables with arbitrary numerical codings assigned:

• $x_1$ = temperature ($-1$ = low; $+1$ = high)
• $x_2$ = concentration of formaldehyde ($-1$ = low; $+1$ = high)
• $x_3$ = stirring rate ($-1$ = low; $+1$ = high).

With these values, we can fit the multiple linear regression model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon.$$
Doing so in R gives

(Intercept)     temp    conc    stir  temp:conc  temp:stir
     70.062   10.812   4.938   7.313     -9.062      8.312

Therefore, the estimated regression model for the filtration rate data is
$$\hat{Y} = 70.062 + 10.812x_1 + 4.938x_2 + 7.313x_3 - 9.062x_1x_2 + 8.312x_1x_3,$$
or, in other words,
$$\widehat{\text{FILT}} = 70.062 + 10.812\,\text{TEMP} + 4.938\,\text{CONC} + 7.313\,\text{STIR} - 9.062\,\text{TEMP*CONC} + 8.312\,\text{TEMP*STIR}.$$
Provided that our model assumptions hold, this fitted regression model can be used to estimate the mean filtration rate (or predict a future filtration rate) at specific combinations of temperature ($\pm 1$), concentration ($\pm 1$), and stirring rate ($\pm 1$).
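For instance, a minimal sketch of estimating the mean filtration rate at high temperature, low concentration, and high stirring rate, assuming the coded variables temp, conc, and stir were used in the fit (fit.3 is a hypothetical object name):

> fit.3 = lm(filtration ~ temp + conc + stir + temp:conc + temp:stir)
> predict(fit.3, data.frame(temp=1,conc=-1,stir=1))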