Inferential Statistics Last


Sampling Distribution and Inferential Statistics
Sampling Distributions
A sampling distribution is the probability distribution of a sample
statistic that is formed when samples of size n are repeatedly taken
from a population.

[Figure: repeated samples drawn from a population]

05/27/2020 Biostat for Postgraduate 2


If the sample statistic is the sample mean, then the distribution is
the sampling distribution of sample means.

[Figure: nine samples drawn from the population, Sample 1 to Sample 9, with sample means x̄1 to x̄9]

The sampling distribution consists of the values of the sample means x̄1, x̄2, x̄3, x̄4, x̄5, x̄6, x̄7, x̄8, x̄9.

In practice we do not take repeated samples from a population, i.e. we do not encounter sampling distributions empirically.

However, it is necessary to know their properties in order to draw statistical inferences.


Properties of sampling distribution
• Properties of Sampling Distributions of Sample Means
  • The mean of the sample means is equal to the population mean:
    $\mu_{\bar{x}} = \mu$
  • The standard deviation of the sample means is equal to the population standard deviation σ divided by the square root of n:
    $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
  • Hence, the sampling distribution of averages has a smaller variance than the underlying population.
  • The standard deviation of the sampling distribution of the sample means is called the standard error of the mean.


Sampling Distribution of Sample Means

Example:
The weights in kg of under-five children have the values {5, 10, 15, 20}.
These values are written on slips of paper and put in a hat.
a. Find the mean, standard deviation, and variance of the weights in the population.

Population values: 5, 10, 15, 20
$\mu = 12.5$
$\sigma^2 = 31.25$
$\sigma = \sqrt{31.25} \approx 5.59$


Sampling Distribution of Sample Means
Example continued:
Now assume that two slips are randomly selected, with replacement.
b. List all the possible samples of size n = 2 and calculate the mean of each.

Sample    Sample mean, x̄    Sample    Sample mean, x̄
5, 5      5                 15, 5     10
5, 10     7.5               15, 10    12.5
5, 15     10                15, 15    15
5, 20     12.5              15, 20    17.5
10, 5     7.5               20, 5     12.5
10, 10    10                20, 10    15
10, 15    12.5              20, 15    17.5
10, 20    15                20, 20    20

These means form the sampling distribution of the sample means.


Sampling Distribution of Sample Means

Example continued:
c. Now create the probability distribution of the sample means.

Probability Distribution of Sample Means
x̄       f    Probability
5       1    0.0625
7.5     2    0.1250
10      3    0.1875
12.5    4    0.2500
15      3    0.1875
17.5    2    0.1250
20      1    0.0625

The mean and variance of this distribution are 12.5 and 31.25/2 = 15.625, respectively.
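The enumeration in parts b and c can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative and exact fractions are used so that the mean and variance come out without rounding error.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [5, 10, 15, 20]  # weights in kg, from the example
n = 2                         # sample size, drawn with replacement

# Every ordered sample of size n is equally likely (4**2 = 16 samples).
samples = list(product(population, repeat=n))
means = [Fraction(sum(s), n) for s in samples]

# Frequency and probability of each distinct sample mean.
freq = Counter(means)
dist = {m: Fraction(f, len(samples)) for m, f in sorted(freq.items())}

mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in dist.items())

for m, p in dist.items():
    print(float(m), freq[m], float(p))
print("mean:", float(mean_of_means))     # 12.5
print("variance:", float(var_of_means))  # 15.625, i.e. 31.25 / 2
```

The variance of the sample means equals the population variance divided by n, which is the property stated earlier for the standard error.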


Sampling Distribution of Sample Means
Example continued:
d. Then graph the probability histogram for the sampling distribution.

[Figure: probability histogram of the sampling distribution; x-axis: sample mean x̄ (5, 7.5, 10, 12.5, 15, 17.5, 20); y-axis: probability P(x̄) (0.05 to 0.25)]

The shape of the graph is symmetric and bell shaped. It approximates a normal distribution.

The Central Limit Theorem
If samples of size n ≥ 30 are taken from a population with any type of
distribution that has mean μ and standard deviation σ,
the sample means will have an approximately normal distribution.

[Figure: a non-normal population distribution with mean μ and standard deviation σ, and the approximately normal distribution of the sample means x̄]

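The theorem can be checked by simulation. The sketch below is illustrative only: the choice of an exponential population (mean 1, standard deviation 1), the seed and the number of repetitions are assumptions made for the demonstration, not part of the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30       # sample size
reps = 5000  # number of repeated samples (illustrative)

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print("mean of sample means:", statistics.fmean(sample_means))
print("sd of sample means:  ", statistics.stdev(sample_means))
print("sigma / sqrt(n):     ", 1.0 / n ** 0.5)
```

The mean of the simulated sample means should be close to the population mean, and their standard deviation close to σ/√n, even though the population itself is far from normal.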
Central Limit theorem cont…
An illustration showing how the sample size determines
the shape of the sampling distribution

[Figure: a population distribution and the distributions of sample means for samples of different sizes]

The Central Limit Theorem cont..
If the population itself is normally distributed, with mean μ
and standard deviation σ,
the sample means will have a normal distribution for any
sample size n.

[Figure: a normal population distribution and the normal distribution of the sample means x̄]

The Central Limit Theorem cont..

• If the sample statistic is a proportion, then provided n is large the sample proportions will be approximately normally distributed with mean p and standard deviation $\sqrt{p(1-p)/n}$, called the standard error of the proportion.


The Mean and Standard Error
Example: The weights of one-year-old children in a certain region
are distributed with a mean weight of 8 kg and a standard
deviation of 0.7 kg. Samples of 38 children are randomly selected from the
population, and the mean of each sample is determined.

a. Find the mean and standard error of the mean of the sampling
distribution.

Mean: $\mu_{\bar{x}} = \mu = 8$
Standard deviation (standard error): $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{0.7}{\sqrt{38}} \approx 0.11$
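As a quick check of the arithmetic, the standard error can be computed directly. The helper name `standard_error` is hypothetical and only used for this sketch.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

mu, sigma, n = 8.0, 0.7, 38  # values from the example
se = standard_error(sigma, n)

print(f"mean of the sampling distribution = {mu} kg")
print(f"standard error = {se:.2f} kg")

# The standard error shrinks as the sample size grows.
for size in (10, 38, 100, 400):
    print(size, round(standard_error(sigma, size), 3))
```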
Interpreting the Central Limit Theorem
Example continued:

The mean of the sampling distribution is 8 kg, and the standard error of the sampling distribution is 0.11 kg.

From the Central Limit Theorem, because the sample size is greater than 30, the sampling distribution can be approximated by the normal distribution with $\mu_{\bar{x}} = 8$ and $\sigma_{\bar{x}} = 0.11$.

[Figure: normal curve of the sample mean x̄ centered at 8, with axis marks at 7.6, 8 and 8.4]

Important points about assumptions of normal distribution
We apply the normal distribution to a sample if:
• the sample is taken from a normally distributed population whose population standard deviation (σ) is known, or
• the sample size is large (i.e. greater than 30), so that we can use the Central Limit Theorem (CLT).
We use the Student t-distribution (t-statistic) provided that the following three conditions are satisfied:
• the sample is from a normally distributed population,
• the population variance is unknown, and
• the sample size is small, i.e. less than 30. (Note that we can also use the t-test even if n > 30 if we want to be more conservative!)

• Otherwise we use a non-parametric test.

Student’s t distributions cont…

Student’s t distributions are:
• Unimodal;
• Asymptotic to the horizontal axis;
• Symmetrical about zero;
• Dependent on n through the degrees of freedom (v = n - 1);
• Approximately the same as the standard normal distribution if the degrees of freedom v is large.

Estimation
Point estimation
• From a single sample, we can calculate a sample statistic to estimate a single parameter (a point estimate).

• The point estimate for the population mean µ is
  $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• The point estimate for the population proportion is given by
  $\hat{p} = \dfrac{x}{n}$
  where x is the total number of successes (events).

Point estimation cont…

Population parameter    Corresponding sample statistic
Mean μ                  x̄
Proportion P            p̂
Variance σ²             s²


Confidence Interval estimate

The problem with a point estimate is that estimates vary from sample to sample.

So we need to take into account the sample-to-sample variation of a statistic, and measure how precise our estimate is.

The standard error helps us to measure the variability of estimates and their precision.

Therefore, unlike point estimation, confidence interval (CI) estimation considers both precision and variability in parameter estimation.


Confidence interval …

• Formula: Estimate ± k × Standard Error, where k is the reliability coefficient.

• It gives a range of values that we believe contains the true value.

• It is always stated in terms of a level of confidence, 100(1 - α)%.

• The most commonly used is the 95% CI; however, 90% and 99% CIs are sometimes used.

• Therefore, a CI takes into account the sample-to-sample variation of the statistic and gives a measure of precision.


Confidence interval …
A (1 - α)100% confidence interval to estimate the population mean μ, if the sample is from a normal distribution and σ is known, is
$\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$

If σ is unknown and the sample is from a normal distribution (whether n is small or large), we use the t-distribution, substituting S for σ:
$\bar{x} \pm t_{\alpha/2,\,n-1} \dfrac{S}{\sqrt{n}}$

Of course, in this last situation, if n is large we can approximate it by the Z-distribution. However, if more accuracy is desired, we use the t-distribution.


Confidence interval …
• A (1 - α)100% confidence interval to estimate the population proportion p is
  $\hat{p} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
  where $\hat{p}$ is the sample proportion estimate.


Mean Example
• The blood lead levels, measured in μg/dl, for a sample of 88 individuals living in a region are as follows:

20,21, 22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,26,27,
27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,30,30,30,30,
30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,33,33,33,33,33,
33,33,34,34,34,34,35,35,35,35,36,36,36,36,36,37,37,37,37,38,38,39

• Construct the 90%, 95% and 98% confidence intervals for the population mean.
• Solution: First, the sample mean and standard deviation are calculated as 29.92 and 4.52, respectively.
Then we use the Z-distribution: $29.92 \pm z_{\alpha/2} \times 4.52/\sqrt{88}$, where $\sqrt{88} \approx 9.4$.
90% CI: 29.92 ± 1.645 × 4.52/9.4 = (29.13, 30.71)
95% CI: 29.92 ± 1.96 × 4.52/9.4 = (28.98, 30.86)
98% CI: 29.92 ± 2.33 × 4.52/9.4 = (28.80, 31.04)

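The three intervals can be reproduced from the reported summary statistics. This is a minimal sketch: the function name `z_interval` is hypothetical, and the critical values come from Python's `statistics.NormalDist` rather than from a printed table, so the 98% coefficient is about 2.326 instead of the rounded 2.33.

```python
import math
from statistics import NormalDist

def z_interval(mean, sd, n, level):
    """Two-sided z interval: mean +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Summary statistics as reported in the example.
x_bar, s, n = 29.92, 4.52, 88

for level in (0.90, 0.95, 0.98):
    lo, hi = z_interval(x_bar, s, n, level)
    print(f"{level:.0%} CI: ({lo:.2f}, {hi:.2f})")
```

A higher confidence level gives a wider interval, because the reliability coefficient grows while the standard error stays the same.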
CI for proportions
Example

In a study of childhood abuse in psychiatric patients, a researcher found that 166 in a sample of 947 patients reported histories of physical or sexual abuse.
a. Construct the 95% confidence interval.
Solution (a)
The 95% CI for P is given by

$\hat{p} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$= 0.175 \pm 1.96 \times \sqrt{\dfrac{0.175 \times 0.825}{947}}$
$= 0.175 \pm 1.96 \times 0.0124$
$= [0.151,\ 0.2]$
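The same interval can be computed from the raw counts. The sketch below assumes the normal-approximation formula used in the solution; the function name `proportion_interval` is hypothetical. Because it keeps the unrounded sample proportion 166/947, the last digit may differ slightly from a hand calculation that starts from 0.175.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, level=0.95):
    """Normal-approximation CI: p_hat +/- z_{alpha/2} * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

lo, hi = proportion_interval(166, 947)
print(f"p_hat = {166 / 947:.3f}")
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```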


Hypothesis Testing

• The formal process of hypothesis testing provides us with a means of answering research questions.

• A hypothesis is a testable statement that describes the nature of the proposed relationship between two or more variables of interest.

• The purpose of the study is to collect data which will allow the researcher to test the hypothesis.


Flow of hypothesis testing

[Figure: flow of hypothesis testing]


Null and Alternative hypotheses

• The null hypothesis (H0) is a statement about the value of the population parameter.

• It postulates that ‘there is no association between factor and outcome’ or ‘there is no intervention effect’.

• The alternative hypothesis (HA) states the ‘opposing’ view that ‘there is an association between factor and outcome’ or ‘there is an intervention effect’.

• Hypotheses are often stated in a null form, so as to allow them to be refuted.


Steps in hypothesis testing

1. Identify the null hypothesis H0 and the alternative hypothesis HA.
2. Choose α. The value should be small, usually 5%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic.
4. Compare the observed value of the statistic to the critical value obtained for the chosen α.
5. Make a decision.


Types of tests

1. $H_0: \mu = \mu_0 \ (P = P_0)$
   $H_A: \mu \neq \mu_0 \ (P \neq P_0)$, two-tailed test

2. $H_0: \mu = \mu_0 \ (P = P_0)$
   $H_A: \mu < \mu_0 \ (P < P_0)$, left-tailed test

3. $H_0: \mu = \mu_0 \ (P = P_0)$
   $H_A: \mu > \mu_0 \ (P > P_0)$, right-tailed test


Test Statistics

• A test statistic is a value we can compare with the known distribution of what we expect when the null hypothesis is true.

• The general formula of the test statistic is:

  $\text{Test statistic} = \dfrac{\text{Observed value} - \text{Hypothesized value}}{\text{Standard error of observed value}}$


Decision Rule Based on the test statistic
If the calculated test statistic is:

• in the rejection region, then we reject H0.
• Otherwise we fail to reject H0 and conclude that there is not enough evidence to reject it.

[Figure: standard normal curves showing the rejection regions. Left-tailed test: reject H0 when zcal falls in the left tail beyond the critical value. Right-tailed test: reject H0 when zcal falls in the right tail beyond the critical value. Two-tailed test: reject H0 when zcal falls beyond the critical value in either tail. In the central region, fail to reject H0.]

Decision ……..
Reject the null hypothesis if:
zcal < -ztab, for a left-tailed test

zcal > ztab, for a right-tailed test

|zcal| > ztab, for a two-tailed test

where ztab is the positive tabulated critical value for the chosen α.
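The decision rule can be written as a small function. This is a sketch under stated assumptions: the function name `reject_h0` and the tail labels "left", "right" and "two" are hypothetical, and the critical value is taken from the standard normal distribution (α in one tail for one-tailed tests, α/2 in each tail for a two-tailed test).

```python
from statistics import NormalDist

def reject_h0(z_cal: float, alpha: float = 0.05, tail: str = "two") -> bool:
    """Return True when z_cal falls in the rejection region.

    tail is "left", "right" or "two" (labels chosen for this sketch).
    """
    std_normal = NormalDist()
    if tail == "left":
        return z_cal < -std_normal.inv_cdf(1.0 - alpha)
    if tail == "right":
        return z_cal > std_normal.inv_cdf(1.0 - alpha)
    if tail == "two":
        return abs(z_cal) > std_normal.inv_cdf(1.0 - alpha / 2.0)
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(reject_h0(2.14, tail="two"))    # True: 2.14 > 1.96
print(reject_h0(-1.58, tail="left"))  # False: -1.58 is not below -1.645
```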


Types of errors
There are 2 types of errors

Type of decision      H0 true             H0 false
Reject H0             Type I error        Correct decision
Fail to reject H0     Correct decision    Type II error

• Type I error is often considered the more serious error; however, this should depend on the question.
• We need to think about individual and population health implications and costs.


Type I and Type II errors
• Power is the probability of rejecting a false null hypothesis, and it is given by 1 - β, where β is the probability of a Type II error.

• The level of significance (α) is the probability of rejecting the null hypothesis when it is true, i.e. the probability of a Type I error.
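To make the link between α, β and power concrete, the sketch below computes the power of a right-tailed one-sample z-test with known σ, using the standard result power = 1 - Φ(z_α - (μ1 - μ0)√n/σ). The function name and the numbers (true mean 105 against a hypothesized 100) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def power_right_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    shift = (mu1 - mu0) * math.sqrt(n) / sigma
    return 1.0 - std_normal.cdf(z_crit - shift)

# Hypothetical numbers: H0 mean 100, true mean 105, sigma 10, n = 16.
power = power_right_tailed_z(100, 105, 10, 16)
beta = 1.0 - power
print(f"power = {power:.3f}, beta = {beta:.3f}")

# When the true mean equals the hypothesized mean, the probability of
# rejecting H0 is the significance level alpha (a Type I error).
print(power_right_tailed_z(100, 100, 10, 16))
```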


P-value
• In most applications, the outcome of performing a hypothesis test is to produce a p-value.

• The p-value (or probability value) of a hypothesis test is the probability, assuming the null hypothesis is true, of obtaining a sample statistic with a value as extreme or more extreme than the one determined from the sample data.

• Equivalently, a p-value is the probability of getting the observed difference, or a more extreme one, in the sample purely by chance from a population where the true difference is zero.


The P-Value…
The p-value is the area beyond the calculated test statistic in the tail of the normal curve for a one-tailed test, and twice this area for a two-tailed test.

But for what values of the p-value should we reject the null hypothesis?

By convention, a p-value of 0.05 or smaller is considered sufficient evidence for rejecting the null hypothesis.

By using a cut-off of 0.05, we are allowing a 5% chance of wrongly rejecting the null hypothesis when it is in fact true.

When the p-value is less than 0.05, we often say that the result is statistically significant.

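The tail-area definition translates directly into code. A minimal sketch, assuming a z statistic and the standard normal distribution; the function name `p_value` and the tail labels are hypothetical.

```python
from statistics import NormalDist

def p_value(z_cal: float, tail: str = "two") -> float:
    """Tail area beyond z_cal under the standard normal curve."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z_cal)
    if tail == "right":
        return 1.0 - std_normal.cdf(z_cal)
    if tail == "two":
        # Twice the one-tail area beyond |z_cal|.
        return 2.0 * (1.0 - std_normal.cdf(abs(z_cal)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(round(p_value(1.96), 3))  # 0.05
print(p_value(2.14) < 0.05)     # True
```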
P-value and confidence interval

• Confidence intervals and hypothesis testing (p-values) are based upon the same theory and mathematics, and

• both will lead to the same conclusion about whether a population difference exists.

• However, confidence intervals give information about precision, the size of the difference in the population, and the significance of the test.
• They also (very usefully) indicate the amount of uncertainty remaining about the size of the difference.
• Moreover, the p-value tells us the extent of significance of the test.


Example on Hypothesis Testing

• A researcher claims that the mean IQ of 16 students, whose scores are normally distributed, is 110, while the expected value for the whole population is 100 with a standard deviation of 10. Test the hypothesis.
Solution
H0: µ = 100
vs (Step 1)
HA: µ ≠ 100
Assume α = 0.05 (Step 2)
$Z = \dfrac{110 - 100}{10/\sqrt{16}} = \dfrac{(110 - 100) \times 4}{10} = 4$ (Step 3)
The critical Z value at α/2 = 0.025 is 1.96, and 4 ≥ 1.96 (Step 4)
Conclusion: Reject the null hypothesis (Step 5)

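The five steps of this example map onto a few lines of code. This is a sketch that assumes, as the solution does, that the population standard deviation of 10 is known so that a z-test applies; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Steps 1 and 2: H0: mu = 100 vs HA: mu != 100, alpha = 0.05.
x_bar, mu0, sigma, n = 110, 100, 10, 16  # values from the example
alpha = 0.05

# Step 3: observed value of the test statistic.
z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))

# Step 4: two-tailed critical value, alpha / 2 in each tail.
z_tab = NormalDist().inv_cdf(1 - alpha / 2)

# Step 5: decision.
print(f"z_cal = {z_cal:.2f}, z_tab = {z_tab:.2f}")
print("reject H0" if abs(z_cal) > z_tab else "fail to reject H0")
```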
Hypothesis testing for proportions
Example

In a study of childhood abuse in psychiatric patients, a researcher found that 166 in a sample of 947 patients reported histories of physical or sexual abuse.
a. Test the hypothesis that the true population proportion is 30%.
Solution (a)
To test the hypothesis we need to follow these steps:
Step 1: State the hypotheses
H0: P = P0 = 0.3
HA: P ≠ P0 (P ≠ 0.3)
Step 2: Fix the level of significance (α = 0.05)
Step 3: Compute the calculated and tabulated values of the test statistic


Example cont…

$z_{cal} = \dfrac{\hat{p} - P_0}{\sqrt{\hat{p}(1-\hat{p})/n}} = \dfrac{0.175 - 0.3}{\sqrt{0.175 \times 0.825/947}} = \dfrac{-0.125}{0.01235} \approx -10.12$
$z_{tab} = 1.96$
Step 4: Comparison of the calculated and tabulated values of the test statistic

Since the absolute calculated value (10.12) is larger than the tabulated value (1.96), we reject the null hypothesis.

Step 5: Conclusion
Hence we conclude that the proportion of childhood abuse in psychiatric patients is different from 0.3.

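The calculation can be checked as follows. The sketch computes the statistic with the sample proportion in the standard error, as in the formula shown for this example, and also with the hypothesized proportion P0, which is a common alternative not used in the slide. Small differences from the hand calculation come from rounding the sample proportion to 0.175.

```python
import math

x, n, p0 = 166, 947, 0.30  # counts and hypothesized proportion from the example
p_hat = x / n

se_sample = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error from the sample proportion
se_null = math.sqrt(p0 * (1 - p0) / n)          # alternative: standard error under H0

z_sample = (p_hat - p0) / se_sample
z_null = (p_hat - p0) / se_null

print(f"z with sample-based SE: {z_sample:.2f}")
print(f"z with null-based SE:   {z_null:.2f}")
print("reject H0" if abs(z_sample) > 1.96 and abs(z_null) > 1.96 else "check")
```

Both versions fall far into the rejection region, so the conclusion does not depend on which standard error is used here.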
Example: one-tailed test
• A gynecologist says that girls at birth average less than 51 cm in length.
• His colleague Judge reproaches him that his claim is based on a prejudice, and that the average length is indeed 51 cm.
• They draw a sample of 100 girls and obtain the summary results: mean 50.8 and variance 1.6.


Testing problem

The choice of H1 here reflects the assertion of the first gynecologist.

$H_0: \mu = 51$   Null hypothesis

$H_1: \mu < 51$   Alternative hypothesis

One-sided test


Test statistic
• We now substitute the sample values and find:

  $\dfrac{\bar{x} - 51}{\sqrt{s^2/n}} = \dfrac{50.8 - 51}{\sqrt{1.6/100}} = -1.58$

• Rejection region: $]-\infty, -1.645]$. Since -1.58 > -1.645, the test statistic is not in the rejection region.
• Conclusion: At a significance level of 5%, we fail to reject H0; the data do not provide sufficient evidence that the mean length of girls at birth is less than 51 cm.

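A left-tailed test differs from the two-tailed case only in where the critical value sits. The sketch below reworks this example; it assumes, as the slide does, that the normal critical value is adequate for n = 100, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar, mu0, s2, n = 50.8, 51.0, 1.6, 100  # values from the example
alpha = 0.05

z_cal = (x_bar - mu0) / math.sqrt(s2 / n)
z_crit = NormalDist().inv_cdf(alpha)  # lower-tail critical value, about -1.645

print(f"z_cal = {z_cal:.2f}, critical value = {z_crit:.3f}")
print("reject H0" if z_cal < z_crit else "fail to reject H0")
```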
Example: Two-tailed test

• At a certain university, an intelligence test has been given for many years; the scores are normally distributed with a mean of 115 (μ = 115 under H0).

• An administrator now wants to test, for the new class, the hypothesis that the mean is the same as in previous years.
• He takes a sample of size n = 50 and finds a mean of $\bar{x} = 118$ and a variance of $S^2 = 98$.



Testing problem

The choice of H1 is determined by the consideration that the administrator has no idea whether the new class is better or worse than the previous ones.

$H_0: \mu = 115$   Null hypothesis

$H_1: \mu \neq 115$   Alternative hypothesis

Two-tailed test


Rejection region (two-sided test)

$\dfrac{\bar{x} - 115}{\sqrt{s^2/n}} = \dfrac{118 - 115}{\sqrt{98/50}} = 2.14$   The test statistic

2.14 > 1.96: the administrator rejects H0 at significance level 0.05.

Rejection regions: $]-\infty, -1.96]$ and $[1.96, +\infty[$. Acceptance region: between -1.96 and 1.96. The test statistic 2.14 lies in the upper rejection region.

Example 2: Two-tailed test

• In a sample of 1000 women aged 50 to 54 whose mothers had breast cancer, 40 were found to have breast cancer.

• Suppose that the overall prevalence of breast cancer in women of that age (regardless of their family history) is 2%.


Example: Rejection region

$\dfrac{\hat{P} - \pi_{H_0}}{\sqrt{\pi_{H_0}(1 - \pi_{H_0})/1000}} = \dfrac{0.04 - 0.02}{\sqrt{0.02 \times 0.98/1000}} = 4.52$

Since 4.52 is in the rejection region (4.52 > 1.96), we reject the null hypothesis at the 5% significance level.

Rejection regions: below -1.96 and above 1.96. Acceptance region: between -1.96 and 1.96.


Example: p-value

The answer expressed with a p-value is as follows:

p-value ≈ 0.000 (i.e. p < 0.001)

The result is significant.
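The test statistic and p-value for the breast cancer example can be reproduced as follows. This is a minimal sketch that uses the hypothesized prevalence of 2% in the standard error, as the example's formula does; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x, n, p0 = 40, 1000, 0.02  # cases, sample size, hypothesized prevalence
p_hat = x / n

z_cal = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z_cal)))

print(f"z_cal = {z_cal:.2f}")  # 4.52
print(f"two-sided p-value = {p_two_sided:.1e}")
print("significant at 5%" if p_two_sided < 0.05 else "not significant at 5%")
```

The p-value is far below 0.001, which is why it appears as 0.000 when rounded to three decimal places.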
