Chapter 2
week collected the following data. The temperatures for the five days of the week in the three cities were:
City 1: 25 24 23 26 17
City 2: 22 21 24 22 20
City 3: 32 27 35 24 28
Which city has the most consistent temperature, based on these data?
Step 1
We are given the temperatures for the five days of the week in the three cities. To find which city has the most consistent temperature based on the given data, we use the coefficient of variation (C.V.).
Coefficient of Variation (C.V.): the ratio of the standard deviation to the mean, usually expressed as a percentage:
$$C.V. = \frac{S}{\bar{X}} \times 100\%$$
where S is the standard deviation, $\bar{X}$ is the mean and n is the sample size:
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}, \qquad \bar{X} = \frac{\sum X_i}{n}$$
The distribution with the smaller C.V. is said to be less variable, or more consistent.
Step 2
For City 1:
$$\bar{X}_1 = \frac{\sum X_i}{n} = \frac{115}{5} = 23$$
$$S_1 = \sqrt{\frac{(25-23)^2+(24-23)^2+(23-23)^2+(26-23)^2+(17-23)^2}{5-1}} = \sqrt{\frac{50}{4}} = \sqrt{12.5} = 3.5355$$
For City 2:
$$\bar{X}_2 = \frac{\sum X_i}{n} = \frac{109}{5} = 21.8$$
$$S_2 = \sqrt{\frac{(22-21.8)^2+(21-21.8)^2+(24-21.8)^2+(22-21.8)^2+(20-21.8)^2}{5-1}} = \sqrt{\frac{8.8}{4}} = \sqrt{2.2} = 1.4832$$
For City 3:
$$\bar{X}_3 = \frac{\sum X_i}{n} = \frac{146}{5} = 29.2$$
$$S_3 = \sqrt{\frac{(32-29.2)^2+(27-29.2)^2+(35-29.2)^2+(24-29.2)^2+(28-29.2)^2}{5-1}} = \sqrt{\frac{74.8}{4}} = \sqrt{18.7} = 4.3243$$
Step 3
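For City 1:
$$C.V_1 = \frac{S_1}{\bar{X}_1} \times 100\% = \frac{3.5355}{23} \times 100\% \approx 15.37\%$$
For City 2:
$$C.V_2 = \frac{S_2}{\bar{X}_2} \times 100\% = \frac{1.4832}{21.8} \times 100\% \approx 6.80\%$$
For City 3:
$$C.V_3 = \frac{S_3}{\bar{X}_3} \times 100\% = \frac{4.3243}{29.2} \times 100\% \approx 14.81\%$$
City 2 has the smallest C.V., so its temperature is the most consistent of the three.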
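The three steps above can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the helper name `coefficient_of_variation` is an illustrative choice, not something from the text.

```python
from statistics import mean, stdev

# Temperatures for the five days, as given in the question.
temps = {
    "City 1": [25, 24, 23, 26, 17],
    "City 2": [22, 21, 24, 22, 20],
    "City 3": [32, 27, 35, 24, 28],
}

def coefficient_of_variation(values):
    """Sample standard deviation (n - 1 divisor) over the mean, in percent."""
    return stdev(values) / mean(values) * 100

cv = {city: coefficient_of_variation(v) for city, v in temps.items()}

# The smallest C.V. marks the most consistent city.
most_consistent = min(cv, key=cv.get)

for city, value in cv.items():
    print(f"{city}: C.V. = {value:.2f}%")
print("Most consistent:", most_consistent)
```

`statistics.stdev` uses the n - 1 divisor, which matches the formula for S used in Step 1.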
CHAPTER TWO
2. Summarizing of data
Objectives:
To comprehend the data easily
To facilitate comparison
To make further statistical analysis
2.2 Types of Measures of Central Tendency
2.2.1 Mean
The Arithmetic Mean
The arithmetic mean is defined as the sum of the magnitudes of the items divided by the number of items. The mean of $x_1, x_2, x_3, \dots, x_n$ is denoted by A.M., m or $\bar{x}$ and is given by:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n is the sample size.
If we take an entire population, the mean is denoted by μ and is given by:
$$\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$
where N is the population size.
If $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times, …, and $x_k$ occurs $f_k$ times, then the mean will be
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
where k is the number of classes and $\sum_{i=1}^{k} f_i = n$.
The formula for the arithmetic mean for data of this type is
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$$
where $x_i$ is the class mark of the i-th class and $f_i$ is the frequency of the i-th class.
Example: The following frequency table gives the height (in inches) of 100 students
in a college.
Class Interval (CI)   60-62   62-64   64-66   66-68   68-70   70-72   Total
Frequency (f)         5       18      42      20      8       7       100
Calculate the mean
Solution:
The formula to be used for the mean is as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$$
Let us calculate these values and put them in a table for the sake of convenience.
Class Interval (CI)   60-62   62-64   64-66   66-68   68-70   70-72   Total
Frequency (f)         5       18      42      20      8       7       100
Mid-Point (x_i)       61      63      65      67      69      71
f_i x_i               305     1134    2730    1340    552     497     6558
Substituting these values with $\sum_{i=1}^{6} f_i = 100$, we get
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{6558}{100} = 65.58$$
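The table-based calculation can be reproduced in a few lines. The sketch below assumes the class mark is the midpoint of each class interval, as in the table; the variable names are illustrative.

```python
# Class intervals and frequencies from the height example.
class_intervals = [(60, 62), (62, 64), (64, 66), (66, 68), (68, 70), (70, 72)]
frequencies = [5, 18, 42, 20, 8, 7]

# Class mark (mid-point) of each interval.
midpoints = [(lo + hi) / 2 for lo, hi in class_intervals]

# Sum of f_i * x_i and total frequency n.
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
n = sum(frequencies)

grouped_mean = total_fx / n
print(total_fx, n, grouped_mean)
```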
2. The sum of the squared deviations of a set of items from their mean is the minimum, i.e.
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2, \quad \bar{x} \ne A$$
a) The mean of $x_1 \pm k, x_2 \pm k, \dots, x_n \pm k$ will be $\bar{x} \pm k$.
b) The mean of $kx_1, kx_2, \dots, kx_n$ will be $k\bar{x}$.
4. If $\bar{X}_1$ is the mean of $n_1$ observations, $\bar{X}_2$ is the mean of $n_2$ observations, …, and $\bar{X}_k$ is the mean of $n_k$ observations, then the mean of all the observations in all groups, often called the combined mean, is given by:
$$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2 + \dots + \bar{X}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{X}_i n_i}{\sum_{i=1}^{k} n_i}$$
Example: If the mean mark of one class of 50 students is 30 and the mean mark of another class of 100 students is 40, what is the mean of all 150 students?
Solution: Based on the above formula, it is (50*30 + 100*40)/(50 + 100) = 5500/150 ≈ 36.7.
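As a quick check of the combined mean, here is a minimal sketch; the function name `combined_mean` is a hypothetical helper, not part of the notes.

```python
def combined_mean(means, sizes):
    """Weight each group mean by its group size and divide by the total size."""
    total = sum(m * n for m, n in zip(means, sizes))
    return total / sum(sizes)

# Two classes: 50 students with mean 30 and 100 students with mean 40.
result = combined_mean([30, 40], [50, 100])
print(round(result, 1))
```

Note that the combined mean is pulled toward 40 because the larger class carries more weight; a plain average of 30 and 40 would give 35.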
5. If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using:
$$\text{Correct Mean} = \text{Wrong Mean} + \frac{\text{Correct Value} - \text{Wrong Value}}{n}$$
where n is the total number of observations.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Solution:
$$\text{Correct Mean} = \text{Wrong Mean} + \frac{\text{Correct Value} - \text{Wrong Value}}{n} = 65 + \frac{80 - 40}{10} = 65 + 4 = 69 \text{ kg}$$
6. The effect of transforming the original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean + k (or the old mean - k), respectively.
b) If every observation is multiplied by a constant k, then the new mean will be k*old mean.
Example:
1. The mean of n Tetracycline capsules X1, X2, …, Xn is known to be 12 gm. A new set of capsules of another drug is obtained by the linear transformation Yi = 2Xi - 0.5 (i = 1, 2, …, n). What will be the mean of the new set of capsules?
Solution:
New Mean = 2*Old Mean - 0.5 = 2*12 - 0.5 = 23.5
2. The mean of a set of numbers is 500. a) If 10 is added to each of the numbers in the set, what will be the mean of the new set? b) If each of the numbers in the set is multiplied by -5, what will be the mean of the new set?
Solutions:
a) New Mean = Old Mean + 10 = 500 + 10 = 510
b) New Mean = -5*Old Mean = -5*500 = -2500
Weighted Mean
When different items are to be given different importance, a weighted mean is appropriate. Weights are assigned to each item in proportion to its relative importance.
Let $X_1, X_2, \dots, X_n$ be the values of the items of a series and $W_1, W_2, \dots, W_n$ their corresponding weights; then the weighted mean, denoted $\bar{X}_w$, is defined as:
$$\bar{X}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$
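Example:
A student obtained the following percentages in an examination: Statistics 60, Biology 75, Mathematics 63, Physics 59, and Chemistry 55. Find the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
Solution:
$$\bar{X}_w = \frac{\sum X_i W_i}{\sum W_i} = \frac{60(1) + 75(2) + 63(1) + 59(3) + 55(3)}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$$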
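The weighted mean for this example can be computed directly from the definition. The dictionary layout below is just one convenient way to pair marks with weights and is an illustrative choice.

```python
# Marks and weights from the examination example.
marks = {"Statistics": 60, "Biology": 75, "Mathematics": 63, "Physics": 59, "Chemistry": 55}
weights = {"Statistics": 1, "Biology": 2, "Mathematics": 1, "Physics": 3, "Chemistry": 3}

# Sum of X_i * W_i divided by the sum of the weights.
weighted_sum = sum(marks[s] * weights[s] for s in marks)
weighted_mean = weighted_sum / sum(weights.values())
print(weighted_sum, weighted_mean)
```

For comparison, the unweighted mean of the five marks is 62.4; the heavier weights on Physics and Chemistry pull the weighted mean down.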
The Geometric Mean
If the observed values are measured as ratios, proportions or percentages and the series of observations contains one or more unusually large values, the geometric mean gives a better measure of central tendency than the other means.
The geometric mean of a set of n observations is the nth root of their product.
The geometric mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by G.M. and given by:
$$G.M. = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$$
Taking the logarithms of both sides:
$$\log(G.M.) = \log\left(\sqrt[n]{X_1 \cdot X_2 \cdots X_n}\right) = \log\left((X_1 \cdot X_2 \cdots X_n)^{1/n}\right)$$
$$\Rightarrow \log(G.M.) = \frac{1}{n}\log(X_1 \cdot X_2 \cdots X_n) = \frac{1}{n}\left(\log X_1 + \log X_2 + \dots + \log X_n\right)$$
$$\Rightarrow \log(G.M.) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$$
⇒ The logarithm of the G.M. of a set of observations is the arithmetic mean of their logarithms.
$$G.M. = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right)$$
Example: Find the G.M. of the numbers 2, 4, 8.
Solution:
$$G.M. = \sqrt[n]{X_1 \cdot X_2 \cdots X_n} = \sqrt[3]{2 \cdot 4 \cdot 8} = \sqrt[3]{64} = 4$$
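The two routes to the geometric mean (nth root of the product, and antilog of the mean logarithm) can be compared numerically. This sketch uses base-10 logarithms as an assumption; any base gives the same result. `statistics.geometric_mean` is available in Python 3.8 and later.

```python
import math
from statistics import geometric_mean

values = [2, 4, 8]
n = len(values)

# Route 1: nth root of the product.
gm_from_root = math.prod(values) ** (1 / n)

# Route 2: antilog of the arithmetic mean of the logarithms.
log_mean = sum(math.log10(x) for x in values) / n
gm_from_logs = 10 ** log_mean

# Library cross-check.
gm_library = geometric_mean(values)

print(gm_from_root, gm_from_logs, gm_library)
```

The results agree up to floating-point rounding, so compare them with a tolerance rather than with exact equality.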
The Harmonic Mean
The harmonic mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by H.M. and given by:
$$H.M. = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$$
This is called the simple harmonic mean.
If observations $X_1, X_2, \dots, X_n$ have weights $W_1, W_2, \dots, W_n$ respectively, then their harmonic mean is given by
$$H.M. = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$$
This is called the weighted harmonic mean.
Remark: The harmonic mean is useful and appropriate in finding average speeds and average rates.
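To illustrate the remark about average speeds, consider a hypothetical trip (the numbers are invented for illustration and do not come from the notes): two legs of equal length, driven at 40 km/h and 60 km/h. The true average speed is total distance over total time, and it matches the simple harmonic mean.

```python
from statistics import harmonic_mean

speeds = [40, 60]          # km/h on each leg (hypothetical values)
distance_per_leg = 120     # km, the same for both legs (hypothetical value)

# Average speed from first principles: total distance / total time.
total_time = sum(distance_per_leg / v for v in speeds)
average_speed = len(speeds) * distance_per_leg / total_time

# Simple harmonic mean from the formula n / sum(1 / X_i).
hm = len(speeds) / sum(1 / v for v in speeds)

print(average_speed, hm, harmonic_mean(speeds))
```

The arithmetic mean of the two speeds would be 50 km/h, which overstates the average because more time is spent on the slower leg.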
2.2.2 Mode
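If data are given in the form of a continuous frequency distribution, the mode is defined as:
$$\text{Mode} = L_{mo} + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w$$
where
$L_{mo}$ = the lower class boundary of the modal class
$w$ = the size (width) of the modal class
$\Delta_1 = f_{mo} - f_1$
$\Delta_2 = f_{mo} - f_2$
$f_{mo}$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class following the modal class
Note: The modal class is the class with the highest frequency.
C.I     1-5   6-10   11-15   16-20   21-25   26-30   31-35   Total
Freq.   4     8      12      6       3       4       3       40
Solution: By inspection, the mode lies in the third class, where $L_{mo}$ = 10.5, $f_{mo}$ = 12, $f_1$ = 8, $f_2$ = 6 and w = 5.
Using the formula, the mode is:
$$\text{Mode} = L_{mo} + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w = 10.5 + \frac{12 - 8}{(12 - 8) + (12 - 6)} \times 5 = 10.5 + 2 = 12.5$$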
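The mode formula for a continuous frequency distribution can be sketched as a small function. The lower class boundaries below assume classes 1-5, 6-10, …, 31-35 with width 5, which is consistent with L = 10.5 and w = 5 in the solution; the function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_boundary, width, f_modal, f_before, f_after):
    """Mode = L + d1 / (d1 + d2) * w for the modal class."""
    d1 = f_modal - f_before
    d2 = f_modal - f_after
    return lower_boundary + d1 / (d1 + d2) * width

frequencies = [4, 8, 12, 6, 3, 4, 3]
lower_boundaries = [0.5, 5.5, 10.5, 15.5, 20.5, 25.5, 30.5]  # assumed boundaries
width = 5

# Modal class: the class with the highest frequency.
m = frequencies.index(max(frequencies))

# This simple indexing assumes the modal class is neither the first nor the last class.
mode = grouped_mode(lower_boundaries[m], width,
                    frequencies[m], frequencies[m - 1], frequencies[m + 1])
print(mode)
```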
Demerits:
Often its value is not unique.
It is not based on all observations
Mode may not exist in the series.
It is not suitable for further mathematical treatment.
2.2.3 Median
In a distribution, the median is the value of the variable which divides it into two equal parts. In an ordered series of data, the median is the observation lying exactly in the middle of the series. It is the middlemost value in the sense that the number of values less than the median is equal to the number of values greater than it. It is denoted by $\tilde{x}$.
If X1, X2, …, Xn are the observations, then the numbers arranged in ascending order will be X[1], X[2], …, X[n], where X[i] is the i-th smallest value, i.e. X[1] ≤ X[2] ≤ … ≤ X[n].
Median for ungrouped data
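$$\tilde{x} = \begin{cases} X_{[(n+1)/2]}, & \text{if } n \text{ is odd} \\ \dfrac{X_{[n/2]} + X_{[n/2+1]}}{2}, & \text{if } n \text{ is even} \end{cases}$$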
Solution:
a. The data in ascending order are:
-5 0 1 2 4 5 6 8 10 15
n = 10, so n is even. The two middle values are the 5th and 6th observations, so the median is
$$\tilde{x} = \frac{\left(\frac{10}{2}\right)^{\text{th}} + \left(\frac{10}{2} + 1\right)^{\text{th}} \text{ value}}{2} = \frac{5^{\text{th}} + 6^{\text{th}}}{2} = \frac{4 + 5}{2} = 4.5$$
b. The data in ascending order are:
1 2 2 3 4 5 8
n = 7 is odd, so the middle value is the 4th observation. So the median is 3.
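The positional rule used in both parts of the solution can be sketched as follows. The function name `median_by_position` is illustrative; `statistics.median` from the standard library is used only as a cross-check.

```python
from statistics import median

def median_by_position(values):
    """Sort the data, then take the middle value (odd n) or average the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)-th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the (n/2)-th and (n/2 + 1)-th values

a = [-5, 0, 1, 2, 4, 5, 6, 8, 10, 15]
b = [1, 2, 2, 3, 4, 5, 8]

print(median_by_position(a), median_by_position(b))
print(median(a), median(b))
```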
Median for grouped data
If data are given in the form of a continuous frequency distribution, the median is defined as:
$$\text{Median} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - CF\right)$$
where: $L_{med}$ = the lower class boundary of the median class;
w = the class width of the median class;
$f_{med}$ = the frequency of the median class; and
CF = the less than cumulative frequency corresponding to the class preceding the median class.
Remark: The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{n}{2}$.
C.I     1-5   6-10   11-15   16-20   21-25   26-30   31-35   Total
Freq.   4     8      12      6       3       4       3       40
Solution: Construct the less than cumulative frequency distribution:
C.I           1-5   6-10   11-15   16-20   21-25   26-30   31-35   Total
Freq.         4     8      12      6       3       4       3       40
Cum. Freq.    4     12     24      30      33      37      40
Since n = 40, n/2 = 20, and the smallest cumulative frequency greater than or equal to 20 is 24; thus, the median class is the third class. For this class, L = 10.5, w = 5, $f_{med}$ = 12 and CF = 12, so
$$\text{Median} = 10.5 + \frac{5}{12}(20 - 12) \approx 13.83$$
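The search for the median class and the interpolation step can be sketched in code. The lower class boundaries assume classes 1-5, 6-10, …, 31-35 (so the third class starts at 10.5); the function name `grouped_median` is hypothetical.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, width, frequencies):
    """Median = L + (w / f) * (n/2 - CF) for the median class."""
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))  # less-than cumulative frequencies
    # Median class: first class whose cumulative frequency reaches n / 2.
    k = next(i for i, cf in enumerate(cumulative) if cf >= n / 2)
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_boundaries[k] + width / frequencies[k] * (n / 2 - cf_before)

frequencies = [4, 8, 12, 6, 3, 4, 3]
lower_boundaries = [0.5, 5.5, 10.5, 15.5, 20.5, 25.5, 30.5]  # assumed boundaries

grouped_med = grouped_median(lower_boundaries, 5, frequencies)
print(round(grouped_med, 2))
```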
Quantiles are measures which divide a given set of data into approximately equal subdivisions and are obtained by the same procedure as the median. They are averages of position (non-central tendency). Some of these are quartiles, deciles and percentiles.
Quartiles: values which divide the data set into approximately four equal parts, denoted by $Q_1$, $Q_2$ and $Q_3$. The first quartile ($Q_1$) is also called the lower quartile and the third quartile ($Q_3$) is the upper quartile. The second quartile ($Q_2$) is the median.
That is, after arranging the data in ascending order, $Q_1$, $Q_2$ and $Q_3$ are obtained by:
$$Q_1 = \left(\frac{1(n+1)}{4}\right)^{\text{th}} \text{ value}, \quad Q_2 = \left(\frac{2(n+1)}{4}\right)^{\text{th}} \text{ value} \quad \text{and} \quad Q_3 = \left(\frac{3(n+1)}{4}\right)^{\text{th}} \text{ value}.$$
For data arranged in a frequency distribution, we follow the same procedure as for the median. That is, we construct the less than cumulative frequency distribution and apply the formula of quartiles for individual series. For continuous data:
$$Q_1 = L + \frac{w}{f_{Q_1}}\left(\frac{n}{4} - CF\right), \quad Q_2 = L + \frac{w}{f_{Q_2}}\left(\frac{2n}{4} - CF\right) \quad \text{and} \quad Q_3 = L + \frac{w}{f_{Q_3}}\left(\frac{3n}{4} - CF\right)$$
Deciles: values dividing the data into approximately ten equal parts, denoted by $D_1, D_2, \dots, D_9$.
Let $x_1, x_2, \dots, x_n$ be n ordered observations. The i-th decile ($D_i$) is the value of the item corresponding to the [i(n+1)/10]th position, i = 1, 2, …, 9.
That is, after arranging the data in ascending order, $D_1, D_2, \dots, D_9$ are obtained by:
$$D_1 = \left(\frac{1(n+1)}{10}\right)^{\text{th}} \text{ value}, \quad D_2 = \left(\frac{2(n+1)}{10}\right)^{\text{th}} \text{ value}, \ \dots \ \text{and} \quad D_9 = \left(\frac{9(n+1)}{10}\right)^{\text{th}} \text{ value}.$$
For data arranged in a frequency distribution, we follow the same procedure as for the median. That is, we construct the less than cumulative frequency distribution and apply the formula of deciles for individual series.
For continuous data, apply the following formula and follow the procedure used for quartiles:
$$D_i = L + \frac{w}{f_{D_i}}\left(\frac{i\,n}{10} - CF\right), \quad i = 1, 2, \dots, 9.$$
The symbols are defined in the same way as in the case of quartiles for continuous data.
Percentiles: values which divide the data into approximately one hundred equal parts, denoted by $P_1, P_2, \dots, P_{99}$.
Let $x_1, x_2, \dots, x_n$ be n ordered observations. The i-th percentile ($P_i$) is the value of the item corresponding to the [i(n+1)/100]th position, i = 1, 2, …, 99.
That is, after arranging the data in ascending order, $P_1, P_2, \dots, P_{99}$ are obtained by:
$$P_1 = \left(\frac{1(n+1)}{100}\right)^{\text{th}} \text{ value}, \quad P_2 = \left(\frac{2(n+1)}{100}\right)^{\text{th}} \text{ value}, \ \dots \ \text{and} \quad P_{99} = \left(\frac{99(n+1)}{100}\right)^{\text{th}} \text{ value}.$$
For data arranged in a frequency distribution, we follow the same procedure as for the median. That is, we construct the less than cumulative frequency distribution and apply the formula of percentiles for individual series.
For continuous data, apply the following formula:
$$P_i = L + \frac{w}{f_{P_i}}\left(\frac{i\,n}{100} - CF\right), \quad i = 1, 2, \dots, 99.$$
The symbols are defined in the same way as in the case of quartiles or deciles for continuous data.
Interpretations
1. $Q_i$ is the value below which (i×25)% of the observations in the series are found (where i = 1, 2, 3). For instance, $Q_3$ is the value below which 75% of the observations in the given series are found.
2. $D_i$ is the value below which (i×10)% of the observations in the series are found (where i = 1, 2, …, 9). For instance, $D_4$ is the value below which 40% of the values in the series are found.
3. $P_i$ is the value below which i percent of the total observations are found (where i = 1, 2, 3, …, 99). For example, 60 percent of the observations in a given series are below $P_{60}$.
Example: The marks of 50 students out of 85 are given below. Based on the data, find $Q_1$, $D_4$ and $P_7$.
Marks   46-50   51-55   56-60   61-65   66-70   71-75   76-80
fi      4       8       15      5       9       5       4
Solution: First find the class boundaries and the cumulative frequency distribution.
Marks            46-50       51-55       56-60       61-65       66-70       71-75       76-80
Class boundary   45.5-50.5   50.5-55.5   55.5-60.5   60.5-65.5   65.5-70.5   70.5-75.5   75.5-80.5
fi               4           8           15          5           9           5           4
Cum. frequency   4           12          27          32          41          46          50
$Q_1$: the measure of the (n/4)th value = 12.5th value, which lies in the group 55.5 - 60.5.
$$Q_1 = L + \frac{w}{f_{Q_1}}\left(\frac{n}{4} - CF\right) = 55.5 + \frac{5}{15}(12.5 - 12) \approx 55.7$$
$D_4$: the measure of the (4n/10)th value = 20th value, which lies in the group 55.5 - 60.5.
$$D_4 = L + \frac{w}{f_{D_4}}\left(\frac{4n}{10} - CF\right) = 55.5 + \frac{5}{15}(20 - 12) \approx 58.2$$
$P_7$: the measure of the (7n/100)th value = 3.5th value, which lies in the group 45.5 - 50.5.
$$P_7 = L + \frac{w}{f_{P_7}}\left(\frac{7n}{100} - CF\right) = 45.5 + \frac{5}{4}(3.5 - 0) = 49.875$$
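Since quartiles, deciles and percentiles for continuous data all use the same interpolation, one function with a fraction argument covers the three cases in this example. This is a sketch under the assumption that all classes share the same width; the name `grouped_quantile` is hypothetical.

```python
from itertools import accumulate

def grouped_quantile(lower_boundaries, width, frequencies, fraction):
    """Quantile = L + (w / f) * (fraction * n - CF) for the class containing the target position."""
    n = sum(frequencies)
    target = fraction * n
    cumulative = list(accumulate(frequencies))  # less-than cumulative frequencies
    # First class whose cumulative frequency reaches the target position.
    k = next(i for i, cf in enumerate(cumulative) if cf >= target)
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_boundaries[k] + width / frequencies[k] * (target - cf_before)

lower = [45.5, 50.5, 55.5, 60.5, 65.5, 70.5, 75.5]
freq = [4, 8, 15, 5, 9, 5, 4]

q1 = grouped_quantile(lower, 5, freq, 1 / 4)     # first quartile
d4 = grouped_quantile(lower, 5, freq, 4 / 10)    # fourth decile
p7 = grouped_quantile(lower, 5, freq, 7 / 100)   # seventh percentile
print(round(q1, 1), round(d4, 1), round(p7, 3))
```

With `fraction = 1 / 2` the same function gives the grouped median.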
2.4. Measures of Dispersion (Variation)
The measures of dispersion which are expressed in terms of the original units of a series are termed absolute measures. Such measures are not suitable for comparing the variability of two distributions which are expressed in different units of measurement or have different average sizes. Relative measures of dispersion are a ratio or percentage of a measure of absolute dispersion to an appropriate measure of central tendency, and are thus pure numbers independent of the units of measurement.
For comparing the variability of two distributions (even if they are measured in the same unit), we compute the relative measure of dispersion instead of the absolute measure of dispersion.
Various measures of dispersions are in use. The most commonly used measures of
dispersions are:
1) Range and relative range
2) Quartile deviation and coefficient of Quartile deviation
3) Mean deviation and coefficient of Mean deviation
4) Standard deviation and coefficient of variation.
The range is the largest score minus the smallest score. It is a quick and dirty measure of
variability, although when a test is given back to students they very often wish to know the
range of scores. Because the range is greatly affected by extreme scores, it may give a
distorted picture of the scores. The following two distributions have the same range, 13, yet
appear to differ greatly in the amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
R = L - S, where L = largest observation and S = smallest observation.
If data are given in the form of a continuous frequency distribution, the range is computed as:
R = UCL_k - LCL_1, where UCL_k is the upper class limit of the last class and LCL_1 is the lower class limit of the first class.
This is sometimes expressed as:
R = X_k - X_1, where X_k is the class mark of the last class and X_1 is the class mark of the first class.
The relative range (RR) is given by:
$$RR = \frac{L - S}{L + S} = \frac{R}{L + S}$$
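The claim that the two distributions above share the same range can be checked directly, together with the relative range. The function names in this sketch are illustrative.

```python
distribution_1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
distribution_2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

def value_range(values):
    """R = L - S: largest observation minus smallest observation."""
    return max(values) - min(values)

def relative_range(values):
    """RR = (L - S) / (L + S)."""
    return (max(values) - min(values)) / (max(values) + min(values))

for name, data in [("Distribution 1", distribution_1), ("Distribution 2", distribution_2)]:
    print(name, value_range(data), round(relative_range(data), 4))
```

Both measures depend only on the two extreme values, which is why they cannot distinguish the two distributions even though the scores in between are spread very differently.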
Variance
The variance is the average of the squared deviations from the mean. Recall that the sum of squared deviations is a minimum only when taken from the mean.
a) Population Variance ($\sigma^2$)
If we divide the variation (the sum of squared deviations) by the number of values in the population, we get the population variance.
For ungrouped data (individual series) for population data:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^2 - N\mu^2\right]$$
where μ is the population arithmetic mean.
For grouped data:
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N} = \frac{1}{N}\left[\sum f_i X_i^2 - N\mu^2\right]$$
where $X_i$ is the class mark of the i-th class, $f_i$ is the frequency of the i-th class and $N = \sum f_i$.
One would expect the sample variance to simply be the population variance with the population mean replaced by the sample mean. However, one of the major uses of a statistic is to estimate the corresponding parameter, and that formula has the problem that its value is, on average, not equal to the parameter. To counteract this, the sum of the squares of the deviations is divided by one less than the sample size, which gives an unbiased estimator.
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$$
If the values have frequencies $f_i$ (i = 1, 2, …, m), then the sample variance is given by:
$$S^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\left[\sum f_i x_i^2 - n\bar{x}^2\right]$$
xi   2   4   5   6   8
fi   2   2   3   1   2
Solution: Prepare the following table:
xi    fi   fixi   xi^2   fixi^2
2     2    4      4      8
4     2    8      16     32
5     3    15     25     75
6     1    6      36     36
8     2    16     64     128
Sum   10   49            279
$$S^2 = \frac{1}{n - 1}\left[\sum f_i x_i^2 - n\bar{x}^2\right] = \frac{1}{9}\left[279 - 10\left(\frac{49}{10}\right)^2\right] = \frac{1}{9}(38.9) = 4.32, \quad \text{and } S = \sqrt{4.32} = 2.08.$$
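The shortcut formula used in the table can be verified numerically. The sketch below also expands the frequency table into raw observations and compares against `statistics.variance`, which uses the same n - 1 divisor; the variable names are illustrative.

```python
import math
from statistics import variance

values = [2, 4, 5, 6, 8]
frequencies = [2, 2, 3, 1, 2]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, values))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, values))

# Shortcut formula: S^2 = (sum f x^2 - n * mean^2) / (n - 1).
sample_mean = sum_fx / n
s_squared = (sum_fx2 - n * sample_mean ** 2) / (n - 1)
s = math.sqrt(s_squared)

# Cross-check: repeat each value f_i times and use the library function.
expanded = [x for x, f in zip(values, frequencies) for _ in range(f)]
s_squared_check = variance(expanded)

print(n, sum_fx, sum_fx2, round(s_squared, 2), round(s, 2))
```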
1. If a constant is added to (or subtracted from) all the values, the variance remains the same.
Examples:
1. The mean and standard deviation of n Tetracycline capsules X1, X2, …, Xn are known to be 12 gm and 3 gm respectively. A new set of capsules of another drug is obtained by the linear transformation Yi = 2Xi - 0.5 (i = 1, 2, …, n). What will be the standard deviation of the new set of capsules?
2. The mean and the standard deviation of a set of numbers are respectively 500 and 10.
a. If 10 is added to each of the numbers in the set, what will be the variance and standard deviation of the new set?
b. If each of the numbers in the set is multiplied by -5, what will be the variance and standard deviation of the new set?
Solutions:
1. Subtracting 0.5 does not change the spread, and multiplying by k = 2 scales the standard deviation by |k|, so the new standard deviation = |k|S = 2*3 = 6.
2.
a) They will remain the same: the variance is 10^2 = 100 and the standard deviation is 10.
b) New standard deviation = |k|S = 5*10 = 50, and new variance = k^2*S^2 = 25*100 = 2500.
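The effect of the transformations on the standard deviation can be checked on any data set. The values below are hypothetical, chosen only for illustration, since the notes give summary statistics rather than raw data.

```python
from statistics import mean, stdev

x = [9, 10, 12, 14, 15]  # hypothetical observations

# Linear transformation Y = 2X - 0.5, as in the capsule example.
y = [2 * xi - 0.5 for xi in x]

# Adding a constant and multiplying by a negative constant.
shifted = [xi + 10 for xi in x]
scaled = [-5 * xi for xi in x]

print(mean(x), stdev(x))
print(mean(y), stdev(y))           # mean: 2 * mean(x) - 0.5, sd: 2 * stdev(x)
print(stdev(shifted))              # unchanged by the shift
print(stdev(scaled))               # 5 * stdev(x), since |k| = 5
```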
Solution:
$$C.V_M = \frac{S_M}{\bar{X}_M} \times 100 = \frac{25}{85} \times 100 = 29.41\%$$
$$C.V_{Ci} = \frac{S_{Ci}}{\bar{X}_{Ci}} \times 100 = \frac{12}{65} \times 100 = 18.46\%$$
The Z-score is the number of standard deviations that a given value X is below or above the mean. Values above the mean have positive Z-scores and values below the mean have negative Z-scores. The numerical value of the Z-score reflects the relative standing of the value; because of this, the Z-score is also referred to as a measure of relative standing.
Scores are generally meaningless by themselves unless they are compared to the distribution of scores from some reference group.
In addition to comparing data sets, the Z-score is useful for transforming a given data set to a standard scale with mean 0 and standard deviation 1 (the standard normal distribution when the data are normally distributed).
$$Z = \frac{X - \bar{X}}{S}$$
Example: What is the Z-score for the value of 14 in the following sample data set?
3 8 6 14 4 12 7 10
Solution:
$\bar{X}$ = 8 and S = 3.8173; thus,
$$Z = \frac{14 - 8}{3.8173} \approx 1.57$$
The data value of 14 is located 1.57 standard deviations above the mean of 8, because the Z-score is positive.
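The mean, standard deviation and Z-score in this example can be reproduced with the standard library. The helper name `z_score` is an illustrative choice.

```python
from statistics import mean, stdev

data = [3, 8, 6, 14, 4, 12, 7, 10]

def z_score(value, values):
    """Number of sample standard deviations that value lies from the sample mean."""
    return (value - mean(values)) / stdev(values)

z = z_score(14, data)
print(mean(data), round(stdev(data), 4), round(z, 2))
```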
Example: Suppose that a student scored 66 in Statistics and 80 in Mathematics. A summary of the scores in the two courses is given below.
Course        Average score   Standard deviation of the score
Statistics    51              12
Mathematics   72              16
In which course did the student score better compared to his classmates?
Solution:
Z-score of the student in Statistics:
$$Z = \frac{X - \mu}{\sigma} = \frac{66 - 51}{12} = \frac{15}{12} = 1.25$$
Z-score of the student in Mathematics:
$$Z = \frac{X - \mu}{\sigma} = \frac{80 - 72}{16} = \frac{8}{16} = 0.5$$
From these two standard scores, we can conclude that the student scored better in the Statistics course, relative to his classmates, than in the Mathematics course.
Exercise
1. Two groups of people were trained for a 100 km race and tested to find out which group is faster to complete the race. For the two groups the following information was given:
Value        Group one   Group two
Mean         10.4 min    11.9 min
Stan. dev.   1.2 min     1.3 min
Relatively speaking: