Chapter One
Definitions:
Statistics: The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in
making more effective decisions.
Descriptive statistics: Methods of organizing, summarizing, and presenting data in an informative
way.
Inferential statistics: The methods used to determine something about a population on the basis of a
sample.
Types of variables
Qualitative, e.g.
• Brand of PC
• Marital status
• Hair color
Quantitative
Discrete, e.g.
• Children in a family
• Strokes on a golf hole
• TV sets owned
Continuous, e.g.
• Amount of income tax paid
• Weight of a student
• Yearly rainfall
• Temperature
Levels of measurement
Data can be classified according to levels of measurement. The level of measurement of the data
often dictates the calculations that can be done to summarize and present the data. It also determines
the statistical tests that should be performed.
Levels of Measurement
Definition:
A frequency distribution is a grouping of data into mutually exclusive classes showing the number of
observations in each.
Step 1: Decide on the number of classes (k): Use the “2 to the k rule”.
This rule suggests that you select the smallest number (k) for the number of classes such that $2^k$ is
greater than the number of observations (n):
$2^k > n$
Generally the class width should be the same for all classes. The class width is determined using the
following formula:
$i \geq \frac{H - L}{k}$
Where i is the class width, H is the highest observed value, L is the lowest observed value, and k
is the number of classes.
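The "2 to the k" rule and the class width bound can be sketched in a few lines of Python. The function names and the sample inputs (50 observations, highest value 4.5, lowest value 2.0) are illustrative assumptions, not values prescribed by the text.

```python
def number_of_classes(n):
    """Smallest k such that 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def minimum_class_width(highest, lowest, k):
    """Lower bound on the class width i, from i >= (H - L) / k."""
    return (highest - lowest) / k


n = 50                       # hypothetical number of observations
k = number_of_classes(n)     # 2**5 = 32 is not enough, 2**6 = 64 is
i_min = minimum_class_width(4.5, 2.0, k)   # hypothetical H and L
print(k, i_min)
```

In practice the computed bound is usually rounded up to a convenient width, which is why the formula is stated as an inequality.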
Definition
Three charts that help portray a frequency distribution graphically are the histogram, frequency
polygon, and cumulative frequency polygon.
HISTOGRAM
A graph in which the classes are marked on the horizontal axis and the class frequencies on the
vertical axis. The class frequencies are represented by the heights of the bars, and the bars are drawn
adjacent to each other.
Example
FREQUENCY POLYGON
A frequency polygon consists of line segments connecting the points formed by intersections of the
class midpoints and the frequencies.
CUMULATIVE FREQUENCY DISTRIBUTIONS AND CUMULATIVE FREQUENCY POLYGON
Cumulative frequency polygon (OGIVE)
LINE GRAPHS
Line charts are particularly effective for business and economic data because they show change or
trends in a variable over time.
Example:
Year 1992 1993 1994 1995 1996 1997 1998 1999 2000
Unemployment_rate 17.8 13.7 11 10.2 11.3 9.1 8.8 8.5 7.9
PIE CHARTS
BAR CHART
1.2. Measures of central location
Arithmetic mean (AM) for ungrouped data:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Example:
For 10 years a company declared its percentage dividends as follows:
Year 1 2 3 4 5 6 7 8 9 10
Dividend (xi) 5 6 14 20 30 10 15 20 20 30
Calculate the average percentage dividend declared by the company during the 10 years.
Solution
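As a minimal sketch of this calculation, the arithmetic mean is the sum of the ten dividends divided by the number of years. The variable names are illustrative.

```python
# Percentage dividends declared in years 1 to 10 (from the table above)
dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]

# Arithmetic mean: sum of the observations divided by their count
mean_dividend = sum(dividends) / len(dividends)
print(mean_dividend)
```

The dividends sum to 170, so the average dividend over the 10 years is 17%.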
The AM of a discrete frequency distribution is calculated as
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Solution
The AM of a grouped frequency distribution is calculated as
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Where $x_i$ is the class mid-point value of the ith class and
$f_i$ is the number of observations falling in the ith class.
Example:
The following frequency distribution summarizes data on service times in minutes at the
checkout counter of a supermarket.
Time interval (minutes)   Customers
2.00-<2.50   3
2.50-<3.00   8
3.00-<3.50   23
3.50-<4.00   10
4.00-<4.50   6
Calculate the estimated average time a customer takes for a checkout at the counter in this
supermarket.
Solution
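A minimal sketch of the estimate, using class mid-points as the x values in the frequency-weighted mean. It assumes the first class starts at 2.00 (as in the later copies of this table) and that every class has width 0.50; the variable names are illustrative.

```python
# Lower class limits (assuming the first class starts at 2.00) and frequencies
lower_limits = [2.00, 2.50, 3.00, 3.50, 4.00]
width = 0.50
customers = [3, 8, 23, 10, 6]

# The class mid-point stands in for every observation in the class
midpoints = [lower + width / 2 for lower in lower_limits]

# Weighted mean: sum(f_i * x_i) / sum(f_i)
total_fx = sum(f * x for f, x in zip(customers, midpoints))
mean_time = total_fx / sum(customers)
print(round(mean_time, 2))
```

With these assumptions the weighted sum is 166.5 over 50 customers, giving an estimated average service time of about 3.33 minutes.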
2. Median (Mdn/Md)
The median is defined as the middle value when the data are arranged in ascending order. It divides
the data set into two equal parts.
Steps:
3. If the number of observations (n) in the data set is even, then the median is given by the
average of the values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
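The rule for an even number of observations can be sketched as below. The odd case (take the single middle value) is the usual convention and is an assumption here, since only the even case is spelled out in the text; the function name is illustrative, and the data are the ten dividends from the arithmetic mean example.

```python
def median(values):
    """Median of ungrouped data."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    if n % 2 == 0:
        # Positions n/2 and n/2 + 1 (1-based) are indexes n/2 - 1 and n/2
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    # Assumed odd case: the single middle value
    return ordered[n // 2]


dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]
print(median(dividends))
```

For the ten dividends, the 5th and 6th ordered values are 15 and 20, so the median is 17.5.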
Example:
Steps:
Example:
Annual profit   Number of outlets
10   3
15   8
20   23
25   10
30   6
Calculate the median for the annual profit.
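One way to sketch this calculation is to build the cumulative frequencies and read off the value held by the middle observation(s). The helper names are illustrative, and the handling of an odd total is an assumed convention.

```python
from itertools import accumulate

profits = [10, 15, 20, 25, 30]   # annual profit values from the table
outlets = [3, 8, 23, 10, 6]      # number of outlets (frequencies)


def value_at_position(position, values, cumulative):
    """Value held by the observation at a 1-based rank."""
    for value, cum in zip(values, cumulative):
        if position <= cum:
            return value
    raise ValueError("position exceeds the number of observations")


def frequency_table_median(values, freqs):
    cumulative = list(accumulate(freqs))
    n = cumulative[-1]
    if n % 2 == 0:
        # Average of the observations in positions n/2 and n/2 + 1
        lower = value_at_position(n // 2, values, cumulative)
        upper = value_at_position(n // 2 + 1, values, cumulative)
        return (lower + upper) / 2
    # Assumed odd case: the observation in position (n + 1) / 2
    return value_at_position((n + 1) // 2, values, cumulative)


median_profit = frequency_table_median(profits, outlets)
print(median_profit)
```

Here n = 50, and the 25th and 26th observations both fall in the group with cumulative frequency 34, so the median annual profit is 20.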
Calculating the median for a grouped frequency distribution with equal intervals
Steps:
$M_d = l_{M_d} + \frac{h}{f}\left(\frac{n}{2} - F\right)$
Where,
n is the total frequency (the final cumulative frequency),
F is the cumulative frequency of the class immediately before the median class.
Example:
The following frequency distribution summarizes data on service times in minutes at the checkout
counter of a supermarket.
Time interval (minutes)   Customers
2.00-<2.50   3
2.50-<3.00   8
3.00-<3.50   23
3.50-<4.00   10
4.00-<4.50   6
Calculate the median time it takes for a customer to be checked out at the counter in this
supermarket.
Solution
Step 1
Time interval   Customers (fi)   Cumulative frequency (Fi)
2.00-<2.50   3   3
2.50-<3.00   8   11
3.00-<3.50   23   34
3.50-<4.00   10   44
4.00-<4.50   6   50
Step 2:
$\frac{n}{2} = \frac{50}{2} = 25$
Step 3: The cumulative frequency equal to or just greater than $\frac{n}{2}$ is 34.
Step 4: The median class is the class whose cumulative frequency is 34, that is, 3.00-<3.50.
Step 5: The median is found by using the interpolation formula $M_d = l_{M_d} + \frac{h}{f}\left(\frac{n}{2} - F\right)$
F = 11 is the cumulative frequency of the class immediately before the median class.
$M_d = 3 + \frac{0.5}{23}\left(\frac{50}{2} - 11\right) = 3.30$
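The five steps can be sketched as one small Python function. The function and variable names are illustrative, and the code assumes equal class widths, as the method requires.

```python
from itertools import accumulate


def grouped_median(lower_limits, width, freqs):
    """Median by interpolation: l + (h / f) * (n / 2 - F)."""
    cumulative = list(accumulate(freqs))       # Step 1
    n = cumulative[-1]
    half = n / 2                               # Step 2
    # Steps 3 and 4: first class whose cumulative frequency reaches n / 2
    idx = next(i for i, cum in enumerate(cumulative) if cum >= half)
    before = cumulative[idx - 1] if idx > 0 else 0   # F
    # Step 5: interpolate inside the median class
    return lower_limits[idx] + (width / freqs[idx]) * (half - before)


lower_limits = [2.00, 2.50, 3.00, 3.50, 4.00]
customers = [3, 8, 23, 10, 6]
median_time = grouped_median(lower_limits, 0.50, customers)
print(round(median_time, 2))
```

The median class is 3.00-<3.50 with F = 11 and f = 23, which reproduces the value of about 3.30 minutes.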
3. Mode ( M o )
The mode of a data set is the value that occurs with the greatest frequency.
It is the data point that occurs most frequently among the measurements that constitute a data set.
To find the mode of an ungrouped data set we simply observe the data value that occurs most
frequently in the data set.
Example:
Annual profit   Number of outlets
10   3
15   8
20   23
25   10
30   6
Calculate the mode for the annual profit.
Solution
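For a discrete frequency table the mode is simply the value with the largest frequency. A minimal sketch, with illustrative variable names; if two values tied for the largest frequency, this version would report only the first.

```python
profits = [10, 15, 20, 25, 30]   # annual profit values from the table
outlets = [3, 8, 23, 10, 6]      # number of outlets (frequencies)

# Locate the largest frequency and report the value that carries it
highest_frequency = max(outlets)
mode_profit = profits[outlets.index(highest_frequency)]
print(mode_profit)
```

The largest frequency is 23 outlets, so the modal annual profit is 20.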
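For a grouped frequency distribution, the mode is calculated using the following interpolation formula:
$M_o = l_{M_o} + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times h$
$= l_{M_o} + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$
Where
$M_o$ is the mode;
$l_{M_o}$ is the lower class limit of the modal class;
$f_1$ is the frequency of the modal class;
$f_0$ is the frequency of the class immediately before the modal class;
$f_2$ is the frequency of the class immediately after the modal class;
h is the width of the modal class.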
Definition:
Example
The following frequency distribution summarizes data on service times in minutes at the
checkout counter of a supermarket.
Time interval (minutes)   Customers
2.00-<2.50   3
2.50-<3.00   8
3.00-<3.50   23
3.50-<4.00   10
4.00-<4.50   6
Calculate the mode for the time it takes for a customer to be checked out at the counter in this
supermarket.
Solution:
$f_2 = 10$ is the frequency of the class immediately after the modal class
Therefore,
$M_o = 3 + \frac{23 - 8}{(23 - 8) + (23 - 10)} \times 0.5 = 3.27$
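The same interpolation can be sketched in Python. The function name is illustrative, and treating a missing neighbouring class (when the modal class is first or last) as having frequency 0 is an assumption not stated in the text.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode by interpolation: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    m = freqs.index(max(freqs))                      # modal class
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0                # class before (assumed 0 if none)
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0   # class after (assumed 0 if none)
    return lower_limits[m] + (f1 - f0) / (2 * f1 - f0 - f2) * width


lower_limits = [2.00, 2.50, 3.00, 3.50, 4.00]
customers = [3, 8, 23, 10, 6]
mode_time = grouped_mode(lower_limits, 0.50, customers)
print(round(mode_time, 2))
```

With f1 = 23, f0 = 8 and f2 = 10 the fraction is 15/28, which gives a modal service time of about 3.27 minutes.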
MEASURES OF DISPERSION
Partition values are values of a variable that divide a data set into a number of equal parts,
e.g. quartiles, deciles, and percentiles.
1. Quartiles
Quartiles of a data set are values (partition values) that divide the data set into four equal parts
when data are arranged in ascending order.
There are three quartiles: the lower quartile ($Q_1$), the middle quartile (second quartile, $Q_2$),
and the upper quartile ($Q_3$).
Calculating quartiles from frequency distributions
To calculate the kth quartile from a grouped frequency distribution, we use the following
procedure:
Step 1: Construct the less-than cumulative frequency distribution.
Step 2: Calculate $n_k = \frac{k}{4} \times n$
For $Q_1$, the value of k = 1
For $Q_2$, the value of k = 2
For $Q_3$, the value of k = 3
Step 3: Find the cumulative frequency equal to or just greater than the value of $\frac{k}{4} \times n$ calculated
in step 2.
Step 4: The kth quartile class is the class whose cumulative frequency is the
cumulative frequency found in step 3.
Step 5: The kth quartile is calculated using the following interpolation formula:
$Q_k = l_k + \frac{h}{f_k}\left(\frac{k}{4} \times n - F\right)$
Where
$Q_k$ is the kth quartile for the data set;
$l_k$ is the lower class limit of the kth quartile class;
h is the width of the kth quartile class;
$f_k$ is the frequency of the kth quartile class;
F is the cumulative frequency of the class immediately before the kth quartile class.
Example:
3-<7 14
7-<11 22
11-<15 11
15-<19 6
19-<23 3
Determine the first quartile, the second quartile, and the third quartile.
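A minimal sketch of the five-step quartile procedure applied to this table. The function and variable names are illustrative; the code assumes equal class widths of 4, as the class limits show.

```python
from itertools import accumulate


def grouped_quartile(k, lower_limits, width, freqs):
    """kth quartile by interpolation: l_k + (h / f_k) * (k * n / 4 - F)."""
    cumulative = list(accumulate(freqs))       # Step 1
    n = cumulative[-1]
    target = k * n / 4                         # Step 2
    # Steps 3 and 4: first class whose cumulative frequency reaches the target
    idx = next(i for i, cum in enumerate(cumulative) if cum >= target)
    before = cumulative[idx - 1] if idx > 0 else 0   # F
    # Step 5: interpolate inside the quartile class
    return lower_limits[idx] + (width / freqs[idx]) * (target - before)


lower_limits = [3, 7, 11, 15, 19]
freqs = [14, 22, 11, 6, 3]

q1, q2, q3 = (grouped_quartile(k, lower_limits, 4, freqs) for k in (1, 2, 3))
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

With n = 56 the targets are 14, 28 and 42, which fall in the first, second and third classes respectively, giving approximately 7.00, 9.55 and 13.18.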
2. Percentiles
The percentiles of a data set are values of a random variable that divide the data set into one hundred
equal parts, each containing 1% of the values, when the values are arranged in ascending order.
There are ninety-nine percentiles, called the first percentile, second percentile, …, and ninety-ninth
percentile.
The 50th percentile is the median of the data set.
The 25th percentile is the 1st quartile.
The 75th percentile is the 3rd quartile.
To calculate the kth percentile from a grouped frequency distribution, we use the following
procedure:
Step 1: Construct the less-than cumulative frequency distribution.
Step 2: Calculate $n_k = \frac{k}{100} \times n$
For $p_1$, the value of k = 1
For $p_2$, the value of k = 2
For $p_3$, the value of k = 3
.
.
.
For $p_{99}$, the value of k = 99
Step 3: Find the cumulative frequency equal to or just greater than the value of $\frac{k}{100} \times n$
calculated in step 2.
Step 4: The kth percentile class is the class whose cumulative frequency is the
cumulative frequency found in step 3.
Step 5: The kth percentile is calculated using the following interpolation formula:
$p_k = l_k + \frac{h}{f_k}\left(\frac{k}{100} \times n - F\right)$
Example:
3-<7 14
7-<11 22
11-<15 11
15-<19 6
19-<23 3
Determine the 65th percentile, the 70th percentile, and the 90th percentile
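The percentile procedure differs from the quartile one only in dividing by 100 instead of 4. A minimal sketch under the same assumptions (equal class widths of 4; illustrative names):

```python
from itertools import accumulate


def grouped_percentile(k, lower_limits, width, freqs):
    """kth percentile by interpolation: l_k + (h / f_k) * (k * n / 100 - F)."""
    cumulative = list(accumulate(freqs))
    n = cumulative[-1]
    target = k * n / 100
    # First class whose cumulative frequency is equal to or just greater than the target
    idx = next(i for i, cum in enumerate(cumulative) if cum >= target)
    before = cumulative[idx - 1] if idx > 0 else 0   # F
    return lower_limits[idx] + (width / freqs[idx]) * (target - before)


lower_limits = [3, 7, 11, 15, 19]
freqs = [14, 22, 11, 6, 3]

p65 = grouped_percentile(65, lower_limits, 4, freqs)
p70 = grouped_percentile(70, lower_limits, 4, freqs)
p90 = grouped_percentile(90, lower_limits, 4, freqs)
print(round(p65, 2), round(p70, 2), round(p90, 2))
```

With n = 56 the targets are 36.4, 39.2 and 50.4. The first two fall in the class 11-<15 and the third in 15-<19, giving approximately 11.15, 12.16 and 17.27. Calling the same function with k = 50 returns the grouped median, consistent with the 50th percentile being the median.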
Measures of spread are the range, inter-quartile range, semi-interquartile range (quartile deviation), variance,
standard deviation, and coefficient of variation.
1. Range
The range is the difference between the highest and lowest values in a data set.
Example:
18 26 17 10 7 27 24 17 17 23 29 28
18 10 23 16 9 12 26 5 12 23 22 24
16 5
xmax = 29
xmin = 5
Range = 29 − 5 = 24
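The same result can be checked with a short Python sketch over the 26 values listed above; the variable names are illustrative.

```python
data = [18, 26, 17, 10, 7, 27, 24, 17, 17, 23, 29, 28,
        18, 10, 23, 16, 9, 12, 26, 5, 12, 23, 22, 24,
        16, 5]

# Range = highest value minus lowest value
data_range = max(data) - min(data)
print(data_range)
```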
Definition
Quartiles of a data set are values (partition values) that divide the data set into four equal parts when
data are arranged in ascending order.
There are three quartiles called lower quartile, the middle quartile (second quartile), and upper quartile.
$IQR = Q_3 - Q_1$
$SIQR\,(Q.D) = \frac{Q_3 - Q_1}{2}$
Example:
Let
$Q_1 = 14.5$ days
$Q_2 = 18.89$ days
$Q_3 = 23.93$ days
$SIQR\,(Q.D) = \frac{23.93 - 14.5}{2} = 4.715$ days
Interpretation: 50% of all observations are expected to lie within 4.715 days on either side of the
median of 18.89 days. That is, 25% of observations are expected to lie within 4.715 days below the
median and 25% of observations are expected to lie within 4.715 days above the median value.
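The arithmetic in this example can be sketched as follows, using the quartile values given above; the variable names are illustrative.

```python
q1 = 14.5    # lower quartile, in days
q3 = 23.93   # upper quartile, in days

iqr = q3 - q1      # inter-quartile range
siqr = iqr / 2     # semi-interquartile range (quartile deviation)
print(round(iqr, 2), round(siqr, 3))
```

The inter-quartile range is 9.43 days, and half of it is the quartile deviation of 4.715 days.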
Because the variance satisfies these properties, it has become the most
commonly used measure of dispersion. It is extensively used in statistical analysis.
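For ungrouped data, the variance is calculated using the following formula:
$S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$
For data in a frequency distribution, the variance is calculated using the following formula:
$S_x^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2}{n - 1}$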
The variance is a measure of the average squared deviation about the arithmetic mean. It is
expressed in squared units. Consequently, its meaning in a practical sense is obscure.
Because of this interpretation problem, a measure that uses original units is derived from the
variance: Standard deviation.
5. Standard deviation
$S_x = \sqrt{S_x^2}$
The standard deviation describes how observations are spread about the mean.
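As a sketch, the sample variance (with divisor n − 1) and the standard deviation can be computed for the ten dividends from the arithmetic mean example, using both the definitional form and the shortcut form of the variance. Reusing that data set here is a choice made for illustration, and the variable names are likewise illustrative.

```python
import math
import statistics

dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]
n = len(dividends)
mean = sum(dividends) / n

# Definitional form: sum of squared deviations about the mean, over n - 1
var_definition = sum((x - mean) ** 2 for x in dividends) / (n - 1)

# Shortcut form: (sum of squares - n * mean^2) over n - 1
var_shortcut = (sum(x ** 2 for x in dividends) - n * mean ** 2) / (n - 1)

# Standard deviation: square root of the variance, in the original units
std_dev = math.sqrt(var_definition)

# Cross-check against the standard library's sample variance
assert math.isclose(var_definition, statistics.variance(dividends))
print(round(var_definition, 2), round(std_dev, 2))
```

Both forms give 692/9, about 76.89 squared percentage points, and the standard deviation is about 8.77 percentage points.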
6. Coefficient of variation
Sometimes it is necessary to compare samples of data from different random variables to
establish which sample shows greater variability. A direct comparison of their respective
standard deviations would be misleading, as the random variables may be measured in different
units. Thus, a meaningful comparison should be based on a measure of variability expressed in the
same units. This is achieved by producing a measure of relative variability (i.e. relative to the
mean), expressed in percentage terms, called the coefficient of variation.
$CV = \frac{S_x}{\bar{x}} \times 100\%$
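A minimal sketch of the coefficient of variation, again applied to the ten dividends from the arithmetic mean example (a data set chosen for illustration; the function name is also illustrative). It uses the sample standard deviation from Python's statistics module.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100


dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]
cv = coefficient_of_variation(dividends)
print(round(cv, 2))
```

For these data the standard deviation is about 8.77 against a mean of 17, so the coefficient of variation is roughly 51.58%. Because the result is a percentage, it can be compared with the coefficient of variation of a sample measured in different units.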
Example
TUTORIAL 1
1. A company employs 12 persons in managerial positions. Their seniority (in years of service)
and sex are listed below:
Sex F M F M F M M F F F F M
Seniority (yrs) 8 15 6 2 9 21 9 3 4 7 2 10
Find the seniority mean, the seniority median and the seniority mode for the above data set.
2. The daily percentage change (to the nearest percentage) of an equity traded on the JSE was
monitored for 100 days by an investment analyst. These daily percentage changes were
summarized into the frequency distribution below.
Daily percentage change of an equity (%)   Number of days
2 15
3 30
4 25
5 19
6 8
7 2
8 1
Find the mean daily percentage change, the median daily percentage change, and the modal
daily percentage change.
Hourly earnings (Rands)   Number of women   Number of men
4.70-4.90 6 5
4.90-5.10 31 16
5.10-5.30 15 25
5.30-5.50 29 30
5.50-5.70 19 24
Calculate the mean, median and the mode of the hourly earnings for the men
4. The annual earnings of a company’s salesmen at its Johannesburg and Cape Town offices
are as follows:
Earnings (R1000s)   Number of salesmen: Johannesburg   Number of salesmen: Cape Town
6-<8 3 2
8-<10 7 3
10-<12 13 6
12-<14 17 8
14-<16 4 3
16-<20 4 2
20-<25 2 6
(a) Compare the salesmen’s earnings in the Johannesburg and Cape Town offices by finding the
means, medians and quartile deviations.
(b) Find the standard deviation.