Introduction To Statistics
OBJECTIVES:
DISCUSSION:
Probability and statistics are the branches of mathematics concerned with the laws governing
random events, including the collection, analysis, interpretation, and display of numerical data.
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.
-The variability or dispersion concerns how spread out the values are.
Research example
You want to study the popularity of different leisure activities by gender. You distribute a survey
and ask participants how many times they did each of the following in the past year:
Go to a library
Your data set is the collection of responses to the survey. Now you can use descriptive statistics
to find out the overall frequency of each activity (distribution), the averages for each activity
(central tendency), and the spread of responses for each activity (variability).
Frequency distribution
A data set is made up of a distribution of values, or scores. In tables or graphs, you can
summarize the frequency of every possible value of a variable in numbers or percentages.
Simple FDT
For the variable of gender, you list all possible answers in the left-hand column. You count the
number or percentage of responses for each answer and display it in the right-hand column.
Gender Number
Man 182
Woman 235
No answer 27
From this table, you can see that more women than men took part in the study.
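As a small sketch, the same table can be converted from counts to percentages in Python. The counts are copied from the table above; the variable names are illustrative only.

```python
# Counts copied from the gender table above.
counts = {"Man": 182, "Woman": 235, "No answer": 27}

total = sum(counts.values())  # total number of respondents

# Percentage of responses for each answer, rounded to one decimal place.
percentages = {answer: round(100 * n / total, 1) for answer, n in counts.items()}

for answer, n in counts.items():
    print(f"{answer:<10} {n:>4} {percentages[answer]:>5.1f}%")
```

The percentage column makes the comparison between groups independent of the total number of respondents.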
Grouped FDT
In a grouped frequency distribution, you can group numerical response values and add up the
number of responses for each group. You can also convert each of these numbers to
percentages.
Library visits in the past year   Percent
0–4                               6%
5–8                               20%
9–12                              42%
13–16                             24%
17+                               8%
From this table, you can see that most people visited the library between 5 and 16 times in the
past year.
Measures of central tendency indicate the center of a set of data arranged in order of magnitude.
A measure of central tendency is the point about which the scores tend to cluster and is hence regarded
as a sort of average of the series. It is a single number that describes the totality of the data collected.
When computed from a population it is a parameter; when computed from a sample it is a statistic.
There are three measures of central tendency commonly used. These are: the mean, median and
mode.
1. Mean or arithmetic mean (or average) is the most popular and well-known measure of central
tendency. It can be used with both discrete and continuous data (although it is used most often with
continuous data).
For Ungrouped data: The mean is denoted by the symbol µ (read as "mu") for the population and
X̄ (read as "x-bar") for the sample.
The mean of a series of values is equal to the sum of the values divided by the number of values.
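Symbolically, the population mean is:
\[ \mu = \frac{\sum X_i}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} \]
where:
µ - population mean
and the sample mean is:
\[ \bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} \]
where:
X̄ - sample mean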
Example 1: The items listed below represent the scores of seven BS Applied Statistics
students during the final examination. Compute the mean score. 89, 75, 90, 85, 78, 87 and 80.
\[ \bar{X} = \frac{\sum X_i}{n} = \frac{89 + 75 + 90 + 85 + 78 + 87 + 80}{7} = \frac{584}{7} = 83.43 \]
Example 2: The mean height (in cm) of 10 students is
\[ \mu = \frac{\sum X_i}{N} = \frac{170 + 165 + 155 + 160 + 150 + 149 + 152 + 161 + 163 + 175}{10} = \frac{1600}{10} = 160 \text{ cm} \]
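The two ungrouped means can be checked with a few lines of Python. This is only a sketch using the standard-library `statistics` module; the list names are illustrative.

```python
from statistics import mean

exam_scores = [89, 75, 90, 85, 78, 87, 80]
heights_cm = [170, 165, 155, 160, 150, 149, 152, 161, 163, 175]

mean_score = mean(exam_scores)    # 584 / 7
mean_height = mean(heights_cm)    # 1600 / 10

print(round(mean_score, 2), mean_height)
```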
For Grouped Data: The mean for grouped data is denoted by µG for the population and X̄G for the
sample.
\[ \mu_G = \frac{\sum f_i X_i}{N} = \frac{f_1 X_1 + f_2 X_2 + f_3 X_3 + \dots + f_k X_k}{N} \]
Where:
µG - mean of the grouped data
Xi - class mark of the ith class
fi - frequency of the ith class
∑ - sum of all values
N - total number of observations
Example: The table below represents the scores of 64 students in a long quiz.
Class Interval   Frequency (fi)   Class Mark (Xi)   fiXi
5-9              7                7                 49
10-14            10               12                120
15-19            13               17                221
20-24            18               22                396
25-29            8                27                216
30-34            5                32                160
35-39            3                37                111
Total            N=64                               ∑fiXi=1273
N = f1 + f2 + f3 + ... + fk
  = 7 + 10 + 13 + 18 + 8 + 5 + 3
  = 64
∑fiXi = f1X1 + f2X2 + f3X3 + ... + fkXk
      = 7(7) + 10(12) + 13(17) + 18(22) + 8(27) + 5(32) + 3(37)
      = 1273
\[ \mu_G = \frac{\sum f_i X_i}{N} = \frac{1273}{64} = 19.89 \]
Note: Add another column to the frequency distribution table for the product of the ith frequency
and the ith class mark, then take the sum of this column.
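The grouped-mean computation can be sketched in Python as follows. The class marks and frequencies are copied from the table above; the variable names are illustrative.

```python
# Class marks and frequencies from the long-quiz table.
class_marks = [7, 12, 17, 22, 27, 32, 37]
frequencies = [7, 10, 13, 18, 8, 5, 3]

n_obs = sum(frequencies)                                        # N
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))   # sum of f_i * X_i
grouped_mean = sum_fx / n_obs

print(n_obs, sum_fx, round(grouped_mean, 2))
```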
Weighted Mean
\[ \mu_w = \frac{\sum W_i X_i}{\sum W_i} \]
where:
µw - weighted mean
Xi - ith quantity
Wi - weight of the ith quantity
∑ - sum of all values
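A minimal sketch of a weighted mean in Python, assuming the usual definition (the sum of each quantity times its weight, divided by the sum of the weights). The grades and unit weights below are hypothetical values chosen only for illustration; they are not taken from this handout.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical grades and course units (assumed for illustration only).
grades = [1.5, 2.0, 1.75]
units = [3, 5, 2]

print(weighted_mean(grades, units))
```

Courses with more units pull the result toward their grade, which is the point of weighting.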
Example: Consider the grade of a freshman student during the first semester.
Properties of the Mean
1. The sum of the deviations of the observations from the mean is zero. The deviation of the ith
observation from the mean is denoted by
di = Xi - µ
For example, for the observations 3, 8, and 4, whose mean is µ = 5:
d1 = X1 - 5 = 3 - 5 = -2
d2 = X2 - 5 = 8 - 5 = 3
d3 = X3 - 5 = 4 - 5 = -1
∑ di = d1 + d2 + d3 = -2 + 3 + (-1) = 0
2. The sum of the squared deviations of the observations from the mean is a minimum; that is,
∑(Xi - a)² is smallest when a = µ.
3. The mean reflects the magnitude of every observation, since every observation contributes to the value
of the mean.
4. The mean is easily affected by the presence of an extreme value, hence it is not a good measure of
central tendency when extreme values occur.
5. The means of subgroups may be combined when properly weighted; the combined mean is called the
weighted mean.
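The first two properties can be checked numerically. The sketch below uses the three observations 3, 8, and 4 from property 1; the helper name `sum_sq_dev` is illustrative.

```python
values = [3, 8, 4]
mu = sum(values) / len(values)   # mean of the three observations

# Property 1: deviations from the mean sum to zero.
deviations = [x - mu for x in values]


def sum_sq_dev(a):
    """Sum of squared deviations of the observations from the point a."""
    return sum((x - a) ** 2 for x in values)


# Property 2: the sum of squared deviations is smallest at a = mu.
print(sum(deviations), sum_sq_dev(mu), sum_sq_dev(4), sum_sq_dev(6))
```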
2. Median is the middle score of a set of data arranged in order of magnitude. The median is best used
when the data have several extreme entries.
For Ungrouped data: The median is defined as the middle value when a set of observed values has
been arranged in either ascending (from lowest to highest value) or descending (from highest to lowest
value) order of magnitude. The median divides the array into two equal parts; that is, 50% of the
observations are less than the median value while the other 50% are greater than the median
value. The median is denoted by Md.
Symbolically, if a given set of data is denoted by X1, X2, . . ., XN, the array is denoted by X(1), X(2), . . .,
X(N). The median is
\[ Md = X_{((N+1)/2)} \quad \text{if } N \text{ is odd} \]
\[ Md = \frac{X_{(N/2)} + X_{(N/2+1)}}{2} \quad \text{if } N \text{ is even} \]
Example 1. The items listed below present the scores of seven graduate students during the final
examination. Compute the median score. 89, 75, 90, 85, 78, 87, and 80.
Example 2. Suppose the MA Math program has 10 graduate students whose heights (in cm) are as
follows: 170, 165, 155, 160, 150, 149, 152, 161, 163, 175. Find the median height of the graduate
students.
x(1) = 149 x(2) = 150 x(3) = 152 x(4) = 155 x(5) = 160
x(6) = 161 x(7) = 163 x(8) = 165 x(9) = 170 x(10) = 175
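Both median examples can be worked with the standard-library `statistics.median`, which sorts the data and applies the odd or even rule automatically. This is a sketch for checking the hand computation; the list names are illustrative.

```python
from statistics import median

exam_scores = [89, 75, 90, 85, 78, 87, 80]                        # Example 1, N = 7 (odd)
heights_cm = [170, 165, 155, 160, 150, 149, 152, 161, 163, 175]   # Example 2, N = 10 (even)

median_score = median(exam_scores)    # middle value of the sorted array, X(4)
median_height = median(heights_cm)    # mean of X(5) and X(6)

print(sorted(exam_scores), median_score)
print(sorted(heights_cm), median_height)
```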
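For Grouped data: The median of grouped data can be approximated using the formula
\[ Md = L_{md} + C \left( \frac{\frac{N}{2} - F_b}{f_{md}} \right) \]
where
Lmd - Lower CB (class boundary) of the median class
C - Class size
Fb - <CF (less-than cumulative frequency) before the median class
N - Total number of observations
fmd - Frequency of the median class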
Note: The median class is the class containing the middle value. It is the class which contains the (N/2)th
observation. This can be easily identified under the less-than cumulative frequency column.
Example: The table below presents the scores of 64 students in a long quiz
Class Interval   Frequency   Class Mark   Class Boundary   <CF   Array
5-9              7           7            4.5 – 9.5        7     X(1), X(2), . . ., X(7)
10-14            10          12           9.5 – 14.5       17    X(8), X(9), . . ., X(17)
15-19            13          17           14.5 – 19.5      30    X(18), X(19), . . ., X(30)
20-24            18          22           19.5 – 24.5      48    X(31), X(32), . . ., X(48)
25-29            8           27           24.5 – 29.5      56    X(49), X(50), . . ., X(56)
30-34            5           32           29.5 – 34.5      61    X(57), X(58), . . ., X(61)
35-39            3           37           34.5 – 39.5      64    X(62), X(63), X(64)
Total            N=64
The middle value is the X(32) observation and it falls under the class interval 20 – 24.
\[ Md = 19.5 + 5 \left( \frac{\frac{64}{2} - 30}{18} \right) = 20.0556 \]
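As a sketch, the grouped-median interpolation can be written as a small Python function: locate the class whose less-than cumulative frequency first reaches N/2, then interpolate within that class from its lower class boundary. The function name and argument layout are illustrative assumptions, not part of the handout.

```python
def grouped_median(lower_boundaries, class_size, frequencies):
    """Approximate the median of grouped data by linear interpolation."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency table is empty")
    cum_before = 0  # less-than cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cum_before + f >= n / 2:
            # This is the median class: interpolate inside it.
            return lower + class_size * (n / 2 - cum_before) / f
        cum_before += f
    raise ValueError("median class not found")


# Lower class boundaries and frequencies from the long-quiz table.
lower_cb = [4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5]
freq = [7, 10, 13, 18, 8, 5, 3]

print(round(grouped_median(lower_cb, 5, freq), 4))
```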
Properties of Median
1. The median is a positional value and hence is not affected by the presence of an extreme value, unlike
the mean.
2. The sum of the absolute deviations from a point, say "a", is minimum when a = Md; that is,
∑│Xi – Md│ is minimum.
3. The median is not amenable to further computation, and hence medians of subgroups cannot be
combined in the same manner as the mean.
4. The median of grouped data can be calculated even with open-ended intervals, provided the median
class is not open-ended.
3. Mode is the most frequent score in the data set. It is sometimes considered as the most popular option.
Example 1. Consider the data set 1, 2, 2, 2, 8, 1, 4, 10. The most frequently occurring observation is 2
which appeared thrice. Thus, the mode is 2, and since there is only one mode, then the distribution is
unimodal.
Example 2. Suppose BS Applied Statistics has 10 students whose heights (in cm) are as follows: 170,
165, 155, 160, 150, 149, 152, 161, 163, 175.
Since all values occur with equal frequency, this data set has no mode.
Example 3. Result of the survey of the color of cars owned by faculty shows that 40 were white, 20 blue
and 10 were red. The modal color of cars owned by faculty is white.
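The three mode examples can be reproduced with a short Python sketch. The helper `modes` is illustrative; it follows this handout's convention that a data set in which every value is equally frequent has no mode, and it uses the ten heights from the earlier examples.

```python
from collections import Counter


def modes(data):
    """Return the list of modes; empty if all values are equally frequent."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # no value stands out, so there is no mode
    return [value for value, c in counts.items() if c == highest]


scores = [1, 2, 2, 2, 8, 1, 4, 10]
heights_cm = [170, 165, 155, 160, 150, 149, 152, 161, 163, 175]
car_colors = ["white"] * 40 + ["blue"] * 20 + ["red"] * 10

print(modes(scores), modes(heights_cm), modes(car_colors))
```

The last case shows that the mode also applies to categorical data such as colors, where a mean or median would not make sense.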
For Grouped data: The mode of grouped data can be approximated using the formula
\[ Mo = L_{mo} + C \left( \frac{f_{mo} - f_b}{2 f_{mo} - f_b - f_a} \right) \]
where
Lmo - Lower CB of the modal class
C - Class size
fmo - frequency of the modal class
fb - frequency before the modal class
fa - frequency after the modal class
Example: The table below represents the scores of 64 students in a long quiz.
\[ Mo = 19.5 + 5 \left( \frac{18 - 13}{2(18) - 13 - 8} \right) = 21.17 \]
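The grouped-mode approximation can be sketched as a one-line Python function. The function and parameter names are illustrative; the inputs are the modal class 20-24 of the long-quiz table (lower class boundary 19.5, frequency 18) and its neighbouring frequencies 13 and 8.

```python
def grouped_mode(lower_boundary, class_size, f_modal, f_before, f_after):
    """Approximate the mode of grouped data from the modal class and its neighbours."""
    denominator = 2 * f_modal - f_before - f_after
    if denominator == 0:
        raise ValueError("modal class does not stand out from its neighbours")
    return lower_boundary + class_size * (f_modal - f_before) / denominator


mode_estimate = grouped_mode(19.5, 5, f_modal=18, f_before=13, f_after=8)
print(round(mode_estimate, 2))
```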
Classes are mutually exclusive categories, each defined by a lower limit and an upper limit, with
equal intervals.
Class frequency is the number of observations in each class.
Class mark or class midpoint is used in computing the mean and some measures of
variability.
Cumulative frequency tells the sum of the frequencies up to and including a particular class of interest.
Relative frequency tells the percentage of observations in a particular class of interest.
2. Determine the number of classes, K, into which the data are to be grouped using the
Sturges Approximation:
K = 1 + 3.322 log N
where N = total number of values to be grouped and log is the common (base-10) logarithm.
Remarks:
144 112 156 122 168 172 141 159 127 154
156 145 134 137 123 149 144 160 136 139
142 138 159 151 147 150 126 152 147 136
135 132 146 133 150 122 139 149 152 129
131 155 116 140 145 135 160 125 172 163
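As a sketch, the Sturges Approximation can be applied to the 50 values listed above. Rounding K up to the next whole number is an assumption made here for illustration; the handout's own rounding rule is not shown.

```python
import math


def sturges_classes(n):
    """Sturges Approximation: K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)


values = [
    144, 112, 156, 122, 168, 172, 141, 159, 127, 154,
    156, 145, 134, 137, 123, 149, 144, 160, 136, 139,
    142, 138, 159, 151, 147, 150, 126, 152, 147, 136,
    135, 132, 146, 133, 150, 122, 139, 149, 152, 129,
    131, 155, 116, 140, 145, 135, 160, 125, 172, 163,
]

k = sturges_classes(len(values))
suggested_classes = math.ceil(k)  # assumed rounding rule: round up

print(len(values), round(k, 2), suggested_classes)
```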
Measures of Dispersion
Measures of dispersion identify how a set of values spreads or fluctuates. The measures of
dispersion are the range, the variance, the standard deviation, the coefficient
of variation, the coefficient of skewness and the boxplot.
1. Range is the simplest measure of dispersion. It is the difference between the highest and lowest
score. It does not reflect the variation in the data lying between the highest and the lowest scores;
therefore, it is not considered to be a valid measure of variability and spread.
For Ungrouped data: The range of a set of data is the absolute difference between the highest and the
lowest value in the set. The range is denoted by R.
R = │HV - LV│
where
R - Range
HV – Highest value
LV - Lowest value
Example 2. Suppose the BS Applied Math program has 10 students whose heights (in cm) are as follows:
170, 165, 155, 160, 150, 149, 152, 161, 163, 175. Find the range of the heights of the BSAM students.
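For Grouped data: The range of grouped data is the absolute difference between the upper limit of the
highest class and the lower limit of the lowest class. It is denoted by RG.
RG = │ULHC – LLLC│
where:
RG - Range of the grouped data
ULHC - Upper Limit of the Highest Class
LLLC - Lower Limit of the Lowest Class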
Example 3. The table below represents the scores of 64 students in a long quiz.
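Both range computations can be sketched in Python. The heights come from Example 2 and the class intervals from the long-quiz frequency table used throughout this handout; the variable names are illustrative.

```python
# Ungrouped data: R = |HV - LV|
heights_cm = [170, 165, 155, 160, 150, 149, 152, 161, 163, 175]
ungrouped_range = abs(max(heights_cm) - min(heights_cm))

# Grouped data: RG = |ULHC - LLLC|, using the class limits of the long-quiz table.
class_intervals = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
upper_limit_highest = max(upper for _, upper in class_intervals)
lower_limit_lowest = min(lower for lower, _ in class_intervals)
grouped_range = abs(upper_limit_highest - lower_limit_lowest)

print(ungrouped_range, grouped_range)
```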
2. Variance takes into account the variation or spread of all items in a series about the point of
central tendency.
The variance considers the position of each observation relative to the mean. The variance of a
given data set is the average of the squared deviations of the observations from the mean. The
variance is denoted by σ² (read as "sigma squared") for the population and s² (read as "s squared") for
the sample.
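For Ungrouped data: Given the set of values X1, X2, X3, . . ., XN, the deviation of the ith observation from
the mean is Xi - µ. The population variance, σ², is
\[ \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + (X_3 - \mu)^2 + \dots + (X_N - \mu)^2}{N} \]
The computational formula of the variance is
\[ \sigma^2 = \frac{\sum X_i^2}{N} - \mu^2 = \frac{X_1^2 + X_2^2 + X_3^2 + \dots + X_N^2}{N} - \mu^2 \]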
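Example 4. The following data present the scores of 7 BS Applied Statistics students in a quiz:
X1=4, X2=7, X3=8, X4=2, X5=2, X6=9, X7=3.
\[ \mu = \frac{4 + 7 + 8 + 2 + 2 + 9 + 3}{7} = 5 \]
\[ \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{(X_1 - 5)^2 + (X_2 - 5)^2 + (X_3 - 5)^2 + \dots + (X_7 - 5)^2}{7} \]
\[ \sigma^2 = \frac{(4-5)^2 + (7-5)^2 + (8-5)^2 + (2-5)^2 + (2-5)^2 + (9-5)^2 + (3-5)^2}{7} \]
\[ \sigma^2 = 7.42857 \]
Using the computational formula
\[ \sigma^2 = \frac{\sum X_i^2}{N} - \mu^2 = \frac{4^2 + 7^2 + 8^2 + 2^2 + 2^2 + 9^2 + 3^2}{7} - 5^2 = 7.43 \]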
Note: Using the definitional or computational formula, the population variance is the same, but the
computational formula is faster and easier to apply than the definitional formula.
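The agreement between the two formulas can be checked numerically. The sketch below uses the seven scores of Example 4 and compares both hand formulas with `statistics.pvariance` from the Python standard library.

```python
from statistics import pvariance

scores = [4, 7, 8, 2, 2, 9, 3]
n = len(scores)
mu = sum(scores) / n

# Definitional formula: average squared deviation from the mean.
definitional = sum((x - mu) ** 2 for x in scores) / n

# Computational formula: mean of the squares minus the square of the mean.
computational = sum(x * x for x in scores) / n - mu ** 2

print(round(definitional, 5), round(computational, 5), round(pvariance(scores), 5))
```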
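The sample variance, s², is
\[ s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + (X_3 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1} \]
The computational formula of the sample variance is
\[ s^2 = \frac{n \sum X_i^2 - (\sum X_i)^2}{n(n - 1)} \]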
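Example 5. For a sample of n = 10 scores with mean X̄ = 4.9:
\[ s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{(4 - 4.9)^2 + (7 - 4.9)^2 + (8 - 4.9)^2 + \dots + (7 - 4.9)^2}{9} = 7.21 \]
Using the computational formula
\[ \sum X_i^2 = X_1^2 + X_2^2 + X_3^2 + \dots + X_n^2 = 4^2 + 7^2 + 8^2 + \dots + 7^2 = 305 \]
\[ s^2 = \frac{n \sum X_i^2 - (\sum X_i)^2}{n(n - 1)} = \frac{10(305) - (49)^2}{10(10 - 1)} = \frac{649}{90} = 7.21 \]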
For Grouped data: The variance of grouped data can be obtained using the formulas
\[ \sigma_G^2 = \frac{\sum f_i X_i^2}{N} - \mu_G^2 = \frac{f_1 X_1^2 + f_2 X_2^2 + f_3 X_3^2 + \dots + f_k X_k^2}{N} - \mu_G^2 \]
for the population, and
\[ S_G^2 = \frac{n \sum f_i X_i^2 - (\sum f_i X_i)^2}{n(n - 1)} \]
for the sample,
where
fi - the frequency of the ith class
Xi - the class mark of the ith class
µG - the mean from the grouped data
Example 6: The table below represents the scores of 64 students in a long quiz.
\[ \sigma_G^2 = \frac{\sum f_i X_i^2}{N} - \mu_G^2 = \frac{29311}{64} - \left( \frac{1273}{64} \right)^2 = 62.347412 \approx 62.35 \]
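The grouped variance can be sketched in Python from the class marks and frequencies of the long-quiz table. Both the population form and the sample form (with n(n - 1) in the denominator) are computed; the variable names are illustrative.

```python
class_marks = [7, 12, 17, 22, 27, 32, 37]
frequencies = [7, 10, 13, 18, 8, 5, 3]

n_obs = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))        # sum of f_i * X_i
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))   # sum of f_i * X_i^2

# Population variance of the grouped data.
pop_var_grouped = sum_fx2 / n_obs - (sum_fx / n_obs) ** 2

# Sample variance of the grouped data.
sample_var_grouped = (n_obs * sum_fx2 - sum_fx ** 2) / (n_obs * (n_obs - 1))

print(sum_fx2, round(pop_var_grouped, 2), round(sample_var_grouped, 2))
```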
Standard deviation is based on the deviations of all the scores in a series. It is always computed from
the mean. The standard deviation is defined as the positive square root of the variance. It is
denoted by σ for the population standard deviation and s for the sample standard
deviation.
\[ \sigma = \sqrt{\sigma^2}, \quad s = \sqrt{s^2}, \quad \sigma_G = \sqrt{\sigma_G^2}, \quad S_G = \sqrt{S_G^2} \]
Example 7. Using the data in example 4, compute the population standard deviation.
From example 4, the population variance was 7.43, so the population standard deviation is σ = 2.7258.
Example 8. Using the data in example 5, compute the sample standard deviation.
From example 5, the sample variance was 7.21, so the sample standard deviation is s =
2.68514.
Example 9. Using the data in example 6, compute the population and sample standard deviation.
From example 6, the population variance was 62.35, so the population standard deviation is σG =
7.896. The sample variance is 63.34, so the sample standard deviation is SG = 7.958.
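The standard deviations can be checked with a short Python sketch. It uses the seven scores of Example 4 and the grouped totals of Example 6 (N = 64, ∑fiXi = 1273, ∑fiXi² = 29311). Because the code carries the unrounded variance into the square root, the last digits can differ slightly from values obtained by rounding the variance first.

```python
import math
from statistics import pstdev

# Example 4 scores: population standard deviation.
scores = [4, 7, 8, 2, 2, 9, 3]
sigma = pstdev(scores)

# Example 6 grouped totals.
n_obs, sum_fx, sum_fx2 = 64, 1273, 29311
sigma_g = math.sqrt(sum_fx2 / n_obs - (sum_fx / n_obs) ** 2)                   # population
s_g = math.sqrt((n_obs * sum_fx2 - sum_fx ** 2) / (n_obs * (n_obs - 1)))       # sample

print(round(sigma, 4), round(sigma_g, 3), round(s_g, 3))
```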
The standard deviation has the same properties as the variance, except that
the unit of measure of the standard deviation is the same as the unit of measure of the raw data.
Coefficient of variation, also known as relative dispersion, is the ratio of the standard deviation to the
mean and is usually expressed in percent; i.e.,
\[ CV = \frac{s}{\bar{X}} \times 100 \quad \text{or} \quad CV = \frac{\sigma}{\mu} \times 100 \]
The coefficient of variation is a unitless measure of dispersion, hence it can be used to compare
variability of two or more groups of data measured in the same or different units.
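As a sketch of such a comparison, the coefficient of variation can be computed for two data sets from this handout that are measured in different units: the quiz scores of Example 4 (points) and the ten student heights (cm). Treating both as populations is an assumption made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(data):
    """CV in percent: population standard deviation divided by the mean, times 100."""
    return pstdev(data) / mean(data) * 100


quiz_scores = [4, 7, 8, 2, 2, 9, 3]
heights_cm = [170, 165, 155, 160, 150, 149, 152, 161, 163, 175]

cv_scores = coefficient_of_variation(quiz_scores)
cv_heights = coefficient_of_variation(heights_cm)

print(round(cv_scores, 2), round(cv_heights, 2))
```

Although the heights have the larger standard deviation in absolute terms, the quiz scores are far more variable relative to their mean.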
Skewness is a measure or a criterion on how asymmetric the distribution of data is from the mean.
Positive skewness indicates a distribution with an asymmetric tail extending toward the right side of the
distribution while negative skewness indicates a distribution with an asymmetric tail extending toward the
left.
Example: Consider the data 4, 7, 8, 2, 8, 8, 9, 2, 5, 7.
Using the measures of central tendency, tell whether the given data are symmetric,
skewed to the left, or skewed to the right.
Since Mean (6) < Median (7) < Mode (8), the distribution is negatively skewed.
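The comparison of mean, median, and mode can be sketched in Python. The ordering rule encoded in `skew_direction` is the rule of thumb used in this example, not a general test for skewness, and the function name is illustrative.

```python
from statistics import mean, median, mode

data = [4, 7, 8, 2, 8, 8, 9, 2, 5, 7]
avg, mid, most = mean(data), median(data), mode(data)


def skew_direction(avg, mid, most):
    """Classify skewness from the ordering of mean, median, and mode."""
    if avg < mid < most:
        return "negatively skewed"
    if avg > mid > most:
        return "positively skewed"
    if avg == mid == most:
        return "symmetric"
    return "inconclusive"


print(avg, mid, most, skew_direction(avg, mid, most))
```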
The formula for the Pearsonian coefficient of skewness, denoted by SK, is
\[ SK = \frac{3(\mu - Md)}{\sigma} \]
where
µ - mean
Md - median
σ - standard deviation
Example 10: The following data represent the scores of 7 BS Applied Statistics students in a quiz:
X1 = 4, X2 = 7, X3 = 8, X4 = 2, X5 = 2, X6 = 9, X7 = 3.
\[ SK = \frac{3(5 - 4)}{2.73} = 1.0989 \approx 1.10 \]
Hence, the distribution is positively skewed.
Example 12: Using the data from the Frequency Distribution Table in example 6, compute the
coefficient of skewness.
\[ SK = \frac{3(19.89 - 20.06)}{7.896} = -0.0646 \approx -0.06 \]
Hence, the distribution is slightly negatively skewed.
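The Pearsonian coefficient of skewness, three times the difference between mean and median divided by the standard deviation, can be sketched in Python using the seven scores of Example 10. The function name is illustrative, and the population standard deviation is assumed.

```python
from statistics import mean, median, pstdev


def pearson_skewness(data):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(data) - median(data)) / pstdev(data)


scores = [4, 7, 8, 2, 2, 9, 3]
sk = pearson_skewness(scores)

print(mean(scores), median(scores), round(sk, 2))
```

A positive result indicates a tail extending to the right; a negative result indicates a tail extending to the left.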
Univariate descriptive statistics
Univariate descriptive statistics focus on only one variable at a time. It’s important to examine
data from each variable separately using multiple measures of distribution, central tendency and
spread. Programs like SPSS and Excel can be used to easily calculate these.
N 6
Mean 9.5
Median 7.5
Mode 3
Variance 84.3
Range 24
If you were to only consider the mean as a measure of central tendency, your impression of the
“middle” of the data set can be skewed by outliers, unlike the median or mode.
Likewise, while the range is sensitive to extreme values, you should also consider the standard
deviation and variance to get easily comparable measures of spread.
In bivariate analysis, you simultaneously study the frequency and variability of two variables to
see if they vary together. You can also compare the central tendency of the two variables before
performing further statistical tests.
Multivariate analysis is the same as bivariate analysis but with more than two variables.
Contingency table
In a contingency table, each cell represents the intersection of two variables. Usually, an
independent variable (e.g., gender) appears along the vertical axis and a dependent one
appears along the horizontal axis (e.g., activities). You read “across” the table to see how the
independent and dependent variables relate to each other.
Men 32 68 37 23 22
Women 36 48 43 83 25
Interpreting a contingency table is easier when the raw data is converted to percentages.
Percentages make each row comparable to the other by making it seem as if each group had
only 100 observations or participants. When creating a percentage-based contingency table,
you add the N for each independent variable on the end.
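A minimal sketch of this conversion in Python, using the two rows of counts shown above. The column labels were not preserved in the table, so the placeholder names `activity_1` to `activity_5` are hypothetical.

```python
# Placeholder column names (hypothetical; the original labels are not shown).
activities = ["activity_1", "activity_2", "activity_3", "activity_4", "activity_5"]

# Raw counts from the contingency table, one row per group.
table = {
    "Men": [32, 68, 37, 23, 22],
    "Women": [36, 48, 43, 83, 25],
}

row_totals = {}
row_percentages = {}
for group, counts in table.items():
    n = sum(counts)  # N for this group, shown at the end of the row
    row_totals[group] = n
    row_percentages[group] = [round(100 * c / n, 1) for c in counts]

for group in table:
    cells = "  ".join(f"{p:>5.1f}%" for p in row_percentages[group])
    print(f"{group:<6} {cells}  N={row_totals[group]}")
```

The row totals match the numbers of men and women in the simple frequency table for gender, which is a quick consistency check on the raw counts.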