Biosa

The document outlines the fundamentals of statistics, focusing on variables, their types, and methods for data presentation. It details qualitative and quantitative variables, various graphical representations, and the construction of frequency tables. Additionally, it covers measures of central tendency, including mean, median, and mode, along with their calculations and applications in data analysis.

Uploaded by

Jatin Kumar
Copyright
© All Rights Reserved
RAVI PRAKASH JHA

Deptt. of Community Medicine


Dr. BSA Medical College
LEARNING OBJECTIVES

• Know the definition of a variable
• Distinguish between the main types of variables and scales of measurement
• Summarize the variables
• Know what tabular and graphical display is appropriate for each type of variable
• Summary

Introduction
• Statistics – a set of tools for working with data
• Data – consists of discrete observations of
attributes or events that carry little meaning when
considered alone
• Information – data that has been transformed by summarizing or adjusting it so that it becomes meaningful
• Variable – any characteristic that can vary from one observation to another
• Quantitative – given in numbers (ratio and interval scales)
• Qualitative – given in words (nominal and ordinal scales)
QUALITATIVE VARIABLES

NOMINAL variables: sex, patient category …
– objects receive a name
– comparison of equality only
– no order in the data

ORDINAL variables: lab results, smoking status
– objects receive a name
– comparison of > or <
– inherent order in the data
– distance between classes is unequal
Common methods of data presentation

*Tables – simple, frequency distribution

* Bar charts – simple, multiple, component

* Histogram, frequency polygon, line diagram

* Frequency graphs – straight line, cumulative

* Pie charts

* Scatter diagrams

* Maps
Tabulation of data

• A table is a systematic arrangement of data in columns (vertical) and rows (horizontal)

• Simplifies presentation, facilitates comparisons

• Helps present huge data in an orderly manner within minimum space

BASIC STEP:

• Classify the data: bring different items having common characteristics together
Parts of a table

Table no: Title, (head note)

Stub heading Caption


Stub entries Body

Footnote, Source
Tallying
Frequency Table
Table: Distribution of test scores of study population (n = 50)

Test score    Frequency
0-19               2
20-39             11
40-59              9
60-79             11
80-99              8
100-119            7
120-139            2
Guidelines for constructing frequency tables

1. The classes must be "mutually exclusive": no element can belong to more than one class.
2. Include each and every class, even if its frequency is zero.
3. Make all classes the same width.
4. Target between 5 and 20 classes, depending on the range and number of data points.
5. Keep the limits as simple and as convenient as possible.
6. Avoid irrelevant decimal places (keep at most one decimal place).
7. A table usually has more rows than columns.
8. Write ‘0’ only for a true ZERO value; for unavailable data write NA or a DASH (-).
9. Dummy tables (skeleton only) are preferably made before entering actual data.
Diagrams - construction
1. Title: a suitable title of a few words, clearly indicating the main idea, placed at the top of or below the diagram
2. Width and height: no fixed rule for proportion
3. Scale: values should preferably be even numbers or multiples of five: 5, 10, 15 …
- 5 to 20 divisions at most
- Scales should specify the size of the unit
4. Footnote: to clarify certain points in the diagram
5. Index/key: to explain the lines, shades or colors used
6. Make the diagram as simple as possible
Simple bar chart
(A break in the scale is shown by //.)
Multiple bar chart
Stack bar chart (component bar)
Frequency histogram
How to draw a frequency histogram?
• The X-axis gives a continuous scale of the measurement variable.
• The Y-axis shows the frequency.
• For each class of the grouped data, a bar or rectangle is drawn. The width of the bar is the same as the class interval used.
• The total area of the bars in a histogram is proportional to the total frequency.
Frequency polygon
Graph
How to draw a straight-line frequency graph?

• Draw the horizontal axis (the X-axis). Mark off the scale using equal units. Use the mid-point of each interval to represent all measurements lying within that interval.
• Mark off the vertical axis to show the frequency, commonly as a number, percentage or rate.
• For each class of the grouped data, mark a point where the vertical (frequency) and the horizontal (scale) values intersect.
• Join the marked points with straight lines.
• Graphs can be used effectively to compare two frequency distributions, e.g. birth weight by sex.
Pie chart
Pie chart -preparation
Pie chart –preparation contd..
Scatter diagram
Summary: Choice of method for data presentation

Tabular data                                                Diagram
Frequency table, quantitative variable, one set of data     Histogram or frequency polygon
Frequency table, quantitative variable, two sets of data    Frequency polygon
Frequency table, categorical data                           Bar or pie chart

Source: WHO-Teaching health statistics, 1999; p42.


Summary –choice of method contd..
Methods                                 Type of data                              Remarks
Frequency table                         Grouped, continuous
Bar chart                               Nominal, ordinal, categorical, discrete
Histogram                               Continuous, grouped
Frequency polygon                       Continuous, grouped
Cumulative frequency polygon (Ogive)    Continuous                                Compare quartiles
Pie chart                               Nominal, ordinal, categorical, discrete
Scatter                                 Continuous/continuous
Line graph                              Continuous/continuous
Stem & leaf                             Continuous / grouped
Box & whisker                           Continuous
Maps
Branches of Statistical Method

Statistics
– Descriptive method
– Inference method

• Descriptive method – summarization of data as tables/diagrams, averages, deviations, correlation, etc.
• Inference method – generalization of the results obtained from a sample to the entire population.
Cont..
• Below is a list of the ages of cancer patients:
• 57 68 75 66 72 86 80 81 70 78 76 72 88 84 69 77 83
90 48 63 74 81 94 51 73 96 81 66 77 101
This list is not very helpful in telling us what exactly is going on. We can use diagrams to make sense of the data, such as a simple dot plot:

[Dot plot of the ages; horizontal axis from 40 to 110]
Cont…
• We could also look at the location of the distribution and the spread of the distribution.
• Location gives the ‘centre’ or ‘average’ of a set of data; spread gives the variation of the data about that centre.
• In this chapter we will find a single numerical value to summarise the location of the entire data set, that is, a single figure that tells us whereabouts the data are grouped (i.e. a ‘typical’ value to represent the data).
Measure of Central Tendency
• Measures of central tendency are statistical indices that may be taken as representative of the entire data set. They give us an idea of the VALUE around which all observations seem to concentrate.
• Common measures of central tendency:
• Mean
• Median
• Mode
Mean
• The mean of a set of observations is defined as the sum of all the observations divided by the number of observations.

• The formula for calculating the mean is:

X̄ = ∑ xi / n
where n = number of observations and xi = the observations
Computation of mean
• Ex1- Blood pressure measurements of 15 patients were taken during their first visit to the hospital. Find their mean blood pressure.
118, 108, 92, 132, 86, 95, 86, 89, 89, 95, 120, 118, 95, 95, 118
Mean X̄ = ∑ xi / n
∑ xi = 118 + 108 + 92 + 132 + 86 + 95 + 86 + ….. + 118 = 1536
n = 15
Mean = 1536/15 = 102.4 mm Hg
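The arithmetic in Ex1 can be checked with a few lines of Python. This is only a minimal sketch; the names `readings` and `mean` are illustrative and do not come from the slides.

```python
# Blood pressure readings (mm Hg) of the 15 patients in Ex1.
readings = [118, 108, 92, 132, 86, 95, 86, 89, 89, 95, 120, 118, 95, 95, 118]

def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

bp_mean = mean(readings)
print(sum(readings), len(readings), round(bp_mean, 1))  # 1536 15 102.4
```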
Practical 1
• The duration of stay in hospital after surgery of 10 patients is 9, 7, 8, 10, 9, 5, 6, 4, 11, 12. Find the mean duration of hospital stay.
• Ans- 8.1
Merits
• It is rigidly defined.
• Easy to calculate.
• Easy to understand.
• Based on all observations.
Demerits
• It is strongly affected by outliers or extreme observations. For example, the mean of the data set (1, 2, 2, 3) is 8/4 = 2. If the number 19 is substituted for 3, the data set becomes (1, 2, 2, 19) and the mean is 24/4 = 6. The mean 2 represents its data set well, whereas the mean 6 does not represent a set in which three of the four values are 2 or less.
• It cannot be calculated when observations are missing.
• It cannot be calculated for qualitative data.
Mean in case of frequency distribution

X        f      f*x
1        5        5
2        9       18
3       12       36
4       17       68
5       14       70
6       10       60
7        6       42
Total   73      299
X̄ = ∑ f*x / N
= 299/73
= 4.10
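The same calculation can be written as a short Python sketch. The function name `frequency_mean` is an illustrative choice, not something defined in the slides; it simply applies ∑ f*x / N to the table above.

```python
def frequency_mean(values, freqs):
    """Mean of a discrete frequency distribution: sum(f*x) / sum(f)."""
    total = sum(freqs)                                    # N
    weighted_sum = sum(x * f for x, f in zip(values, freqs))
    return weighted_sum / total

x = [1, 2, 3, 4, 5, 6, 7]
f = [5, 9, 12, 17, 14, 10, 6]
print(round(frequency_mean(x, f), 2))  # 4.1
```

Unrounded, 299/73 is 4.0959, which is 4.10 to two decimal places.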
Mean in case of continuous frequency
distribution
Marks    No. of students (f)    Mid-point (x)    f*x
0-10            12                    5            60
10-20           18                   15           270
20-30           27                   25           675
30-40           20                   35           700
40-50           17                   45           765
50-60            6                   55           330
Total          100                              2,800
• Mean = ∑ f*x / N
= 2800/100
= 28
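For class intervals, the only extra step is replacing each class by its mid-point. A minimal sketch, assuming each class is given as a (lower limit, upper limit, frequency) tuple; that representation and the name `grouped_mean` are illustrative choices.

```python
def grouped_mean(classes):
    """Mean of a continuous frequency distribution using class mid-points.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    """
    total = sum(f for _, _, f in classes)                 # N
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return weighted_sum / total

marks = [(0, 10, 12), (10, 20, 18), (20, 30, 27),
         (30, 40, 20), (40, 50, 17), (50, 60, 6)]
print(grouped_mean(marks))  # 28.0
```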
Practical 2
• Find the mean of the data set
• X:     2 5 6 3 4 10
  Freq.: 5 3 4 5 6 8
Ans – 5.42
Practical 3
• In a school health check-up, the Hb level was estimated in 300 children. The data are given in the table. Calculate the mean Hb level of the school children.

Hb level     Freq.
6-8 gm        150
8-10 gm       140
10-12 gm       10

• X̄ = ∑ f*m / ∑ f = 2420/300 = 8.07 gm (m = mid-point of each class)
Median
• The median is the middle value of a data set after arranging all the values in ascending or descending order.
• It divides the series into two equal parts.
Ex- 16, 17, 18, 19, 20, 21, 22; n = 7 (odd)
• The middle value is the (n+1)/2 th value of the series, i.e. the (7+1)/2 = 8/2 = 4th term, so 19 is the median.
• If we have an even number of cases, the median is calculated by taking the mean of the two central values.
• 16, 17, 18, 19, 20, 21, 22, 23; n = 8 (even)
• i.e. (19+20)/2 = 19.5
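The odd and even cases can be handled by one small function. This is a sketch with an illustrative name (`median`); it sorts the data first, as the definition requires.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd n: the (n+1)/2 th value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two central values

print(median([16, 17, 18, 19, 20, 21, 22]))       # 19
print(median([16, 17, 18, 19, 20, 21, 22, 23]))   # 19.5
```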
Practical 4
• Find the median of the data set
2,3,4,5,6,10,11
Ans - 5
Median in case of Freq.Distribution
• Find the median age of students
• Age:   4  5  6  7  8  9  10
• Freq.: 6 12 15 28 20 14   5
• Step 1- Compute the less-than cumulative frequency (c.f.)
Age    Freq.    c.f.
4        6        6
5       12       18
6       15       33
7       28       61
8       20       81
9       14       95
10       5      100
N = 100
Cont….
• Step 2- Find the value of N/2, i.e. N = 100, so N/2 = 100/2 = 50
• Step 3- Find the cumulative frequency just greater than N/2, i.e. 61 in our case
• Step 4- The corresponding value of x is the median, i.e. 7
Therefore the median is 7
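Steps 1 to 4 can be sketched in Python as follows. The function name `median_from_frequencies` is illustrative; the rule it applies is the one stated in the steps, namely the first cumulative frequency just greater than N/2.

```python
from itertools import accumulate

def median_from_frequencies(values, freqs):
    """Median of a discrete frequency distribution by the c.f. rule.

    values must be in ascending order; freqs[i] is the frequency of values[i].
    """
    total = sum(freqs)                        # N
    cumulative = accumulate(freqs)            # less-than cumulative frequencies
    for value, cf in zip(values, cumulative):
        if cf > total / 2:                    # first c.f. just greater than N/2
            return value

ages = [4, 5, 6, 7, 8, 9, 10]
freqs = [6, 12, 15, 28, 20, 14, 5]
print(median_from_frequencies(ages, freqs))  # 7
```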
Practical 5
• Problem- Calculate the median for the following frequency distribution
X: 1  2  3  4  5  6  7 8 9
F: 8 10 11 16 20 25 15 9 6

• Ans- 5
Median for Continuous Frequency distribution
• In the case of a continuous frequency distribution, the class corresponding to the c.f. just greater than N/2 is called the median class, and the value of the median is obtained by the following formula:
• Median = l + (h/f) * (N/2 − C)
• where l is the lower limit of the median class
• f is the frequency of the median class
• h is the width or magnitude of the median class
• C is the cumulative frequency of the class preceding the median class
• and N = ∑ f
Cont…
• Find the median marks of the students
Marks    Freq. (f)    c.f.
0-10        15         15
10-20       20         35
20-30       25         60
30-40       24         84
40-50       12         96
50-60       31        127
60-70       71        198
70-80       52        250
N = 250
Cont…
• Step 1- Calculate N/2 = 250/2 = 125
• Step 2- Find the c.f. just greater than N/2, i.e. 127
• Step 3- Then find the corresponding median class. Thus the median class is 50-60
So we have l = 50, h = 10, f = 31, N/2 = 125, C = 96
Md = l + (h/f) * (N/2 − C)
= 50 + (10/31) * (125 − 96)
= 59.35
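The median-class formula can be sketched in Python. The (lower limit, upper limit, frequency) tuples and the name `grouped_median` are illustrative assumptions; the formula is Median = l + (h/f) * (N/2 − C) as defined for the median class.

```python
def grouped_median(classes):
    """Median of a continuous frequency distribution.

    classes: list of (lower_limit, upper_limit, frequency), in ascending order.
    """
    total = sum(f for _, _, f in classes)     # N
    preceding_cf = 0                          # C, c.f. of the classes before the current one
    for lower, upper, f in classes:
        if preceding_cf + f > total / 2:      # c.f. just greater than N/2: median class
            width = upper - lower             # h
            return lower + (width / f) * (total / 2 - preceding_cf)
        preceding_cf += f

marks = [(0, 10, 15), (10, 20, 20), (20, 30, 25), (30, 40, 24),
         (40, 50, 12), (50, 60, 31), (60, 70, 71), (70, 80, 52)]
print(round(grouped_median(marks), 2))  # 59.35
```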
Practical 6
• Find the median systolic BP

SBP        Freq.
90-100       3
100-110      5
110-120      7
120-130     10
130-140     15
140-150     11
150-160      9
160-170      6
170-180      2

• Ans- 136
Merits
• Not distorted by extreme observations.
• Easy to calculate.
• It is a well-defined average, as it is the central position of the given data.
• The value of the median can also be located graphically.
Demerits
• It is not based on all observations.
• It is not suitable for further algebraic treatment.
• It cannot be calculated from raw data without first arranging them in order.
Mode
• The mode is the value that occurs most often in a set of observations.
• Ex- The gestational ages (weeks) of 10 new-borns are given: 36, 38, 40, 37, 42, 35, 39, 32, 40, 41
• We notice that there are two observations with a gestational age of 40 weeks, with the rest of the observations occurring only once, so the mode is 40 weeks.
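Counting how often each value occurs gives the mode directly. The sketch below uses `collections.Counter` from the Python standard library; the function name `modes` and the choice to return an empty list when no value repeats are illustrative conventions, not part of the slides.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                  # every value occurs once: no mode exists
        return []
    return sorted(v for v, c in counts.items() if c == highest)

gestational_age = [36, 38, 40, 37, 42, 35, 39, 32, 40, 41]
print(modes(gestational_age))  # [40]
```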
Practical -7

• Variable X takes the following values: 1, 3, 2, 4, 5, 3, 3, 6, 7, 4, 3, 9, 3. Find the mode of the data set.
Mode in case of discrete freq.distribution

So the mode is the value having the maximum frequency, i.e. 4
Mode for Continuous Frequency distribution

• In the case of a continuous frequency distribution, the mode is calculated by the formula
• Mode = l + h(f1 − f0)/(2f1 − f0 − f2)
• where l is the lower limit of the modal class
• h is the magnitude or width of the modal class
• f1 is the frequency of the modal class, and f0 and f2 are the frequencies of the classes preceding and succeeding the modal class respectively
Cont…
• Ex- Calculate the mode of the grouped data

We see that the maximum frequency is in the interval 130-140 mm Hg, therefore this is the modal class interval. We have
l = 130, h = 10, f1 = 15, f0 = 10, f2 = 11
So Mode = l + h(f1 − f0)/(2f1 − f0 − f2)
= 130 + 10(15 − 10)/(2*15 − 10 − 11)
= 135.6
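The modal-class formula can be sketched in Python. The grouped table for this example is not reproduced in the text, so the sketch assumes it is the systolic BP distribution of Practical 6, whose frequencies match f1 = 15, f0 = 10 and f2 = 11. The name `grouped_mode` and the tuple layout are illustrative.

```python
def grouped_mode(classes):
    """Mode of a continuous frequency distribution.

    classes: list of (lower_limit, upper_limit, frequency), in ascending order.
    Not defined when 2*f1 - f0 - f2 is zero.
    """
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                        # modal class (first maximum)
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                  # class preceding the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0     # class succeeding the modal class
    return lower + (upper - lower) * (f1 - f0) / (2 * f1 - f0 - f2)

# Assumed data: the SBP table of Practical 6.
sbp = [(90, 100, 3), (100, 110, 5), (110, 120, 7), (120, 130, 10), (130, 140, 15),
       (140, 150, 11), (150, 160, 9), (160, 170, 6), (170, 180, 2)]
print(round(grouped_mode(sbp), 1))  # 135.6
```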
Practical - 8
• Find the mode for the following distribution
Class: 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Freq.:   5     8     7    12    28    20    10    10

• Ans- 46.7
Merits/Demerits
• Merits –
• The mode is easy to calculate.
• It is not affected by extreme values.
• Easy to understand.
• Demerits –
• It is not based on all observations.
• It does not exist in many cases.
• Ex - 19, 16, 14, 15, 20, 45 (every value occurs only once, so there is no mode)
Summary of when to use the mean, median and mode

• The mean is usually the best measure of central tendency to use when your data distribution is continuous and symmetrical, such as when your data is normally distributed.

• The mode will be the best measure of central tendency (as it is the only one appropriate to use) when dealing with nominal data.

• The median is usually preferred to other measures of central tendency when your data set is skewed (i.e., forms a skewed distribution) or you are dealing with ordinal data.
Partition values
• These are the values which divide the series into a number of equal parts.
• Quartiles - the three points which divide the series into four equal parts are called quartiles.
• Quartiles tell us about the spread of a data set by breaking the data set into quarters, just as the median breaks it in half.
• The first, second and third points are known as the 1st, 2nd and 3rd quartiles respectively (Q1, Q2, Q3).
• Q1 is the value which has 25% of the observations before it and 75% after it.
• Q2 coincides with the median.
• Q3 is the value which has 75% of the observations before it and 25% after it.
Cont..
• Deciles - the nine points which divide the data set into 10 equal parts.
• D1, D2, ……, D9 are the first, second, …, ninth deciles respectively.
• Percentiles – the ninety-nine points which divide the data set into 100 equal parts.
• P1, P2, P3, ……, P99 are the 1st, 2nd, ……, 99th percentiles respectively.
Ex-
• Find Q1, Q2, Q3, D4, P27 (N = 256)
Q2: N/2 = 128. The c.f. just greater than 128 is 167, thus the median or Q2 is 4.
Similarly Q1: N/4 = 64. The c.f. just greater than 64 is 95, hence Q1 is 3.
Q3: 3N/4 = 192. The c.f. just greater than 192 is 219, hence Q3 is 5.
D4: 4N/10 = 102.4. The c.f. just greater than 102.4 is 167, hence D4 is 4.
P27: 27N/100 = 69.12. The c.f. just greater than 69.12 is 95, hence P27 is 3.
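The same cumulative-frequency rule works for any partition value. The frequency table for this example is not reproduced in the text, so the sketch below instead reuses the age distribution from the median example (ages 4 to 10, N = 100) as stand-in data. The function name `partition_value` and its parameters `k` and `parts` are illustrative.

```python
from itertools import accumulate

def partition_value(values, freqs, k, parts):
    """k-th of the points dividing a discrete frequency distribution into `parts` equal parts.

    parts=4 gives quartiles, parts=10 deciles, parts=100 percentiles (0 < k < parts).
    values must be in ascending order.
    """
    total = sum(freqs)                        # N
    position = k * total / parts              # e.g. N/4 for Q1, 4N/10 for D4
    for value, cf in zip(values, accumulate(freqs)):
        if cf > position:                     # c.f. just greater than the position
            return value

# Stand-in data: the age distribution used in the median example.
ages = [4, 5, 6, 7, 8, 9, 10]
freqs = [6, 12, 15, 28, 20, 14, 5]
q1 = partition_value(ages, freqs, 1, 4)
q2 = partition_value(ages, freqs, 2, 4)
q3 = partition_value(ages, freqs, 3, 4)
d4 = partition_value(ages, freqs, 4, 10)
p27 = partition_value(ages, freqs, 27, 100)
print(q1, q2, q3, d4, p27)  # 6 7 8 7 6
```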
MEASURE OF DISPERSION
Dispersion or Spread or Variability
• A measure of central tendency alone cannot give an idea of how concentrated the data are. Another kind of measure gives an idea of the spread of the data about their central value; it is known as a measure of dispersion, sometimes also called a measure of spread.
• Definition- “The degree to which numerical data tend to spread about an average value is called the dispersion of the data.”
• If there were no variation and all individuals were alike, there would be no need for any sample; observing any one subject would tell us everything about the rest of the phenomenon being studied.
Cont…
• Here are some examples for better understanding.
• Consider the series (i) 7, 8, 9, 10, 11 and (ii) 3, 6, 9, 12, 15. The mean of both data sets is 9.
• Both data sets have the same mean value, so what is the difference between them?
• Given only the mean, we cannot tell whether it is the mean of the first data set or of the second. Thus measures of central tendency are inadequate to give us a complete idea about the data.
• Thus we require an index to measure the variation in a data set.
Measures of Dispersion
• Range
• Quartile deviation
• Mean deviation
• Standard Deviation
• Coefficient of variation
Range
• The range is the difference between the highest and lowest observations in a data set and is the simplest measure of spread. So we calculate the range as:
Range = maximum value – minimum value
For example, let us consider the following data set:

• 23 56 45 65 59 55 62 54 85
• Range = 85 − 23 = 62
The range is strongly affected by outliers.
Cont…

Properties:
• It is a simple and crude measure
• It depends on two observations only
• It is not a reliable measure
Quartile deviation

This is similar to the range and is also called the semi-interquartile range. If Q1 and Q3 are the first and third quartiles of a distribution respectively, then the quartile deviation is defined as
(Q3 − Q1)/2
where
Q3: third quartile
Q1: first quartile

[Diagram: number line marked Smallest, Q1, Q2, Q3, Largest]
Cont..
• Properties –
• It is a better measure than the range
• It uses the middle 50% of the data and ignores the other 50% (25% on both sides)
• It cannot be regarded as a reliable measure
(n = odd)
• Example 1: Find the first and third quartiles of the data set {3, 7, 8, 5, 12, 14, 21, 13, 18}.
• First, we write the data in increasing order: 3, 5, 7, 8, 12, 13, 14, 18, 21. The median is 12 (the middle, 5th, entry).
• The lower half of the data is: {3, 5, 7, 8}.
• The first quartile, Q1, is the median of {3, 5, 7, 8}.
• Since there is an even number of values, we need the mean of the middle two values to find the first quartile:
• Q1 = (5 + 7)/2 = 6.
• Similarly, the upper half of the data is: {13, 14, 18, 21}, so
• Q3 = (14 + 18)/2 = 16.
Cont.. (n = even)
• Find the first and third quartiles of the set {3, 7, 8, 5, 12, 14, 21, 15, 18, 14}.
• First, we write the data in increasing order: 3, 5, 7, 8, 12, 14, 14, 15, 18, 21.
• The median is 13 (it is the mean of 12 and 14, the pair of middle entries).
• Therefore, the lower half of the data is: {3, 5, 7, 8, 12}.
• Notice that 12 is included in the lower half since it is below the median value.
• Then Q1 = 7 (there are five values in the lower half, so the middle value is the median). Similarly, the upper half of the data is: {14, 14, 15, 18, 21}, so Q3 = 15.
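The "median of each half" method used in both examples can be sketched in Python. The function names are illustrative. Note that this is only one convention for quartiles; statistical software often interpolates and may give slightly different values.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 by taking the median of the lower and upper halves.

    For odd n the middle value itself is left out of both halves.
    """
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]           # values below the median position
    upper = s[(n + 1) // 2:]      # values above the median position
    return median(lower), median(s), median(upper)

q1, q2, q3 = quartiles([3, 7, 8, 5, 12, 14, 21, 13, 18])
quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)  # 6.0 12 16.0 5.0
```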
Mean Deviation
Definition:
It is the average of the deviations of all observations from a central value, normally taken to be the mean, median or mode. Only the absolute values of the deviations are used.
Mean deviation: $\text{MD} = \frac{1}{N}\sum |X_i - A|$
In the case of a frequency distribution: $\text{MD} = \frac{1}{N}\sum f_i\,|X_i - A|$
where
N = total number of observations
A = some central/arbitrary value
$X_i$ = observations/data points
| | = absolute value or ‘mod’ of ($X_i - A$)
Common properties

• It depends on all observations.

• It is a better measure than the range or quartile deviation.

• It is a minimum when measured from the median (A = median).

• Ignoring the sign of the differences creates artificiality.

• It is unsuitable for further mathematical treatment.
Example -1
• Find the mean deviation of 2, 5, 7, 9, 10.
• Step 1 – find the sum of the data = 2 + 5 + 7 + 9 + 10 = 33
• Step 2 – find the mean = 33/5 = 6.6
• Step 3 – subtract the mean from each value and take absolute values
• ∑|Xi − A| = |2 − 6.6| + |5 − 6.6| + |7 − 6.6| + |9 − 6.6| + |10 − 6.6|
= |−4.6| + |−1.6| + |0.4| + |2.4| + |3.4|
= 4.6 + 1.6 + 0.4 + 2.4 + 3.4 = 12.4
Now MD = ∑|Xi − A| / N
= 12.4/5 = 2.48
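A short Python sketch of the mean deviation for the data set 2, 5, 7, 9, 10 as listed. These five values sum to 33, so their mean is 6.6 and the mean deviation about the mean is 12.4/5 = 2.48. The function name `mean_deviation` and its `center` parameter are illustrative.

```python
def mean_deviation(values, center):
    """Average absolute deviation of the values from the chosen central value A."""
    return sum(abs(x - center) for x in values) / len(values)

data = [2, 5, 7, 9, 10]
mean = sum(data) / len(data)                 # 33/5 = 6.6
md_about_mean = mean_deviation(data, mean)
md_about_median = mean_deviation(data, 7)    # 7 is the median of the data
print(round(md_about_mean, 2), md_about_median)  # 2.48 2.4
```

The value about the median (2.4) is smaller than the value about the mean (2.48), in line with the property that the mean deviation is a minimum when measured from the median.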
Practical
• The respiratory rates of 10 asthma patients are given. Find the mean deviation.
• 18, 20, 19, 20, 18, 19, 16, 25, 23, 17
MD for Frequency distribution
• Find the MD of the data set about the mean
X: 1 3 5  7 9 11 13 15
f: 3 3 4 14 7  4  3  4
1- Find the mean = (3*1 + 3*3 + 4*5 + 14*7 + 7*9 + 4*11 + 3*13 + 4*15) / (3+3+4+14+7+4+3+4) = 336/42 = 8
2- Find ∑ fi|Xi − A| = 3*|1−8| + 3*|3−8| + 4*|5−8| + 14*|7−8| + 7*|9−8| + 4*|11−8| + 3*|13−8| + 4*|15−8| = 21 + 15 + 12 + 14 + 7 + 12 + 15 + 28 = 124
3- N = ∑ f = 3+3+4+14+7+4+3+4 = 42
4- MD = ∑ fi|Xi − A| / N
= 124/42 = 2.95
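The four steps can be sketched in Python for any discrete frequency distribution. The function name `frequency_mean_deviation` is illustrative; it takes the mean as the central value A, as in this example.

```python
def frequency_mean_deviation(values, freqs):
    """Mean deviation about the mean for a discrete frequency distribution."""
    total = sum(freqs)                                            # N = sum of f
    mean = sum(f * x for x, f in zip(values, freqs)) / total      # step 1
    weighted_abs = sum(f * abs(x - mean) for x, f in zip(values, freqs))  # step 2
    return weighted_abs / total                                   # step 4

x = [1, 3, 5, 7, 9, 11, 13, 15]
f = [3, 3, 4, 14, 7, 4, 3, 4]
print(round(frequency_mean_deviation(x, f), 2))  # 2.95
```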
Practical
• Find MD of data set

• Ans-4.8
Mean Deviation for continuous class
SBP        Frequency f    Mid-point x    f*x     |Xi − X̄|    f*|Xi − X̄|
90-100          3              95         285      40.6        121.8
100-110         5             105         525      30.6        153
110-120         7             115         805      20.6        144.2
120-130        10             125        1250      10.6        106
130-140        15             135        2025       0.6          9
140-150        11             145        1595       9.4        103.4
150-160         9             155        1395      19.4        174.6
160-170         6             165         990      29.4        176.4
170-180         2             175         350      39.4         78.8
Sum            68                        9220                  1067.2
Cont..
• 1- Mean = ∑ f*x / N where N = ∑ f
• Mean = 9220/68 = 135.6
• 2- MD = ∑ fi|Xi − X̄| / N
= 1067.2/68
= 15.7
Practical
• Find the MD of the data set
Class interval: 2-4 4-6 6-8 8-10
Freq.:           5   6   3   1

• Ans- 1.33 (about the mean, which is 5)
Standard deviation
• It was first introduced by Karl Pearson. SD is the most useful and popular measure of dispersion.
• It is defined as the square root of the mean of the squared deviations of the observations from their mean. It is also known as the root mean square deviation.
• It is denoted by σ.
• Variance (σ²) is the square of SD.
• In the case of ungrouped data: $\sigma = \sqrt{\frac{1}{N}\sum (X_i - \bar{X})^2}$
• In the case of a frequency distribution: $\sigma = \sqrt{\frac{1}{N}\sum f_i (X_i - \bar{X})^2}$
Cont…
Ex – Find the SD of the data set: 10, 11, 17, 25, 7, 13, 21, 10, 12, 14
Sol-
X      (X − X̄)    (X − X̄)²
10       −4          16
11       −3           9
17        3           9
25       11         121
7        −7          49
13       −1           1
21        7          49
10       −4          16
12       −2           4
14        0           0
Mean = 14          ∑(Xi − X̄)² = 274

Where:
Xi = observations
X̄ = mean of observations
fi = frequency of observations
N = number of observations
Cont..
$\sigma = \sqrt{\frac{1}{N}\sum (X_i - \bar{X})^2}$
= sqrt(274/10)
= sqrt(27.4) = 5.23
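The calculation can be sketched in Python using the ten values tabulated in the worked example (mean 14, sum of squared deviations 274). The function name `standard_deviation` and its `sample` flag are illustrative; `sample=True` divides by n − 1 instead of N, which is an assumption about what a "sample" SD means here.

```python
import math

def standard_deviation(values, sample=False):
    """Root mean square deviation from the mean.

    sample=False divides by N (population SD); sample=True divides by n - 1.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(squared_deviations / divisor)

data = [10, 11, 17, 25, 7, 13, 21, 10, 12, 14]
sd = standard_deviation(data)
print(round(sd, 2))  # 5.23
```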
Practical
• Find the SD of the data set
• 18, 20, 19, 20, 18, 19, 16, 25, 23, 17
• Ans- SD(s) = 2.72 (sample SD, divisor n − 1)
• SD(p) = 2.58 (population SD, divisor N)
SD in case of Frequency distribution
• Calculate the SD of the number of claims per policy
No. of claims X    Frequency f    f*x    (Xi − Mean)    (Xi − Mean)²    f*(Xi − Mean)²
1                       3           3     (1 − 2.95)       3.8025          11.4075
2                       4           8     (2 − 2.95)       0.9025           3.61
3                       6          18     (3 − 2.95)       0.0025           0.015
4                       5          20     (4 − 2.95)       1.1025           5.5125
5                       2          10     (5 − 2.95)       4.2025           8.405
                     ∑f = 20    ∑f*x = 59                                  28.95
Mean = 59/20 = 2.95
SD = SQRT(28.95/20) = 1.20
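The table can be reproduced with a short Python sketch. The function name `frequency_sd` is illustrative; it applies the frequency-distribution form of the SD with divisor N = ∑f.

```python
import math

def frequency_sd(values, freqs):
    """Population SD of a discrete frequency distribution."""
    total = sum(freqs)                                            # N = sum of f
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    weighted_squares = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return math.sqrt(weighted_squares / total)

claims = [1, 2, 3, 4, 5]
policies = [3, 4, 6, 5, 2]
claims_sd = frequency_sd(claims, policies)
print(round(claims_sd, 2))  # 1.2
```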
Practical
• Find SD

• SD(s)= 9.2
• SD (p)= 8.6
SD in case of Continuous freq.distri.
SBP        f     Mid-point x    (Xi − Mean)       (Xi − Mean)²    fi*(Xi − Mean)²
90-100      3         95        (95 − 127.15)       1033.6225        3100.8675
100-110     5        105        (105 − 127.15)       490.6225        2453.1125
110-120     7        115        (115 − 127.15)       147.6225        1033.3575
120-130    10        125        (125 − 127.15)         4.6225          46.2250
130-140    15        135        (135 − 127.15)        61.6225         924.3375
140-150    11        145        (145 − 127.15)       318.6225        3504.8475
Mean = 6485/51 = 127.15                              Sum = 11062.75
SD = SQRT(11062.75/51)
= 14.73
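For class intervals the mid-points stand in for the observations. The sketch below recomputes this example with the same mean (6485/51, about 127.15) used for every class, which gives an SD of about 14.73. The function name `grouped_sd` and the tuple layout are illustrative.

```python
import math

def grouped_sd(classes):
    """Population SD of a continuous frequency distribution using class mid-points.

    classes: list of (lower_limit, upper_limit, frequency).
    """
    total = sum(f for _, _, f in classes)                          # N
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [f for _, _, f in classes]
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    weighted_squares = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return math.sqrt(weighted_squares / total)

sbp = [(90, 100, 3), (100, 110, 5), (110, 120, 7),
       (120, 130, 10), (130, 140, 15), (140, 150, 11)]
sbp_sd = grouped_sd(sbp)
print(round(sbp_sd, 2))  # 14.73
```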
Practical
• Find the SD of the data set
CI:    0-10 10-20 20-30 30-40
Freq.:   2     1     1     3

• Ans- SD(p) = 12.8


Properties
• It depends on all observations

• It is least affected by sampling fluctuations

• It is suitable for further mathematical treatment

• It forms the basis of many statistical techniques

• It is difficult to compute and understand

• It gives more importance to extreme values

Coefficient of variation
• CV is defined as the SD expressed as a percentage of the mean.
• CV = (σ / mean) * 100
• Thus, to obtain the CV we must already have estimates of the mean and SD of the data. It does not have any unit.
• CV is used for comparing the variability of different data sets.
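As an illustration of comparing variability, the sketch below computes the CV for the two series from the dispersion example, (i) 7, 8, 9, 10, 11 and (ii) 3, 6, 9, 12, 15, which share the mean 9. The function name `coefficient_of_variation` is illustrative, and the population SD (divisor N) is assumed.

```python
import math

def coefficient_of_variation(values):
    """SD expressed as a percentage of the mean (population SD, divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100

series_i = [7, 8, 9, 10, 11]
series_ii = [3, 6, 9, 12, 15]
cv_i = coefficient_of_variation(series_i)
cv_ii = coefficient_of_variation(series_ii)
print(round(cv_i, 1), round(cv_ii, 1))  # 15.7 47.1
```

Both series have the same mean, but the second has about three times the relative variability of the first.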
THANKS
