
PSIT COLLEGE OF HIGHER EDUCATION, KANPUR (KN 162)

Business Statistics [F010102T(A)]


Unit-1
Syllabus
Introduction: Concept, features, significance & limitations of statistics, Types of data, Classification &
Tabulation, Frequency distribution & graphical representation.

Concept of statistics:
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It involves
methods for summarizing data (descriptive statistics) and making inferences or predictions about a
population based on a sample (inferential statistics). This field includes concepts such as probability,
hypothesis testing, regression analysis, correlation, and data visualization, and it is used to make informed
decisions in various domains.
The word "Statistics" is used in both a singular and a plural sense. Used in the plural, statistics means a set of numerical data; used in the singular, it means the science of statistical methods, embodying the theory and techniques used for collecting, analyzing and drawing inferences from numerical data.

Definitions of statistics by some well-known figures in the field:

1. Sir Ronald A. Fisher:


o "The science of statistics is the study of the use of numerical data for the
understanding and control of phenomena both of a natural and of a human kind."
2. John Tukey:
o "The best thing about being a statistician is that you get to play in everyone’s
backyard. Statistics is the science of learning from data, and of measuring,
controlling, and communicating uncertainty."
3. W. Edwards Deming:
o "Statistics is a method for analyzing the data, but it is also a method for gathering
data and for judging the trustworthiness of the conclusions drawn from the data."
4. David J. Hand:
o "Statistics is the technology of extracting meaning from data."
5. George Udny Yule and Maurice G. Kendall:
o "Statistics may be regarded as (i) the study of populations, (ii) the study of variation,
and (iii) the study of methods of the reduction of data."
6. “Statistics may be called the science of counting.” —Bowley A.L.
7. “Statistics may rightly be called the science of averages.” —Bowley A.L.
8. “Statistics is the science of the measurement of social organism, regarded as a whole in all its
manifestations.” —Bowley A.L.
9. “Statistics is the science of estimates and probabilities.” —Boddington
10. “Statistics may be defined as the science of collection, presentation, analysis and interpretation of
numerical data.” —Croxton and Cowden
11. “The science and art of handling aggregate of facts—observing, enumeration, recording,
classifying and otherwise systematically treating them.”—Harlow
12. “Statistics is the science which deals with the methods of collecting, classifying, presenting,
comparing and interpreting numerical data collected to throw some light on any sphere of
enquiry.”—Seligman


Features of statistics:
Statistics has several key features that distinguish it and make it a vital tool for data analysis:
1. Data Collection: Systematic gathering of data from various sources, using methods like surveys,
experiments, or observational studies.
2. Data Organization: Structuring data in a meaningful way, often using tables, charts, and graphs to make
it easier to understand.
3. Summarization: Condensing large sets of data into summary measures, such as mean, median, mode,
variance, and standard deviation, to provide an overview of the data.
4. Analysis: Applying various statistical methods and models to examine relationships, trends, and patterns
within the data.
5. Interpretation: Drawing conclusions from the data analysis, determining what the results mean in
context, and understanding their implications.
6. Inference: Making predictions or generalizations about a population based on a sample, often using
techniques like hypothesis testing and confidence intervals.
7. Probability: Assessing the likelihood of different outcomes and events, which forms the basis for
inferential statistics.
8. Modeling: Creating mathematical representations of real-world processes, such as regression models, to
predict future observations or explain relationships between variables.
9. Communication: Presenting the findings of statistical analyses in a clear and effective manner, often
through reports, presentations, or visualizations like charts and graphs.
10. Decision-Making: Using statistical evidence to inform decisions in various fields, including science,
business, healthcare, and public policy.

Significance of Statistics:
The significance of statistics lies in its ability to transform raw data into meaningful information that can
be used to make informed decisions and understand complex phenomena. Here are some key points
highlighting its importance:
1. Informed Decision-Making: Statistics provide a basis for making decisions based on data rather than
intuition or guesswork. This is crucial in business, healthcare, public policy, and other fields.
2. Understanding Relationships and Trends: Through statistical analysis, one can uncover relationships
between variables and identify trends over time, helping to understand underlying patterns and causes.
3. Prediction and Forecasting: Statistical models can predict future events and trends based on historical
data. This is essential in areas like economics, weather forecasting, and risk management.

4. Quality Control and Improvement: In manufacturing and other industries, statistics are used to monitor
and improve processes, ensuring quality and efficiency.
5. Hypothesis Testing: Statistics allow for testing hypotheses and determining the validity of claims or
theories, which is fundamental in scientific research.
6. Data Summarization and Simplification: Statistics help summarize large volumes of data into simpler
forms, such as averages and percentages, making it easier to comprehend and communicate information.
7. Identifying and Quantifying Uncertainty: Statistical methods quantify the uncertainty inherent in data
and help in making more accurate and reliable conclusions.
8. Resource Allocation: In public health, economics, and other fields, statistics guide the efficient
allocation of resources by identifying areas of greatest need or impact.
9. Policy Formulation and Evaluation: Governments and organizations use statistical data to formulate,
implement, and evaluate policies and programs.
10. Enhanced Understanding of Social Issues: Statistics provide insights into social issues like crime
rates, educational outcomes, and healthcare access, enabling targeted interventions and solutions.

Scope of Statistics:
1. Statistics in Planning
2. Statistics in State
3. Statistics in Economics
4. Statistics in Business and Management
5. Statistics in Accountancy and Auditing
6. Statistics in Industry
7. Statistics in Physical Sciences
8. Statistics in Social Sciences
9. Statistics in Biology and Medical Sciences

Limitations of statistics:
While statistics are powerful tools for analyzing data and informing decisions, they have several limitations
that must be considered:
1. Data Quality: The accuracy and reliability of statistical conclusions depend heavily on the quality of the
data collected. Poor data quality can lead to misleading results.
2. Sampling Bias: If the sample used for analysis is not representative of the population, the results may
be biased and not generalizable to the whole population.
3. Misinterpretation: Statistical results can be misinterpreted or misrepresented, leading to incorrect
conclusions. This is often due to a lack of statistical knowledge or intentional manipulation.
4. Causality vs. Correlation: Statistics can identify correlations between variables but cannot prove
causation. Determining causality requires careful experimental design and additional evidence.

5. Overfitting: In complex models, there is a risk of overfitting, where the model fits the sample data very
well but fails to generalize to new data.
6. Assumptions: Many statistical methods rely on assumptions (e.g., normality, independence) that may
not hold true in all situations. Violating these assumptions can lead to invalid results.
7. Limited Scope: Statistics can only analyze measurable data and may not capture qualitative aspects or
context that are important for a full understanding.
8. Temporal Limitations: Statistical analysis is often based on data from a specific period and may not
account for changes over time. Historical data may not predict future trends accurately.
9. Ethical Issues: The use of statistics can raise ethical concerns, especially when data privacy is
compromised or results are used to justify biased policies.
10. Complexity: Advanced statistical methods can be complex and difficult to understand, making them
less accessible to non-experts.
11. Misleading Visualizations: Data visualizations can be manipulated to emphasize certain aspects while
downplaying others, leading to biased interpretations.
12. Dependence on Software: Statistical analysis often relies on software tools, and errors in the software
or incorrect use of these tools can affect the results.

DISTRUST OF STATISTICS:
The improper use of statistical tools by unscrupulous people with an improper statistical bent of mind
has led to public distrust in Statistics. By this we mean that the public loses its belief, faith and confidence
in the science of Statistics and starts condemning it. Such irresponsible, inexperienced and dishonest
persons, who use statistical data and statistical techniques to fulfill their selfish motives, have discredited
the science of Statistics and given rise to some very pointed comments, some of which are stated below:
(i) An ounce of truth will produce tons of Statistics.
(ii) Statistics can prove anything.
(iii) Figures do not lie. Liars figure.
(iv) Statistics is an unreliable science.
(v) There are three types of lies – lies, damned lies and Statistics, wicked in the order of their naming; and
so on.
Some of the reasons for the above remarks may be enumerated as follows:
(a) Figures are innocent and believable, and the facts based on them are psychologically more
convincing. But it is a pity that figures do not have the label of quality on their face.
(b) Arguments are put forward to establish certain results which are not true by making use of

inaccurate figures or by using incomplete data, thus distorting the truth.
(c) Though accurate, the figures might be moulded and manipulated by dishonest and unscrupulous
persons to conceal the truth and present a wrong and distorted picture of the facts to the public for
personal and selfish motives.
If Statistics and its tools are misused, the fault does not lie with the science of Statistics; rather, it is the
people who misuse it who are to blame.
Utmost care and precautions should be taken for the interpretation of statistical data in all its
manifestations. “Statistics should not be used as a blind man uses a lamp-post for support instead of
illumination.”

Data:
In the context of statistics and information technology, data refers to distinct pieces of information, often
in the form of facts, observations, measurements, or descriptions of things. Data can be qualitative or
quantitative and can vary widely in complexity and format. Here are some key characteristics and types of
data:

Characteristics of Data:
1. Raw or Processed: Data can be raw (original, unprocessed) or processed (organized, analyzed,
interpreted).
2. Structured or Unstructured: Structured data is organized in a predefined format (e.g., tables),
while unstructured data lacks a specific format (e.g., text, images).
3. Continuous or Discrete: Continuous data can take any value within a range (e.g., height, weight),
while discrete data consists of distinct, countable values (e.g., number of children).
4. Qualitative or Quantitative: Qualitative data describes qualities or characteristics (e.g., colors,
opinions), while quantitative data consists of numerical values (e.g., measurements, counts).
5. Primary or Secondary: Primary data is collected firsthand for a specific purpose, while secondary
data is obtained from existing sources (e.g., databases, literature).

Types of data:
In statistics, data types are categorized based on their measurement levels and the kind of information they
represent. Here are the primary types of data:
1. Quantitative Data: Numeric data that quantifies an attribute.
o Discrete Data: Countable values, such as the number of students in a class.

o Continuous Data: Measurable values that can take any value within a range, such as height
or weight.
2. Qualitative Data: Non-numeric data that categorizes attributes.
o Nominal Data: Categories with no inherent order, such as colors (red, blue, green).
o Ordinal Data: Categories with a meaningful order, but no fixed intervals, such as ratings
(poor, fair, good).

Classification of Data:
Classification involves organizing data into categories or groups based on shared characteristics. This
process makes it easier to analyze and interpret data. The main types of classification are:
1. Qualitative Classification: Categorizes data based on attributes or qualities.
o Nominal Classification: Groups data without any order. Example: Gender (male, female).
o Ordinal Classification: Groups data with a meaningful order but no fixed interval.
Example: Customer satisfaction levels (low, medium, high).
2. Quantitative Classification: Categorizes data based on numerical values.
o Discrete Classification: Groups countable data. Example: Number of students in different
classes.
o Continuous Classification: Groups measurable data within intervals. Example: Height
ranges (150-160 cm, 160-170 cm).

Tabulation of Data
Tabulation is the process of summarizing data in a table format to facilitate analysis and interpretation.
Tables consist of rows and columns and can be simple or complex depending on the data. Key types of
tables include:

1. Simple Table: Presents data on a single characteristic. Example:

Age Group | Number of People
0-10      | 15
11-20     | 20
21-30     | 25


2. Double Table (Two-Way Table): Presents data on two characteristics simultaneously. Example:

Age Group | Male | Female
0-10      | 8    | 7
11-20     | 12   | 8
21-30     | 15   | 10

3. Complex Table: Presents data on more than two characteristics, often using multiple sub-tables
or additional rows/columns.
4. Frequency Distribution Table: Shows the frequency of different outcomes in a dataset.
Example:

Height Range (cm) | Frequency
150-159           | 5
160-169           | 8
170-179           | 12

Steps in Classification and Tabulation


1. Collect Data: Gather raw data from surveys, experiments, or other sources.
2. Classify Data: Organize data into appropriate categories or groups based on shared characteristics.
3. Design Table: Determine the structure of the table, including rows, columns, headings, and sub-
headings.
4. Enter Data: Populate the table with classified data.
5. Review and Interpret: Ensure accuracy and use the table to analyze and interpret the data
effectively.
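To make these steps concrete, here is a minimal Python sketch; the survey records and the age_group helper are hypothetical, invented only for illustration. It classifies ages into decade-wide groups and tabulates them first as a simple table and then as a two-way table by gender.

from collections import Counter

# Hypothetical raw survey records (age, gender); invented for illustration only.
records = [
    (22, "Male"), (45, "Female"), (36, "Male"), (27, "Female"), (34, "Male"),
    (50, "Female"), (41, "Male"), (29, "Female"), (33, "Male"), (38, "Female"),
]

def age_group(age):
    """Classify an age into a decade-wide class interval such as '20-29'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

# Step 2 (classify): map every record to its age group.
# Steps 3-4 (design table, enter data): count records per group.
simple_table = Counter(age_group(age) for age, _ in records)             # simple table
two_way = Counter((age_group(age), gender) for age, gender in records)   # double table

print("Age Group | Frequency")
for group in sorted(simple_table):
    print(f"{group:9} | {simple_table[group]}")

print("\nAge Group | Male | Female")
for group in sorted(simple_table):
    print(f"{group:9} | {two_way[(group, 'Male')]:4} | {two_way[(group, 'Female')]:6}")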

Frequency Distribution:

Frequency distribution is a way to organize data into categories or intervals so that the frequency (count)
of data points in each category or interval can be easily seen. It helps to simplify large datasets and to
highlight patterns and trends.

Steps to Create a Frequency Distribution


1. Collect Data: Gather all the raw data you have.
2. Determine the Range: Find the difference between the maximum and minimum values.
3. Choose the Number of Intervals: Decide how many categories or intervals you want. For
continuous data, these are usually equal-sized bins.
4. Calculate Interval Width: For continuous data, divide the range by the number of intervals to find
the width of each interval.
5. Tally the Data: Count how many data points fall into each interval or category.
6. Create the Table: Make a table with intervals or categories and their corresponding frequencies.
Example of Frequency Distribution Table for Continuous Data (Height in cm):

Height Range (cm) | Frequency
150-159           | 5
160-169           | 8
170-179           | 12
180-189           | 6

Example of Frequency Distribution Table for Discrete Data (Number of Books Read):

Number of Books | Frequency
0-1             | 4
2-3             | 6
4-5             | 9
6-7             | 5
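The same steps can be scripted. The following Python sketch assumes a hypothetical list of heights and a choice of four equal-width intervals; it derives the interval width from the range and tallies the frequency of each interval.

# Hypothetical heights in cm; four equal-width intervals chosen for illustration.
heights = [152, 155, 158, 151, 157, 160, 162, 165, 168, 163, 166, 169, 161,
           171, 173, 175, 178, 172, 174, 176, 177, 179, 170, 175, 178,
           181, 183, 185, 188, 182, 186]

k = 4                                     # step 3: number of intervals
low, high = min(heights), max(heights)    # step 2: range = high - low
width = -(-(high - low + 1) // k)         # step 4: interval width (ceiling division)

# Step 5: tally how many heights fall into each interval.
frequencies = [0] * k
for h in heights:
    index = min((h - low) // width, k - 1)
    frequencies[index] += 1

# Step 6: print the frequency-distribution table.
print("Height Range (cm) | Frequency")
for i, f in enumerate(frequencies):
    start = low + i * width
    print(f"{start}-{start + width - 1} | {f}")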


Graphical Representation
Graphical representation involves visualizing data using charts and graphs to make it easier to understand
patterns and relationships. Common methods include:
1. Histogram: Used for continuous data, a histogram displays data using adjacent bars. Each bar
represents an interval, and the height of the bar corresponds to the frequency of data points in that
interval.
Steps:
o Divide the data range into equal intervals (bins).
o Draw bars for each interval where the height represents the frequency.
Example Histogram for Heights:
2. Bar Chart: Used for discrete or categorical data, a bar chart displays data with rectangular bars
where the length of the bar is proportional to the value it represents.
Steps:
o Draw bars for each category where the height represents the frequency.
Example Bar Chart for Number of Books Read:
3. Pie Chart: A circular chart divided into sectors, each representing a proportion of the whole.
Steps:
o Calculate the percentage of each category.
o Draw slices representing each percentage.
Example Pie Chart for Favorite Fruit:
4. Frequency Polygon: A line graph that connects points representing the frequencies of intervals in
a histogram.
Steps:
o Plot points at the midpoint of each interval at a height corresponding to the frequency.
o Connect the points with straight lines.
Example Frequency Polygon for Heights:
5. Cumulative Frequency Graph (Ogive): A line graph that shows the cumulative frequency
distribution.
Steps:
o Plot points corresponding to cumulative frequencies.
o Connect the points to form a curve.

Example Ogive for Heights:
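As a rough illustration of how such charts can be produced, the sketch below uses Python with matplotlib (assumed to be installed) and the height frequency table shown earlier to draw a histogram-style bar chart and a less-than ogive; the class boundaries used for the ogive are an assumption.

import matplotlib.pyplot as plt

# Frequency table for heights from the earlier example.
intervals = ["150-159", "160-169", "170-179", "180-189"]
frequencies = [5, 8, 12, 6]
upper_boundaries = [159.5, 169.5, 179.5, 189.5]   # assumed class boundaries

# Histogram-style chart: adjacent bars whose heights are the interval frequencies.
plt.figure()
plt.bar(intervals, frequencies, width=1.0, edgecolor="black")
plt.xlabel("Height range (cm)")
plt.ylabel("Frequency")
plt.title("Histogram of heights")

# Ogive: cumulative frequency plotted against the upper class boundaries.
cumulative = []
total = 0
for f in frequencies:
    total += f
    cumulative.append(total)

plt.figure()
plt.plot(upper_boundaries, cumulative, marker="o")
plt.xlabel("Height (cm), upper class boundary")
plt.ylabel("Cumulative frequency")
plt.title("Ogive (less-than cumulative frequency)")

plt.show()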
Importance of Frequency Distribution and Graphical Representation
• Simplification: Summarizes large datasets into a more manageable form.
• Pattern Recognition: Helps identify patterns, trends, and outliers.
• Comparison: Makes it easier to compare different sets of data.
• Decision Making: Facilitates better understanding and interpretation, aiding in decision-making
processes.

Important Questions of each topic

Concept of Statistics

1. What is the definition of statistics?


2. How does statistics differ from mathematics?
3. Describe the main functions of statistics.
4. Explain the role of statistics in scientific research.

Features of Statistics

1. What are the key characteristics of statistical data?


2. How does statistics help in decision-making?
3. What are the main features that distinguish descriptive statistics from inferential statistics?
4. Discuss the importance of variability in statistical analysis.

Significance of Statistics

1. How does statistics contribute to the field of economics?


2. Explain the significance of statistics in quality control and improvement.
3. What role does statistics play in medical research and public health?
4. How can statistics help in predicting future trends in various industries?

Limitations of Statistics

1. What are the common limitations of using statistical methods?


2. Discuss the potential biases that can affect statistical analysis.
3. How can misinterpretation of statistical data lead to incorrect conclusions?
4. Explain the impact of small sample sizes on the reliability of statistical results.

Types of Data

1. Define qualitative and quantitative data with examples.


2. What is the difference between discrete and continuous data?
3. How is ordinal data different from nominal data?
4. Explain the importance of identifying the type of data before performing statistical analysis.

Classification & Tabulation of Data

1. What is the purpose of data classification in statistics?


2. Describe the difference between qualitative and quantitative classification.
3. Explain the steps involved in tabulating data.
4. How does tabulation help in data analysis and interpretation?

Frequency Distribution & Graphical Representation

1. What is a frequency distribution, and why is it important?


2. How do you create a frequency distribution table for continuous data?
3. Describe the differences between a histogram and a bar chart.
4. Explain the purpose of a cumulative frequency graph (ogive).
5. What are the advantages of using graphical representation in statistics?

Practical Questions

Classification & Tabulation of Data

Question: Classify the following data into relevant categories and create a tabulation.

Dataset:

• Ages of participants in a survey: 22, 45, 36, 27, 34, 50, 41, 29, 33, 38, 48, 55, 39, 31, 44, 23, 37,
49, 46, 30.

Task:

• Classify the ages into categories (e.g., 20-29, 30-39, 40-49, 50-59).
• Create a table showing the frequency of participants in each age category.

Frequency Distribution

Question: Create a frequency distribution table for the following dataset.

Dataset:

• Scores of students in a test: 55, 78, 85, 66, 92, 71, 82, 64, 88, 74, 95, 79, 62, 80, 67, 90, 72, 89,
76, 84.

Task:

• Decide on appropriate intervals (e.g., 60-69, 70-79, 80-89, 90-99).


• Create a frequency distribution table showing the number of students in each interval.


Graphical Representation: Histogram

Question: Draw a histogram based on the given frequency distribution.

Dataset (from the previous question):

Score Range | Frequency
60-69       | 4
70-79       | 6
80-89       | 7
90-99       | 3

Task:

• Draw a histogram representing the frequency distribution of student scores.

Graphical Representation: Pie Chart

Question: Create a pie chart representing the following data.

Dataset:

• Market share of different smartphone brands (in percentage): Apple (30%), Samsung (25%),
Huawei (20%), Xiaomi (15%), Others (10%).

Task:

• Draw a pie chart to visually represent the market share distribution.

Unit-2
Syllabus
Measures of Central Tendency (Mean, Median, Mode), Measures of Variation (Range, Quartile Deviation,
Mean Deviation and Standard Deviation), Significance & properties of a good measure of variation,
Measures of Skewness & Kurtosis.

The concept of central tendency:


The concept of central tendency refers to the statistical measure that identifies a single value as
representative of an entire dataset. It aims to provide a summary of the data by identifying the center or
typical value. There are three common measures of central tendency:
Mean: The arithmetic average of all data points. It is calculated by adding up all the values in a dataset
and then dividing by the number of values. The mean is sensitive to extreme values (outliers).
Median: The middle value in a dataset when the values are arranged in ascending or descending order.
If there is an even number of values, the median is the average of the two middle values. The median is
not affected by outliers and provides a better measure for skewed distributions.
Mode: The most frequent value(s) in a dataset. A dataset may have one mode (unimodal), more than
one mode (bimodal or multimodal), or no mode if all values occur with the same frequency.
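A quick way to see all three measures at once is Python's standard statistics module; the dataset below is hypothetical.

import statistics

data = [3, 5, 5, 7, 9, 11, 11, 11, 13]        # hypothetical dataset

print("Mean:  ", statistics.mean(data))       # arithmetic average
print("Median:", statistics.median(data))     # middle value of the ordered data
print("Mode:  ", statistics.multimode(data))  # list of the most frequent value(s)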

Examples of Central Tendency in Practice


• Income Data: Income distributions are typically right-skewed (a few individuals earn much more
than the majority). In such cases, the median is often used to report the "average" income, as the
mean could be misleadingly high due to outliers.
• Grades in a Classroom: If most students score around the same range, the mean is often a good
representative measure. However, if there are a few extremely high or low scores, the median
might give a better sense of the class’s performance.
• Mode in Retail: Retailers might be interested in the mode when they want to know the most
frequently purchased product or the most common size sold.

Mean:
The mean (or arithmetic mean) is a fundamental measure of central tendency, representing the average
value of a dataset. It is calculated by summing all the values in a dataset and dividing the sum by the total
number of values. The mean gives us a single number that summarizes the data, reflecting the general
magnitude of the values.

Types of Means:
Arithmetic Mean: The standard mean that is most commonly used and described above.
Weighted Mean: Sometimes, different values in a dataset might have different levels of importance or
"weights." The weighted mean accounts for this by multiplying each value by its corresponding weight.

Geometric Mean: Used when dealing with multiplicative relationships or rates of growth, calculated as
the nth root of the product of all values. It is often used in financial and economic contexts (e.g., average
growth rates over time).
Harmonic Mean: Used when dealing with rates or ratios (like speed). It is less commonly used than the
arithmetic mean but is useful in specific situations.
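For the variants above, the standard statistics module again gives a convenient sketch; the values and weights below are hypothetical, and the weighted mean is computed by hand.

import statistics

values  = [4.0, 8.0, 16.0]   # hypothetical observations
weights = [1, 2, 3]          # hypothetical importance weights

# Weighted mean: each value multiplied by its weight, divided by the total weight.
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print("Arithmetic mean:", statistics.mean(values))
print("Weighted mean:  ", weighted_mean)
print("Geometric mean: ", statistics.geometric_mean(values))  # nth root of the product
print("Harmonic mean:  ", statistics.harmonic_mean(values))   # suited to rates and ratios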

When to Use the Mean:


• Symmetrical Data: The mean works best when the data distribution is roughly symmetrical,
meaning there are no extreme outliers that can skew the result.
• Continuous Data: The mean is especially useful for data that can take any value within a range,
such as heights, weights, temperatures, etc.
• Comparative Analysis: The mean is helpful in making comparisons across different datasets
(e.g., comparing the average salary in two companies).
Advantages of the Mean:
• Easy to Understand and Calculate: The mean is widely known and simple to compute.
• Uses All Data Points: The mean considers every value in the dataset, making it a comprehensive
measure of central tendency.
• Mathematically Robust: The mean is used in many statistical analyses and tests, making it a
cornerstone of statistical theory.
Disadvantages of the Mean:
• Sensitive to Outliers: As mentioned earlier, outliers can significantly affect the mean, making it
less representative of the typical value in skewed data.
• Not Always Representative: In highly skewed distributions, the mean can give a misleading
picture of the "typical" value in a dataset.

Median:
The median is a measure of central tendency that represents the middle value in a dataset when the data is
arranged in order. It is especially useful because it is not affected by outliers or extreme values, making
it a reliable measure when dealing with skewed data or when there are significant deviations in the
dataset.

How to Calculate the Median


The method for calculating the median depends on whether the dataset contains an odd or even number
of values.
1. Odd Number of Data Points:

• Arrange the data in ascending (or descending) order.

• The median is the value that is exactly in the middle of the dataset.
Example:
Dataset: 3, 5, 7, 9, 11
Step 1: Order the data: 3, 5, 7, 9, 11
Step 2: The middle value is 7, so the median is 7.
2. Even Number of Data Points:
• Arrange the data in order.
• If there is an even number of values, the median is the average of the two middle numbers.
Example:
Dataset: 3, 5, 7, 9
Step 1: Order the data: 3, 5, 7, 9
Step 2: The two middle values are 5 and 7.
Step 3: The median is the average of these two middle values:
Median = (5 + 7) / 2 = 6

Advantages of the Median


1. Resistant to Outliers: Since the median only depends on the middle value(s), extreme values or
outliers do not affect it. This makes it a better measure than the mean for highly skewed data or
data with outliers.
2. Simple to Understand: The median is conceptually straightforward—it's just the middle value,
which makes it easy to interpret.
3. Applicable to Ordinal Data: Since the median only requires the ability to rank the data, it can be
applied to ordinal variables, where differences between values cannot be quantified (like rankings
in a race: 1st, 2nd, 3rd, etc.).
Disadvantages of the Median
1. Ignores the Magnitude of Data Points: Unlike the mean, which considers all values in the
dataset, the median only looks at the middle value(s). This means it doesn't take into account the
exact size of the data points.
2. Not as Useful for Symmetrical Distributions: In a perfectly symmetrical dataset without
outliers, the mean often provides a more comprehensive picture because it uses all the data points,
while the median only considers the position.

Mode:

The mode is a measure of central tendency that refers to the value or values that occur most frequently in
a dataset. Unlike the mean and median, which focus on the arithmetic or positional center, the mode
identifies the most common or popular value(s). The mode can be useful for understanding the most
typical or frequent observation in a set of data.
Key Characteristics of the Mode:
1. Frequency-Based: The mode is defined by how often a value appears in the dataset. It can occur
once or multiple times, depending on the distribution of the data.
2. Applicable to All Data Types: The mode can be used with nominal, ordinal, discrete, and
continuous data. It is particularly useful for categorical data, where numerical averages like the
mean are not meaningful (e.g., most common color, most common shoe size).
3. Multiple Modes: A dataset can have:
o Unimodal: One mode (most frequent value).
o Bimodal: Two modes (two values that occur with equal frequency).
o Multimodal: More than two modes (multiple frequent values).

How to Find the Mode


The mode is simply the value(s) that appears the most frequently. For instance:
Example 1 (Unimodal Dataset):
Consider the dataset:
3,5,5,7,8
Here, the mode is 5 because it appears twice, more than any other value.
Example 2 (Bimodal Dataset):
Consider the dataset:
1,2,2,3,4,4,5
In this case, the modes are 2 and 4 because both occur twice, and no other number occurs more
frequently.
Example 3 (No Mode):
Consider the dataset:
1,2,3,4,5
In this dataset, each value occurs exactly once, so there is no mode.

Advantages of the Mode

1. Simple to Identify: The mode is easy to understand and find, especially in small datasets or
categorical data.
2. Can Be Used for Categorical Data: Unlike the mean and median, which require numerical data,
the mode can be used for qualitative, categorical data (e.g., most common job title, most popular
product).
3. Not Affected by Extreme Values: The mode is not influenced by outliers or extreme values,
making it useful in skewed datasets where the mean might be distorted.
4. Gives Insight into Frequency: The mode shows the most common value, which can be useful in
identifying trends or preferences (e.g., most purchased product, most common customer
complaint).
Disadvantages of the Mode
1. Not Always Unique: In some datasets, there may be multiple modes or no mode at all, which can
make the mode less informative in some cases.
2. Does Not Reflect the Entire Dataset: The mode only tells us about the most frequent value(s),
and it doesn’t take into account all the other data points in the dataset. This can be a limitation if
you need a measure that considers the overall distribution of the data.
3. Not Useful for Further Statistical Calculations: Unlike the mean, which is used in many
statistical formulas and analyses, the mode is limited in its application for more advanced
statistical work.
Examples of Mode in Real-Life Applications
1. Retail and Sales: Retailers may track the most frequently purchased item (mode) to determine
which products are most popular among customers.
2. Fashion Industry: The most common clothing size purchased in a store is an example of the
mode. This can help businesses manage inventory more effectively by stocking the most popular
sizes.
3. Education: Schools may look at the most frequent grade or score to gauge the overall
performance of students in a class. For instance, if the mode of student scores is a "B", the teacher
knows that most students are performing at that level.
4. Medicine: Medical professionals might track the most common symptoms reported by patients to
identify the most frequent health issues within a population.

Outliers:
Outliers are data points that differ significantly from the rest of the data in a dataset. They are either
unusually high or unusually low values compared to the majority of the data. Outliers can skew results,
affect statistical analyses, and may indicate errors in data collection, rare events, or important deviations.


Difference among Mean, Median and Mode:

Measure | Definition                        | Sensitivity to Outliers  | Best Use Case                                    | Data Types
Mean    | Arithmetic average of the values  | Sensitive to outliers    | Symmetrical distributions, numerical data        | Numerical (continuous, discrete)
Median  | Middle value when data is ordered | Not affected by outliers | Skewed distributions, ordinal and numerical data | Ordinal, numerical
Mode    | Most frequent value               | Not affected by outliers | Categorical data, multimodal distributions       | Categorical, ordinal, numerical

Measures of Variation:
Measures of Variation (also known as measures of dispersion) quantify how spread out or
dispersed the data points are in a dataset. While measures of central tendency (mean, median,
mode) indicate where the center of the data lies, measures of variation describe how much the
data deviates from that center.

Common Measures of Variation:


1. Range:
The range is the simplest measure of variation. It is the difference between the largest and
smallest values in a dataset. The range gives a basic sense of how spread out the data is but only
considers the two extreme values.
Formula:
Range = Maximum value − Minimum value
Key Points:

• Simple to calculate: It only requires knowing the highest and lowest values in the
dataset.
• Sensitive to outliers: The range can be significantly affected by extreme values
(outliers). For example, if a dataset whose smallest value is 3 gains a single outlier of 100,
the range becomes 100 − 3 = 97, even though most of the values are much closer together.
• Limited information: While the range provides some indication of the spread, it doesn’t
tell us how data points are distributed within that spread. Two datasets with the same
range can have very different distributions.
When to Use:
The range is most useful in preliminary analysis when you need a quick snapshot of the spread of
the data. However, for more detailed insights, you should use other measures of variation like
standard deviation or interquartile range.
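A tiny Python check (with hypothetical data) shows both the calculation and the sensitivity to a single outlier.

data = [3, 5, 7, 9, 11]                 # hypothetical dataset
print("Range:", max(data) - min(data))  # 11 - 3 = 8

data_with_outlier = data + [100]        # a single extreme value is added
print("Range with outlier:", max(data_with_outlier) - min(data_with_outlier))  # 100 - 3 = 97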

2. Quartile Deviation
The Quartile Deviation (also known as the Semi-Interquartile Range) is a measure of
dispersion that quantifies the spread of the middle 50% of a dataset. It is based on the
interquartile range (IQR), but rather than representing the full range between the first and third
quartiles, it represents half of this range. This measure provides a more focused view of the
variability around the median.
Formula:
Quartile Deviation = (Q3 − Q1) / 2
Where:
• Q3 = Third quartile (the 75th percentile)
• Q1 = First quartile (the 25th percentile)
Key Points:
• Focuses on Middle 50%: The quartile deviation focuses on the central portion of the
data, reducing the influence of outliers and extreme values.
• Robust: It is less sensitive to outliers compared to the range and variance because it only
considers the middle 50% of the data.
• Simple Interpretation: Quartile deviation is easier to interpret compared to variance and
standard deviation because it is expressed in the same units as the data and focuses on the
central 50%.
When to Use:

The quartile deviation is useful when you want to measure the spread of the middle 50% of the
data and when the dataset may have outliers or be skewed. It is especially beneficial when you
need a measure of spread that is robust to extreme values.
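A short Python sketch using statistics.quantiles (available in Python 3.8+, with its default quartile method) illustrates the calculation on a hypothetical dataset.

import statistics

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]   # hypothetical dataset

# quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3].
q1, _, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1
quartile_deviation = iqr / 2

print("Q1:", q1, " Q3:", q3)
print("Interquartile range:", iqr)
print("Quartile deviation:", quartile_deviation)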
Mean Deviation
The Mean Deviation (also known as the Mean Absolute Deviation or MAD) is a measure of
dispersion that quantifies the average distance of each data point from the mean of the dataset.
Unlike variance and standard deviation, which square the differences from the mean, the mean
deviation uses the absolute values of these differences.
Key Points:
• Absolute Values: Mean deviation uses absolute values of deviations, making it less
sensitive to extreme values compared to variance and standard deviation.
• Easy Interpretation: Since it is based on absolute deviations, it is straightforward and
easy to understand as it is expressed in the same units as the data.
• Robustness: While not as robust as the interquartile range, the mean deviation is more
robust than the variance and standard deviation with respect to outliers because it does
not involve squaring deviations.

When to Use:

The mean deviation is useful when you want a measure of dispersion that is easy to understand
and interpret, especially when the dataset may contain outliers or when a simple measure of
spread is required. It is often used in introductory statistics and practical applications where ease
of calculation and interpretation is important.
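Because the mean deviation is just the average absolute distance from the mean, it can be computed directly; the data below is hypothetical.

data = [10, 12, 14, 18, 21]   # hypothetical dataset

mean = sum(data) / len(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print("Mean:", mean)                       # 15.0
print("Mean deviation:", mean_deviation)   # average absolute distance from the mean (3.6)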

Standard Deviation
The Standard Deviation is a widely used measure of dispersion that quantifies the amount of
variation or spread in a dataset. It indicates how much individual data points deviate, on average,
from the mean of the dataset. The standard deviation is expressed in the same units as the data,
making it easy to interpret.
Key Points:
• Expressed in Same Units: The standard deviation is expressed in the same units as the
data, making it easier to understand in context.
• Sensitive to Outliers: Because it involves squaring deviations, the standard deviation is
sensitive to extreme values or outliers. Larger deviations contribute more significantly to
the standard deviation.
• Comprehensive Measure: It provides a detailed understanding of data dispersion and is
widely used in statistical analysis, quality control, finance, and various scientific fields.


When to Use:
The standard deviation is used when you need a comprehensive measure of variability that
considers all data points and provides insights into the dispersion of the dataset. It is widely used
in statistical analyses, including hypothesis testing, confidence intervals, and many other
applications where understanding data spread is crucial.
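With the standard statistics module, both the population and sample versions can be computed; the dataset is hypothetical.

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical dataset

print("Population variance:", statistics.pvariance(data))  # divides by N
print("Population std dev: ", statistics.pstdev(data))
print("Sample variance:    ", statistics.variance(data))   # divides by N - 1
print("Sample std dev:     ", statistics.stdev(data))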

Significance of Measures of Variation


Measures of variation (also called measures of dispersion) are crucial in statistics because they
provide insight into the spread or dispersion of a dataset. While measures of central tendency
(like mean, median, and mode) tell us about the center of the data, measures of variation give us
a sense of how much the data points differ from each other and from the central value.
Why Are Measures of Variation Important?
1. Understanding Data Spread: Measures of variation help quantify how much the data
values spread out or cluster around the central value (mean or median). Knowing the
variation is essential for understanding the diversity or uniformity within a dataset.
o Example: In a dataset of test scores, if all scores are very close to the mean, the
variation is low, indicating that most students performed similarly. If the scores
are widely spread out, the variation is high, indicating a more diverse
performance.
2. Comparing Datasets: They allow us to compare the consistency of different datasets.
Two datasets may have the same mean but very different variabilities, which can lead to
different conclusions.
o Example: Two investment portfolios might have the same average return, but if
one has a high variance in returns (greater risk), it is riskier than the other
portfolio with lower variance.
3. Identifying Outliers: Measures of variation can help detect outliers. When the variation
is high, there may be extreme values that are far from the center of the dataset.
4. Making Predictions and Decisions: A low variation implies consistency, which is useful
for making predictions or drawing conclusions. In contrast, high variation implies
unpredictability, which may require more caution in decision-making.
o Example: In quality control, low variation in product measurements indicates
consistency, whereas high variation signals potential issues in the manufacturing
process.
5. Supporting Hypothesis Testing: Measures of variation are used in hypothesis testing,
confidence intervals, and many other statistical techniques. They help assess the

reliability and precision of estimates and determine whether differences between datasets
are significant.
Common Measures of Variation
1. Range: The difference between the highest and lowest values in the dataset.
o Significance: Simple and easy to calculate but sensitive to outliers.
2. Variance: The average of the squared differences from the mean. It measures how spread
out the data points are around the mean.
o Significance: It provides a more comprehensive view of dispersion by
considering all data points and their distance from the mean.
3. Standard Deviation: The square root of variance. It is a widely used measure because it
expresses the dispersion in the same units as the data.
o Significance: Standard deviation is easy to interpret and is commonly used in
finance, science, and other fields.
4. Interquartile Range (IQR): The range of the middle 50% of the data. It is the difference
between the third quartile (Q3) and the first quartile (Q1).
o Significance: IQR is not affected by outliers and is ideal for skewed datasets or
datasets with extreme values.
5. Mean Absolute Deviation (MAD): The average of the absolute differences between each
data point and the mean.
o Significance: Provides a straightforward way to measure dispersion without
squaring deviations, so it is less sensitive to extreme values than variance.

Properties of Good Measures of Variation


1. Sensitivity to All Data Points: A good measure of variation should take into account all
data points in the dataset, not just a few extreme values. For instance, the variance and
standard deviation are based on every data point's deviation from the mean.
o Example: Standard deviation and variance consider all data points, unlike the
range, which only considers the minimum and maximum values.
2. Non-Negative: A good measure of variation should always be non-negative, since
variation represents the degree of spread, which cannot be negative. Standard deviation,
variance, and IQR are always positive.
3. Consistency in Interpretation: A smaller measure of variation should always indicate a
tighter clustering of data points around the center, while a larger measure should indicate

greater spread. For example, a lower standard deviation indicates that data points are
close to the mean.
4. Resistant to Outliers: Some measures of variation, like the interquartile range (IQR),
are less affected by outliers, making them more robust for skewed data. A good measure
of variation should either handle outliers or provide insight into how much they affect the
spread of the data.
5. Mathematical Simplicity: Measures of variation should be easy to compute and
interpret. While variance and standard deviation involve squaring deviations from the
mean, they are still widely used because of their clear interpretation.
6. Comparable Across Different Datasets: A good measure of variation should allow
comparisons between different datasets. For instance, standard deviation allows us to
compare the variability of datasets, even if they have different units or scales.
7. Consistent with the Mean: Ideally, a good measure of variation should complement the
mean. Measures like variance and standard deviation are based on deviations from the
mean, which makes them directly linked to it.
8. Invariance under Linear Transformation: A measure of variation should behave
predictably under scaling and shifting of data. For example, multiplying all data points by
a constant will affect the variance and standard deviation in a predictable way (variance
will increase by the square of the constant).
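Point 8 can be verified numerically. The sketch below (hypothetical data, scaling constant c = 3) checks that scaling multiplies the population variance by c² while shifting leaves it unchanged.

import statistics

data = [2, 4, 6, 8, 10]           # hypothetical dataset
c = 3                             # scaling constant

scaled  = [c * x for x in data]   # multiply every value by c
shifted = [x + 100 for x in data] # add a constant to every value

# Variance scales by c squared; shifting leaves it unchanged.
print(statistics.pvariance(scaled), "==", c ** 2 * statistics.pvariance(data))
print(statistics.pvariance(shifted), "==", statistics.pvariance(data))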

Summary of Key Measures of Variation:

Measure                 | Key Features                                 | Sensitivity to Outliers | Interpretation
Range                   | Difference between max and min values        | Highly sensitive        | Simple, but not robust
Variance                | Average squared deviation from the mean      | Sensitive               | Comprehensive, but in squared units
Standard Deviation      | Square root of variance                      | Sensitive               | Same units as data
Interquartile Range     | Spread of the middle 50% of data             | Resistant               | Good for skewed data
Mean Absolute Deviation | Average of absolute deviations from the mean | Less sensitive          | Simple interpretation


Measures of Skewness and Kurtosis


Skewness and kurtosis are statistical measures that describe the shape of a distribution. They
provide insights into the asymmetry and the tails of the distribution, respectively.

1. Skewness
Skewness measures the asymmetry of the probability distribution of a real-valued random
variable. It tells us whether the data is skewed to the left (negative skewness) or to the right
(positive skewness) compared to a normal distribution.
Interpretation:
• Skewness = 0: The distribution is symmetric.
• Skewness > 0: The distribution is positively skewed (right-skewed); the right tail is
longer or fatter.
• Skewness < 0: The distribution is negatively skewed (left-skewed); the left tail is longer
or fatter.

2. Kurtosis
Kurtosis measures the "tailedness" of the probability distribution. It provides an indication of the
presence of outliers and the peakedness of the distribution compared to a normal distribution.
Interpretation (in terms of excess kurtosis, i.e., kurtosis minus 3, so that a normal distribution scores 0):
• Excess kurtosis = 0: The distribution has the same tailedness as a normal distribution
(mesokurtic).
• Excess kurtosis > 0: The distribution has heavier tails and a sharper peak than a normal
distribution (leptokurtic); it indicates the presence of outliers.
• Excess kurtosis < 0: The distribution has lighter tails and a flatter peak than a normal
distribution (platykurtic); it indicates fewer outliers.
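A minimal sketch of the moment-based formulas (sample skewness m₃/m₂^1.5 and excess kurtosis m₄/m₂² − 3), applied to a hypothetical right-skewed dataset.

def central_moment(data, k):
    """k-th central moment: average of (x - mean) raised to the power k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    m2, m3 = central_moment(data, 2), central_moment(data, 3)
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    m2, m4 = central_moment(data, 2), central_moment(data, 4)
    return m4 / m2 ** 2 - 3        # subtract 3 so a normal distribution gives 0

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]   # hypothetical data with a long right tail
print("Skewness:       ", round(skewness(right_skewed), 3))   # positive (right-skewed)
print("Excess kurtosis:", round(excess_kurtosis(right_skewed), 3))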

Difference between Skewness and Kurtosis:


Skewness measures the asymmetry of the distribution. Positive skewness indicates a longer or
fatter right tail, while negative skewness indicates a longer or fatter left tail.
Kurtosis measures the tails' heaviness or lightness. Positive kurtosis indicates heavy tails and a
sharper peak, while negative kurtosis indicates light tails and a flatter peak.

Unit-III
Syllabus
Correlation and Regression: Meaning and Types of Correlation, simple correlation, Scatter diagram method,
Karl Pearson's Coefficient of correlation, Significance of correlation, Regression concept, Regression
equations and Regression coefficient.

Correlation:
Correlation is a fundamental concept in statistics that measures the strength and direction of the
relationship between two variables. It is an essential tool for understanding how two variables
move together and is commonly used across various disciplines, including economics, finance,
psychology, and medicine, to interpret data and relationships

Types of Correlation
1. Positive Correlation:
o When one variable increases, the other variable also tends to increase.
o Example: As the number of hours studied increases, the exam scores typically
increase. This is known as a direct relationship.
o Values of the correlation coefficient closer to +1 indicate a strong positive
relationship.
2. Negative Correlation:
o When one variable increases, the other variable tends to decrease.
o Example: As the number of hours spent watching TV increases, the exam scores
tend to decrease. This is known as an inverse relationship.
o Values closer to -1 indicate a strong negative relationship.
3. No Correlation:
o When there is no consistent relationship between the variables.
o Example: The amount of coffee consumed and a person’s height. No discernible
pattern links the two variables.
o A correlation coefficient around 0 suggests no correlation.

Scatter Diagram Method


The scatter diagram, also known as a scatter plot or scatter graph, is a graphical method used
to visually represent the relationship between two variables. It consists of plotting data points on

a two-dimensional graph, where each point represents an observation with coordinates
corresponding to two variables. This method helps in identifying the correlation (relationship)
between the two variables.
Key Components of a Scatter Diagram:
1. X-Axis: Represents the independent variable (also called the predictor or explanatory
variable).
2. Y-Axis: Represents the dependent variable (also called the response variable).
3. Data Points: Each data point on the graph corresponds to an observation and shows the
values of both the independent and dependent variables.

Advantages of the Scatter Diagram Method:


1. Simple and Visual: A scatter diagram provides a quick, visual way to identify
relationships between two variables without complex calculations.
2. Identifies Correlation: It helps to see if variables have a positive, negative, or no
correlation. The closer the points are to forming a straight line, the stronger the
correlation.
3. Detects Outliers: Scatter plots can easily reveal outliers or extreme values that deviate
significantly from the overall pattern.
4. Flexible: It can be used for both linear and non-linear relationships.
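A minimal scatter-diagram sketch in Python, assuming matplotlib is installed and using hypothetical hours-studied vs. exam-score data.

import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]          # independent variable (X-axis)
exam_scores   = [52, 55, 61, 64, 70, 74, 79, 85]  # dependent variable (Y-axis)

plt.scatter(hours_studied, exam_scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Scatter diagram: hours studied vs. exam score")
plt.show()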
Types of Correlation Measures
1. Pearson's Correlation Coefficient:
o Measures the strength of the linear relationship between two continuous variables.
o Assumes a normal distribution of data and a linear relationship between variables.
2. Spearman’s Rank Correlation:
o A non-parametric measure that assesses the monotonic relationship between two
variables. It is useful when data are not normally distributed or the relationship is
not linear.
o Used when both variables are ranked (ordinal data) or not normally distributed.
o Spearman's rank correlation coefficient is denoted by ρ (rho).

Karl Pearson's Coefficient of Correlation:

Karl Pearson's Coefficient of Correlation (often denoted as r) is a statistical measure that
quantifies the strength and direction of the linear relationship between two continuous variables.
It ranges from -1 to +1, where:
• +1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship,
• 0 indicates no linear relationship.
Formula of Karl Pearson's Coefficient of Correlation:
If we have n paired observations (xᵢ, yᵢ) of the data sets X and Y, with means x̄ and ȳ, we can use:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
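The formula can be applied directly in a few lines of Python; the paired data below is hypothetical.

import math

x = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical independent variable
y = [52, 55, 61, 64, 70, 74, 79, 85]  # hypothetical dependent variable

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # numerator
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
print("Pearson's r:", round(r, 4))   # close to +1 for this strongly linear data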

Spearman's Rank Correlation:


Spearman’s Rank Correlation Coefficient (denoted as ρ or r_s) is a non-parametric
measure of the strength and direction of the ranked (ordinal) relationship between two variables.
It assesses how well the relationship between two variables can be described using a monotonic
function (a relationship that consistently increases or decreases, but not necessarily at a constant
rate).
Unlike Pearson's Correlation, which requires linear and normally distributed data, Spearman's
Rank Correlation is suitable for both linear and non-linear relationships, and it can handle
ordinal, interval, or ratio data.


Steps to Calculate Spearman’s Rank Correlation:


1. Assign Ranks: Rank the values of each variable. If two or more values are the same,
assign them the average of the ranks they would have held.
2. Calculate the Difference in Ranks: For each pair of ranks (for variables X and Y), find
the difference di.
3. Square the Differences: Square the rank differences, for each pair.
4. Substitute Values into the Formula: Use ρ = 1 − (6 Σdᵢ²) / (n(n² − 1)), where n is the number of paired observations.
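The steps above can be scripted as follows. The data is hypothetical, and the ranks helper (which assigns average ranks to ties) is an illustrative assumption rather than a library routine; note that the simple rank-difference formula is exact only when there are no ties.

def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            rank[order[k]] = average_rank
        i = j + 1
    return rank

x = [35, 23, 47, 17, 10, 43, 9, 6, 28]     # hypothetical variable X
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]     # hypothetical variable Y

rx, ry = ranks(x), ranks(y)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
n = len(x)

rho = 1 - (6 * d_squared) / (n * (n ** 2 - 1))
print("Spearman's rho:", round(rho, 4))   # strong positive monotonic relationship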

Significance of Correlation in Statistics


1. Measures Relationships: Correlation quantifies the strength and direction of the
relationship between two variables, helping to understand how they move together.
2. Predictive Power: A strong correlation between two variables allows for predictions
about one variable based on the other.
3. Foundation for Regression: Correlation is the basis for regression analysis, which
models the relationship between variables for predictive purposes.
4. Informs Decision-Making: In fields like business, economics, and healthcare,
correlation guides decisions by identifying important variable relationships (e.g., sales vs.
advertising spend).

5. Simplifies Data: It helps reduce complexity by focusing on key variables with strong
relationships, making data easier to analyze.
6. Detects Multicollinearity: In multivariate analysis, correlation helps identify
multicollinearity, where predictor variables are highly correlated, which can distort
results.
7. No Causation: While it identifies relationships, correlation does not imply causation,
meaning it cannot confirm that one variable causes changes in another.

Limitations of Correlation:
While correlation is a powerful tool, it has limitations that must be kept in mind:
• Correlation Does Not Imply Causation: Even if two variables are highly correlated, it
does not mean that one causes the other. External factors or a third variable could be
driving the relationship.
o Example: Ice cream sales and drowning incidents may be correlated because both
occur more frequently in the summer, but eating ice cream does not cause
drowning.
• Outliers Can Skew Results: Extreme values (outliers) in the data can distort the
correlation coefficient, giving a misleading picture of the strength and direction of the
relationship.
• Only Measures Linear Relationships: Pearson’s correlation coefficient only captures
linear relationships. If two variables have a strong non-linear relationship, Pearson’s r
may be close to zero even if there is a strong association.
• Multicollinearity Issues: In multivariate analysis, if two predictor variables are highly
correlated (multicollinearity), this can create problems in interpreting the results of
regression analysis.

Regression Analysis:
Regression analysis is a powerful statistical method used to examine the relationship between
two or more variables. It helps in predicting the value of a dependent variable (also known as the
response or outcome variable) based on the value(s) of one or more independent variables (also
called predictors or explanatory variables).
The primary goal of regression is to model the relationship between variables and use it to make
predictions, explain trends, or determine the strength of associations.

Types of Regression:
1. Simple Linear Regression:
o Examines the relationship between two variables: one independent variable and
one dependent variable.
o The model assumes a linear relationship between the variables.
o Example: Predicting a student's exam score (dependent variable) based on the
number of hours studied (independent variable).
2. Multiple Linear Regression:
o Involves two or more independent variables and one dependent variable.
o It helps in understanding how multiple factors contribute to the outcome.
o Example: Predicting house prices (dependent variable) based on factors like
square footage, number of bedrooms, and location (independent variables).
3. Non-Linear Regression:
o Models a non-linear relationship between the dependent and independent
variables. It is useful when data does not follow a straight-line trend.

Components of a Regression Model:


1. Dependent Variable (Y): The variable we want to predict or explain (e.g., salary, house
price).
2. Independent Variable(s) (X): The predictor variable(s) that are used to predict the
dependent variable (e.g., years of experience, square footage).
3. Intercept (a): The value of Y when X=0. It represents the starting point of the regression
line.
4. Slope (b): Indicates how much the dependent variable changes for a unit change in the
independent variable.
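Putting these four components together gives the fitted equation Ŷ = a + bX. A minimal Python sketch of
the least-squares calculation, using assumed hours-studied and exam-score data in line with the earlier
example, is:

# Simple linear regression of exam score (Y) on hours studied (X); data are assumed
hours = [1, 2, 3, 4, 5]
score = [52, 58, 65, 70, 78]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(score) / n

# slope b = sum of cross-deviations / sum of squared deviations of X
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, score)) / \
    sum((x - mean_x) ** 2 for x in hours)
a = mean_y - b * mean_x        # intercept: the fitted line passes through (mean_x, mean_y)

print(f"Y-hat = {a:.2f} + {b:.2f} X")   # Y-hat = 45.40 + 6.40 X
print(a + b * 6)                         # predicted score for 6 hours of study: 83.8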

Regression Lines and Equations


In regression analysis, a regression line is a straight line that best represents the data in a scatter
plot. The purpose of the regression line is to model the relationship between an independent
variable (X) and a dependent variable (Y). The line is used to predict values of Y given values of
X.
There are two types of regression lines depending on which variable we are predicting:

1. Line of Regression of Y on X: Predicts Y based on X; its equation is Y − ȳ = b_yx (X − x̄),
where b_yx = r (σY / σX).
2. Line of Regression of X on Y: Predicts X based on Y; its equation is X − x̄ = b_xy (Y − ȳ),
where b_xy = r (σX / σY).

Key Points to Remember About Regression Lines:


1. Line of Best Fit: The regression line minimizes the sum of the squared differences
between the observed and predicted values (this is known as the least squares criterion).
2. Linear Relationship: The line represents the best linear relationship between X and Y.
3. Prediction: The regression line allows you to make predictions of Y given any value of X
(or vice versa for the regression line of X on Y).
4. Two Types of Lines: Always distinguish between the regression line of Y on X
(predicting Y given X) and X on Y (predicting X given Y).

Regression Coefficient:
In simple linear regression, the regression coefficient refers specifically to the slope (b) of the
regression line, which describes the relationship between the independent variable X and the
dependent variable Y. This coefficient quantifies how much the dependent variable changes when
the independent variable changes by one unit. It can be computed as
b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², or equivalently as b = r (σY / σX).


Interpretation of the Regression Coefficient:


• Positive Coefficient: If b>0, it means there is a positive relationship between X and Y.
As X increases, Y increases. The greater the value of b, the steeper the slope.
• Negative Coefficient: If b<0, it indicates a negative relationship between X and Y. As X
increases, Y decreases.

• Zero Coefficient: If b=0, it means there is no linear relationship between X and Y.
Changes in X do not affect Y.
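As a small numerical illustration (with assumed values): if r = 0.8, σX = 4 and σY = 6, then the regression
coefficient of Y on X is b_yx = r (σY / σX) = 0.8 × 6 / 4 = 1.2, while the coefficient of X on Y is
b_xy = r (σX / σY) = 0.8 × 4 / 6 ≈ 0.53. Their product b_yx × b_xy = 0.64 = r², and both coefficients always
take the same sign as r.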

Difference between Correlation and Regression:

• Correlation is used to measure the degree of association between two variables without
assuming one causes the other.
• Regression is a more advanced tool used to make predictions: it models how the dependent
variable changes, on average, when an independent variable changes, treating one variable as the
outcome and the other(s) as predictors. On its own, however, regression does not prove a
cause-and-effect relationship.

Unit-IV
Syllabus
Probability: Concept, Events, Addition Law, Conditional Probability, Multiplication Law & Introduction
to Bayes' theorem [Simple numerical]. Introduction to Probability Distribution: Binomial, Poisson and
Normal. Sampling: Method of sampling, Sampling and non-sampling errors, Introduction to Test of
hypothesis, Type-I and Type-II Errors, Large sample tests. Introduction to MS Excel and its use in
Business statistics.

Probability:
Probability is a numerical measure of the likelihood that an event will occur. It always lies between 0 and
1: a probability of 0 means the event is impossible, while a probability of 1 means the event is certain. For
a random experiment with equally likely outcomes, the probability of an event A is
P(A) = (number of outcomes favourable to A) / (total number of possible outcomes).

Important terms in probability:
• Random Experiment: A process whose outcome cannot be predicted with certainty in advance (e.g.,
tossing a coin or rolling a die).
• Sample Space (S): The set of all possible outcomes of a random experiment.
• Event: Any subset of the sample space; a simple event contains a single outcome.
• Mutually Exclusive Events: Events that cannot occur together in a single trial.
• Independent Events: Events for which the occurrence of one does not change the probability of the other.
• Addition Law: P(A ∪ B) = P(A) + P(B) − P(A ∩ B); for mutually exclusive events this reduces to
P(A ∪ B) = P(A) + P(B).
• Conditional Probability: The probability of A given that B has already occurred,
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.
• Multiplication Law: P(A ∩ B) = P(A) × P(B | A); for independent events, P(A ∩ B) = P(A) × P(B).
• Bayes' Theorem: P(A | B) = P(B | A) × P(A) / P(B); it revises the prior probability of A in the light of the
observed event B.
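A simple numerical sketch of these laws in Python, using assumed figures (machine A produces 60% of a
factory's output with 2% defectives; machine B produces 40% with 5% defectives), is:

# Assumed figures: P(A) = 0.6, P(B) = 0.4, P(D|A) = 0.02, P(D|B) = 0.05
p_a, p_b = 0.6, 0.4
p_d_given_a, p_d_given_b = 0.02, 0.05

p_a_and_d = p_a * p_d_given_a                  # multiplication law: P(A and D) = 0.012

# total probability of a defective item (the two routes are mutually exclusive)
p_d = p_a * p_d_given_a + p_b * p_d_given_b    # 0.012 + 0.020 = 0.032

p_a_given_d = p_a_and_d / p_d                  # Bayes' theorem: P(A | D) ≈ 0.375
print(round(p_d, 3), round(p_a_given_d, 3))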


Introduction to Probability Distribution:


A probability distribution describes how probability is spread over the possible values of a random
variable. It provides a framework for understanding the likelihood of the various outcomes in a given
situation. There are several types of probability distributions; three commonly used ones are the Binomial
Distribution, the Poisson Distribution, and the Normal Distribution.

1. Binomial Distribution:
Definition
The Binomial Distribution is a discrete probability distribution that describes the number of successes in
a fixed number of independent trials of a binary experiment (an experiment with two possible outcomes,
often termed "success" and "failure").
Parameters
• n: The number of trials (fixed).
• p: The probability of success on each trial (0 ≤ p ≤ 1).
Assumptions
1. There are a fixed number of trials (n).
2. Each trial is independent of others.
3. Each trial has only two possible outcomes: success or failure.
4. The probability of success (p) remains constant for each trial.
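Under these assumptions, the probability of exactly k successes is P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ, with
mean np and variance np(1 − p). A minimal Python sketch with assumed values n = 10, p = 0.3 and k = 3 is:

from math import comb

n, p, k = 10, 0.3, 3                     # assumed values
p_k = comb(n, k) * p ** k * (1 - p) ** (n - k)
print(round(p_k, 4))                     # about 0.2668
print(n * p, round(n * p * (1 - p), 2))  # mean = 3.0, variance = 2.1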

2. Poisson Distribution:
Definition
The Poisson Distribution is a discrete probability distribution that models the number of events
occurring in a fixed interval of time or space, under the assumption that these events occur independently
of one another and at a constant average rate.
Parameters
• λ: The average number of events in the given interval (mean rate of occurrence).
Assumptions
1. Events occur independently.
2. The average rate (λ) is constant.
3. Two or more events cannot occur at exactly the same time.
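Under these assumptions, the probability of exactly k events is P(X = k) = e^(−λ) λᵏ / k!, and both the
mean and the variance equal λ. A minimal Python sketch with assumed values λ = 4 (average calls per hour)
and k = 2 is:

from math import exp, factorial

lam, k = 4, 2                 # assumed values: average 4 calls per hour, want P(exactly 2 calls)
p_k = exp(-lam) * lam ** k / factorial(k)
print(round(p_k, 4))          # about 0.1465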

3. Normal Distribution:
Definition
The Normal Distribution is a continuous probability distribution that is symmetric and bell-shaped, with
observations clustering around the mean and probabilities tapering off equally in both directions.
Parameters
• μ: The mean, which locates the centre of the distribution.
• σ: The standard deviation, which measures the spread of values around the mean.
Properties
1. The curve is symmetric about the mean, so mean = median = mode.
2. The total area under the curve equals 1.
3. About 68% of values lie within μ ± σ, about 95% within μ ± 2σ, and about 99.7% within μ ± 3σ.

Difference among Binomial, Poisson and Normal Distribution:

Binomial: Discrete; counts the number of successes in a fixed number of independent trials, each with two
possible outcomes.
Poisson: Discrete; counts the number of events occurring in a fixed interval of time or space, typically for
rare events.
Normal: Continuous; bell-shaped and symmetric, described by its mean and standard deviation.

Sampling:
Sampling is the process of selecting a subset of individuals or observations from a larger population to
estimate characteristics of the whole population. Proper sampling helps ensure that the sample accurately
represents the population, leading to reliable results in research and analysis.

Methods of Sampling:
1. Probability Sampling: Every member of the population has a known and non-zero chance of
being selected. This method allows for statistical inferences to be made about the population.
o Simple Random Sampling: Each member of the population has an equal chance of
being selected. This can be achieved using random number generators or lottery methods.
o Stratified Sampling: The population is divided into strata (groups) based on certain
characteristics (e.g., age, gender), and random samples are taken from each stratum. This
ensures representation from all sub-groups.
o Systematic Sampling: Members of the population are selected at regular intervals (e.g.,
every 10th member) from a randomly selected starting point.
o Cluster Sampling: The population is divided into clusters (usually geographically), and
entire clusters are randomly selected. This method is often used for large populations.
2. Non-Probability Sampling: Not every member of the population has a known chance of being
selected. This method may lead to biases in the results.
o Convenience Sampling: Samples are taken from a group that is easily accessible (e.g.,
surveying people in a mall).
o Judgmental Sampling: The researcher uses their judgment to select subjects who they
believe are most representative of the population.
o Snowball Sampling: Existing study subjects recruit future subjects from among their
acquaintances. This method is often used for hard-to-reach populations.
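A minimal Python sketch of two of these probability methods, drawing from an assumed population of 100
numbered customers, is:

import random

population = list(range(1, 101))        # assumed population: customer IDs 1 to 100

# Simple random sampling: every member has an equal chance of selection
simple_sample = random.sample(population, 10)

# Systematic sampling: every k-th member from a random starting point
k = len(population) // 10               # sampling interval of 10
start = random.randrange(k)             # random start between 0 and k - 1
systematic_sample = population[start::k]

print(simple_sample)
print(systematic_sample)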

Sampling Errors vs. Non-Sampling Errors:


Sampling Errors:
Sampling errors occur when the sample selected does not perfectly represent the population due to
random chance. These errors are inherent in the sampling process and can be reduced by increasing the
sample size. Examples include:
• Random Sampling Error: The difference between the sample statistic and the actual population
parameter that arises purely by chance. For instance, if a sample mean is calculated, it may differ
from the true population mean simply due to random selection.
• Systematic Sampling Error: Errors that arise from the way samples are selected, leading to a
bias. For example, if only certain clusters are chosen, it may not reflect the entire population
accurately.

Non-Sampling Errors:
Non-sampling errors are errors that occur regardless of how the sample is selected. These can introduce
bias and inaccuracies into the results. Examples include:
• Measurement Error: Errors that occur during data collection. For instance, if a survey question
is poorly worded, it may lead to inaccurate responses.
• Non-Response Error: Occurs when a significant number of individuals selected for the sample
do not respond or participate. This can skew results if the non-respondents differ significantly
from those who do respond.
• Data Processing Error: Mistakes made during data entry, coding, or analysis can lead to
incorrect results.
• Selection Bias: This occurs when the sample is not representative of the population due to
systematic differences between the sample and the population.

Introduction to the Test of Hypothesis:


Hypothesis testing is a statistical method that allows researchers to make decisions or inferences about a
population based on sample data. It involves formulating a hypothesis, collecting data, and then
determining the likelihood that the hypothesis is true. This process is fundamental in many fields,
including science, medicine, social sciences, and business.

Key Concepts:
1. Hypothesis: A hypothesis is a statement that can be tested and is subject to validation or
rejection. There are two types of hypotheses:
o Null Hypothesis (H0): This is the hypothesis that there is no effect or no difference. It
serves as the default assumption that any observed effect in the data is due to sampling
variability.
o Alternative Hypothesis (H1 or Ha): This hypothesis represents what the researcher aims
to prove, suggesting that there is an effect or a difference.
2. Test Statistic: A test statistic is a standardized value derived from sample data. It is used to
determine whether to reject the null hypothesis. The type of test statistic depends on the nature of
the data and the hypothesis being tested (e.g., t-test, z-test, F-test).
3. Significance Level (α): This is the threshold for determining whether the null hypothesis should
be rejected. Common significance levels are 0.05, 0.01, or 0.10. It represents the probability of
rejecting the null hypothesis when it is actually true (Type I error).
4. P-Value: The p-value is the probability of obtaining a test statistic as extreme as the one
observed, under the assumption that the null hypothesis is true. If the p-value is less than or equal
to the significance level (α), the null hypothesis is rejected.
5. Decision Rule: Based on the test statistic and p-value, a decision is made:
o Reject H0: If the p-value is less than or equal to α, or if the test statistic falls in
the critical region.
o Fail to Reject H0: If the p-value is greater than α.

Steps in Hypothesis Testing:


1. State the Hypotheses: Define the null and alternative hypotheses.
o H0: No effect or difference.
o H1: There is an effect or difference.
2. Choose the Significance Level (α): Determine the threshold for rejecting the null
hypothesis (commonly 0.05).
3. Select the Appropriate Test: Based on the type of data and the hypothesis, choose a statistical
test (e.g., t-test for comparing means, chi-square test for categorical data).
4. Collect Data and Calculate the Test Statistic: Gather the necessary data and compute the test
statistic based on the chosen method.
5. Calculate the P-Value: Determine the p-value associated with the test statistic.
6. Make a Decision: Compare the p-value to the significance level and make a decision about the
null hypothesis.
7. Draw a Conclusion: Interpret the results in the context of the research question, discussing the
implications of the findings.

Type I Error (α):


• Definition: A Type I error occurs when the null hypothesis (H0) is rejected when it is actually
true. This means that you conclude there is an effect or a difference when, in fact, there isn't one.
• Example: Suppose a new drug is tested to see if it is more effective than an existing drug. If the
test concludes that the new drug is more effective (rejects H0) when, in reality, both drugs have
the same effectiveness, this would be a Type I error.
• Significance Level (α): The probability of committing a Type I error is denoted by α,
which is usually set at levels like 0.05 or 0.01. For instance, if α=0.05, it means there is a 5% risk
of rejecting the null hypothesis when it is true.

Type II Error (β):
• Definition: A Type II error occurs when the null hypothesis (H0) is not rejected when it is
actually false. This means that you conclude there is no effect or difference when, in reality, there
is one.
• Example: Using the same drug example, if the test fails to reject the null hypothesis (concludes
no difference) when the new drug is actually more effective than the existing drug, this is a Type
II error.
• Power of the Test: The probability of correctly rejecting a false null hypothesis is called the
power of the test, denoted as 1−β. A higher power means a lower probability of making a Type II
error.

Large Sample Tests:


Large Sample Tests are statistical methods used to make inferences about population parameters based
on sample data when the sample size is sufficiently large (usually n ≥ 30). When sample sizes are large,
certain statistical properties allow us to use the normal distribution as an approximation for other
distributions, making it easier to conduct hypothesis tests and construct confidence intervals.
Key Characteristics of Large Sample Tests
1. Central Limit Theorem (CLT): The CLT states that the sampling distribution of the sample
mean approaches a normal distribution as the sample size increases, regardless of the population's
distribution, provided the samples are independent and identically distributed.
2. Normal Approximation: For large samples, the sample means can be assumed to be normally
distributed. This allows us to use the Z-test for hypothesis testing and constructing confidence
intervals.
3. Robustness: Large sample tests are generally robust, meaning that they can still provide valid
results even if the underlying assumptions are slightly violated.
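A minimal Python sketch of a large-sample Z-test for a mean, with all figures assumed (a sample of n = 64
invoices averaging Rs. 520 against a claimed population mean of Rs. 500 with known σ = 80, tested at
α = 0.05), is:

import math

n, sample_mean = 64, 520                # assumed sample size and sample mean (Rs.)
mu_0, sigma = 500, 80                   # hypothesised population mean and known population SD
critical_value = 1.96                   # two-tailed critical Z at alpha = 0.05

z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
print(z)                                # 2.0

# Decision rule: reject H0 if |z| exceeds the critical value
print("Reject H0" if abs(z) > critical_value else "Fail to reject H0")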

Introduction to MS Excel:
Microsoft Excel is a powerful spreadsheet application that is widely used for data analysis, financial
modeling, and statistical calculations. It provides a user-friendly interface and a variety of tools for
organizing, manipulating, and visualizing data, making it an essential tool in business settings.

Key Features of MS Excel:


1. Spreadsheet Functionality: Excel allows users to create and manage spreadsheets with rows and
columns for data entry.
2. Formulas and Functions: Users can perform calculations using built-in mathematical functions
(e.g., SUM, AVERAGE, COUNT) and custom formulas to analyze data.
3. Data Visualization: Excel provides a range of charting options (e.g., bar charts, line graphs, pie
charts) to visually represent data trends and patterns.
4. Data Management: Users can sort, filter, and organize data, making it easier to analyze and
derive insights.
5. Statistical Analysis Tools: Excel includes various tools for conducting statistical analyses, such
as descriptive statistics, regression analysis, and hypothesis testing.

Use of MS Excel in Business Statistics:


1. Data Analysis: Excel enables businesses to analyze large datasets efficiently. Users can perform
descriptive statistics (mean, median, mode, standard deviation) to summarize data.
2. Trend Analysis: By using Excel’s charting capabilities, businesses can visualize trends over time,
helping to identify patterns and make informed decisions.
3. Forecasting: Excel’s built-in functions allow businesses to create forecasts based on historical
data, aiding in planning and strategy development.
4. Regression Analysis: Excel supports regression analysis, enabling businesses to assess
relationships between variables and predict future outcomes.
5. What-If Analysis: Excel allows users to conduct scenario analysis using tools like Goal Seek and
Data Tables, helping businesses evaluate different decision paths based on variable changes.
6. Reporting: Excel is widely used to create reports and dashboards, providing stakeholders with
insights into business performance and key metrics.
7. Data Visualization: The ability to create dynamic charts and graphs helps present data in an
understandable format for presentations and meetings.
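For example, on a worksheet with sales figures in cells A2:A51 and advertising spend in B2:B51 (the cell
ranges here are assumed for illustration), =AVERAGE(A2:A51) and =STDEV.S(A2:A51) return the mean and
sample standard deviation, =CORREL(A2:A51, B2:B51) returns Karl Pearson's coefficient of correlation, and
=SLOPE(A2:A51, B2:B51) with =INTERCEPT(A2:A51, B2:B51) give the regression coefficients of sales on
advertising spend. The Analysis ToolPak add-in (Data → Data Analysis) also provides ready-made Descriptive
Statistics, Histogram and Regression tools.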