
UNIT-2 Data Preprocessing FullNotes

Data preprocessing is essential for transforming raw data into a clean and usable format for data mining, addressing issues like incompleteness, noise, and inconsistency. Key steps include data cleaning, transformation, and reduction, which enhance accuracy and efficiency. The document outlines various methods and techniques for each preprocessing step, emphasizing their importance in Knowledge Discovery in Databases.

Uploaded by

Varshitha Kn
Copyright: © All Rights Reserved

UNIT-2: Data Pre-processing

Complete Notes (Text-based with Diagrams & Tables)

Prepared as student-friendly exam study material.


1. Introduction
Data preprocessing is a crucial step in Knowledge Discovery in Databases (KDD). Raw data
collected from real-world sources is often incomplete, inconsistent, and noisy. Before applying data
mining algorithms, data must be cleaned, transformed, integrated, and reduced. Preprocessing
ensures higher accuracy, efficiency, and meaningful results.

2. Need for Preprocessing


• Raw data is often incomplete (missing attributes).
• Contains noise and errors due to faulty measurements.
• Inconsistent formats and redundancies across sources.
• Preprocessing improves accuracy and reduces algorithm complexity.
3. Data Cleaning
Data cleaning removes noise, resolves inconsistencies, and handles missing values.

Handling Missing Values:


Method             Description                         Example
-----------------  ----------------------------------  ---------------------------------------
Ignore Record      Remove tuples with missing values.  Drop student record with missing grade.
Fill Constant      Replace with fixed value.           Missing city → 'Unknown'.
Mean/Median/Mode   Statistical replacement.            Income replaced with mean salary.
Predictive Models  Use ML models for estimation.       Predict missing age using regression.
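
The first three methods in the table can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the records, field names, and helper names (`drop_incomplete`, `fill_constant`, `fill_statistic`) are all hypothetical, and `None` is assumed to mark a missing value. The predictive-model method is left out because it needs a trained model.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value (an assumption of this sketch).
records = [
    {"name": "A", "city": "Pune", "income": 30000},
    {"name": "B", "city": None, "income": 50000},
    {"name": "C", "city": "Delhi", "income": None},
    {"name": "D", "city": "Pune", "income": 40000},
]

def drop_incomplete(rows):
    """Ignore Record: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, field, constant):
    """Fill Constant: replace missing values in one field with a fixed value."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_statistic(rows, field, stat=mean):
    """Mean/Median/Mode: replace missing values with a statistic of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = stat(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

complete_only = drop_incomplete(records)               # rows A and D survive
with_city = fill_constant(records, "city", "Unknown")  # B's city becomes 'Unknown'
with_income = fill_statistic(records, "income", mean)  # C's income becomes the mean
with_mode_city = fill_statistic(records, "city", mode) # B's city becomes the mode
```

Dropping records is the simplest option but throws away the valid fields of every incomplete row, so the fill-based methods are usually preferred when many rows have gaps.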

Handling Noisy Data:

• Binning
• Regression
• Clustering
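
As an illustration of binning, the sketch below uses one common variant, smoothing by bin means over equal-frequency bins. The notes only name binning, so the choice of variant, the function name `smooth_by_bin_means`, and the sample prices are assumptions made for this example.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value with the mean of its bin.
    The result is returned in sorted order, not the input order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical price data, smoothed in bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
```

Each value is pulled toward its neighbours, which damps small random fluctuations while keeping the overall trend of the data.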
4. Data Transformation
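Data transformation converts data into formats suitable for mining.
Method           Formula                Example
---------------  ---------------------  ------------------------------------
Min-Max          (x - min)/(max - min)  Score normalization from 0-100 → 0-1
Z-Score          (x - mean)/std         Standardize exam scores.
Decimal Scaling  x / 10^j               Income 12345 → 0.12345 (j = 5)
In decimal scaling, j is the smallest integer for which every scaled value has an absolute value below 1.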
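
The three normalization methods can be sketched as small Python functions. Some assumptions are made here: the z-score uses the population standard deviation (`pstdev`), decimal scaling follows its usual definition of dividing by the smallest power of ten that brings every absolute value below 1, and the inputs are assumed non-constant so that no division by zero occurs. The function names and sample values are illustrative only.

```python
from statistics import mean, pstdev

def min_max(values):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    """Z-score normalization: (x - mean) / std, using the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, with j the smallest integer
    such that the largest absolute value drops below 1."""
    largest = max(abs(x) for x in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values]

# Hypothetical exam scores and incomes.
scores = [0, 50, 100]
scaled_scores = min_max(scores)
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
scaled_incomes = decimal_scaling([12345, 987])
```

Min-max is sensitive to outliers because a single extreme value stretches the range, whereas the z-score is a better fit when the minimum and maximum are unknown or unreliable.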

• Aggregation: Summarizing daily → monthly data.
• Generalization: Replace values with higher-level concepts.
• Attribute Construction: Creating new attributes (e.g., BMI).
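
Aggregation and attribute construction can be illustrated with a short sketch. The data layout is an assumption: daily sales are taken to be `(date, amount)` pairs with dates as `'YYYY-MM-DD'` strings, and a person is a dictionary with hypothetical `weight_kg` and `height_m` fields. BMI is computed with the standard formula, weight in kilograms divided by height in metres squared.

```python
from collections import defaultdict

def monthly_totals(daily_sales):
    """Aggregation: sum daily (date, amount) pairs into per-month totals.
    Dates are assumed to be 'YYYY-MM-DD' strings, so date[:7] is 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount
    return dict(totals)

def add_bmi(person):
    """Attribute construction: derive a new 'bmi' attribute
    from the existing weight and height attributes."""
    bmi = person["weight_kg"] / person["height_m"] ** 2
    return {**person, "bmi": bmi}

# Hypothetical inputs.
sales = [("2024-01-05", 10.0), ("2024-01-20", 5.0), ("2024-02-01", 7.5)]
by_month = monthly_totals(sales)
patient = add_bmi({"weight_kg": 70, "height_m": 2.0})
```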
5. Data Reduction
Data reduction shrinks data volume while preserving analytical value.
• Dimensionality Reduction → PCA, attribute selection.
• Data Compression → encoding (wavelet, Huffman).
• Numerosity Reduction → histograms, regression, clustering.

Original Data → Reduced Data
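
As one illustration of numerosity reduction, an equal-width histogram replaces the raw values with a handful of bucket counts. The sketch below is a simplified version under stated assumptions: buckets have equal width, the input is not constant, and the function name `equal_width_histogram` and sample values are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (bucket_low, bucket_high, count) triples
    using buckets of equal width between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in values:
        # Clamp the index so the maximum value lands in the last bucket.
        index = min(int((x - lo) / width), num_buckets - 1)
        counts[index] += 1
    return [
        (lo + i * width, lo + (i + 1) * width, counts[i])
        for i in range(num_buckets)
    ]

# Hypothetical values reduced to four buckets.
values = [0, 10, 20, 30, 40]
histogram = equal_width_histogram(values, 4)
```

Only the bucket boundaries and counts need to be stored, so the storage cost depends on the number of buckets rather than on the number of original values.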


6. Data Mining Perspectives
• Task Relevant Data: Selecting only required attributes.
• Kinds of Knowledge: Predictive (classification, regression), Descriptive (clustering, association).
• Discretization: Continuous → categorical (e.g., Age: Young, Adult, Senior).
• Concept Hierarchy: Levels of abstraction (City → State → Country).

Country
  └ State
      └ City
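
Discretization and concept hierarchies can both be sketched as simple lookups. The age cut points (18 and 60) are illustrative assumptions, since the notes give only the labels, and the small city and state tables are sample data rather than a complete hierarchy.

```python
def discretize_age(age):
    """Discretization: map a continuous age to a categorical label.
    The cut points 18 and 60 are assumptions made for this example."""
    if age < 18:
        return "Young"
    if age < 60:
        return "Adult"
    return "Senior"

# Sample concept hierarchy: City → State → Country.
CITY_TO_STATE = {"Bengaluru": "Karnataka", "Mysuru": "Karnataka", "Chennai": "Tamil Nadu"}
STATE_TO_COUNTRY = {"Karnataka": "India", "Tamil Nadu": "India"}

def generalize(city, level):
    """Roll a city up to the requested level: 'city', 'state', or 'country'."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

labels = [discretize_age(a) for a in (12, 35, 71)]
state = generalize("Mysuru", "state")
country = generalize("Chennai", "country")
```

Rolling values up the hierarchy reduces the number of distinct values per attribute, which lets patterns be mined at a coarser and often more meaningful level.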
7. Quick Revision Summary
• Preprocessing = Cleaning + Transformation + Reduction.
• Data Cleaning → Handle missing values, noise, inconsistencies.
• Data Transformation → Normalization, aggregation, generalization.
• Data Reduction → Dimensionality reduction, compression, numerosity reduction.
• Data Mining Perspectives → Task relevant data, knowledge to be mined, discretization,
concept hierarchies.
