UNIT-2: Data Pre-processing
Complete Notes (Text-based with Diagrams & Tables)
Prepared as student-friendly exam study material.
1. Introduction
Data preprocessing is a crucial step in Knowledge Discovery in Databases (KDD). Raw data
collected from real-world sources is often incomplete, inconsistent, and noisy. Before applying data
mining algorithms, data must be cleaned, integrated, transformed, and reduced. Preprocessing
improves the accuracy and efficiency of mining and leads to more meaningful results.
2. Need for Preprocessing
• Raw data is often incomplete (missing attributes or attribute values).
• It contains noise and errors, for example due to faulty measurements.
• Formats are inconsistent and data is redundant across sources.
• Preprocessing improves the accuracy of results and reduces the computational cost of mining algorithms.
3. Data Cleaning
Data cleaning removes noise, resolves inconsistencies, and handles missing values.
Handling Missing Values:
Method            | Description                        | Example
Ignore Record     | Remove tuples with missing values. | Drop student record with missing grade.
Fill Constant     | Replace with fixed value.          | Missing city → 'Unknown'.
Mean/Median/Mode  | Statistical replacement.           | Income replaced with mean salary.
Predictive Models | Use ML models for estimation.      | Predict missing age using regression.
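The first three strategies in the table can be sketched in plain Python. This is only a minimal illustration: the records, the field names (`city`, `income`), and the helper function names are hypothetical and chosen for this example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "city": "Pune", "income": 30000},
    {"name": "B", "city": None, "income": 50000},
    {"name": "C", "city": "Delhi", "income": None},
    {"name": "D", "city": "Pune", "income": 40000},
]

def drop_incomplete(rows):
    """Ignore Record: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, field, constant):
    """Fill Constant: replace missing values in one field with a fixed value."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_mean(rows, field):
    """Mean replacement: use the mean of the observed values of the field."""
    observed = [r[field] for r in rows if r[field] is not None]
    avg = mean(observed)
    return [{**r, field: avg if r[field] is None else r[field]} for r in rows]

complete = drop_incomplete(records)                    # rows A and D remain
with_city = fill_constant(records, "city", "Unknown")  # B's city is filled in
with_income = fill_mean(records, "income")             # C's income is filled in
```

Each helper returns a new list and leaves `records` unchanged, so the strategies can be compared side by side on the same data.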
Handling Noisy Data:
• Binning
• Regression
• Clustering
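As an illustration of binning, the sketch below smooths values by bin means: the values are sorted, split into equal-frequency bins, and each value is replaced by the mean of its bin. The function name, the bin size, and the sample values are assumptions made for this example; smoothing by bin medians or bin boundaries would follow the same pattern.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items, and
    replace every value with the mean of its bin.
    The result is returned in sorted order, not in the input order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```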
4. Data Transformation
Data transformation converts data into forms suitable for mining.
Method          | Formula                                                   | Example
Min-Max         | (x - min)/(max - min)                                     | Score normalization from 0-100 → 0-1
Z-Score         | (x - mean)/std                                            | Standardize exam scores.
Decimal Scaling | x / 10^j (j = smallest integer such that max|x|/10^j < 1) | Income 12345 → 0.12345 (j = 5)
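The three normalization methods can be sketched as follows. This is a minimal version under stated assumptions: the function names are illustrative, the Z-score uses the population standard deviation (the notes do not say which variant is meant), and the inputs are assumed to contain at least two distinct values so that no division by zero occurs.

```python
from statistics import mean, pstdev

def min_max(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    """Z-score normalization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, where j is the smallest
    non-negative integer such that every scaled |x| is below 1."""
    largest = max(abs(x) for x in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values]

scores = [0, 50, 100]                # illustrative exam scores
print(min_max(scores))               # [0.0, 0.5, 1.0]
print(decimal_scaling([12345]))      # [0.12345]
```

Min-max normalization is sensitive to outliers because a single extreme value changes `min` or `max`, while the Z-score is usually preferred when the true minimum and maximum are unknown.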
• Aggregation: Summarizing data at a coarser level (e.g., daily → monthly data).
• Generalization: Replacing low-level values with higher-level concepts (e.g., city → country).
• Attribute Construction: Creating new attributes (e.g., BMI).
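Aggregation and attribute construction from the list above can be illustrated with a short sketch. The sales figures, the date format, and the function names are hypothetical; BMI is computed with the standard formula weight (kg) divided by height (m) squared.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2024-01-05", 100),
    ("2024-01-20", 150),
    ("2024-02-03", 200),
]

def aggregate_monthly(rows):
    """Aggregation: sum daily amounts into one total per month."""
    totals = defaultdict(int)
    for date, amount in rows:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        totals[month] += amount
    return dict(totals)

def bmi(weight_kg, height_m):
    """Attribute construction: derive BMI from weight and height."""
    return weight_kg / height_m ** 2

monthly = aggregate_monthly(daily_sales)
print(monthly)  # {'2024-01': 250, '2024-02': 200}
```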
5. Data Reduction
Data reduction shrinks the data volume while preserving its analytical value.
• Dimensionality Reduction → PCA, attribute selection.
• Data Compression → encoded representations (lossy: wavelet transforms; lossless: Huffman coding).
• Numerosity Reduction → histograms, regression, clustering.
[Diagram: Original Data → Reduced Data]
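As one possible example of numerosity reduction, the sketch below replaces raw values with an equal-width histogram, so that only the bucket edges and counts need to be stored. The function name, the bucket count, and the sample values are assumptions for illustration, and the input is assumed to contain at least two distinct values.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as bucket edges and per-bucket counts.
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in values:
        # Clamp the index so the maximum value lands in the last bucket.
        index = min(int((x - lo) / width), num_buckets - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts

values = [1, 2, 2, 3, 5, 6, 8, 9, 10, 10]  # illustrative data
edges, counts = equal_width_histogram(values, num_buckets=3)
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(counts)  # [4, 2, 4]
```

Ten raw values are reduced to four edges and three counts; the saving grows with the size of the original data, at the cost of losing the exact values inside each bucket.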
6. Data Mining Perspectives
• Task-Relevant Data: Selecting only the attributes required for the mining task.
• Kinds of Knowledge: Predictive (classification, regression), Descriptive (clustering, association).
• Discretization: Continuous → categorical (e.g., Age: Young, Adult, Senior).
• Concept Hierarchy: Levels of abstraction (City → State → Country).
[Diagram: concept hierarchy from most general to most specific: Country → State → City]
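Discretization and concept hierarchies can be sketched as simple lookups. The age cut-offs (18 and 60) are illustrative assumptions, since the notes give only the category names, and the city and state mappings are a small hypothetical sample.

```python
def discretize_age(age):
    """Discretization: map a continuous age to a category.
    The cut-offs 18 and 60 are assumed for illustration."""
    if age < 18:
        return "Young"
    if age < 60:
        return "Adult"
    return "Senior"

# Sample concept hierarchy: City -> State -> Country.
CITY_TO_STATE = {
    "Pune": "Maharashtra",
    "Mumbai": "Maharashtra",
    "Chennai": "Tamil Nadu",
}
STATE_TO_COUNTRY = {"Maharashtra": "India", "Tamil Nadu": "India"}

def generalize(city, level):
    """Climb the hierarchy to the requested level of abstraction."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

print(discretize_age(35))             # Adult
print(generalize("Pune", "state"))    # Maharashtra
print(generalize("Pune", "country"))  # India
```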
7. Quick Revision Summary
• Preprocessing = Cleaning + Integration + Transformation + Reduction.
• Data Cleaning → Handle missing values, noise, inconsistencies.
• Data Transformation → Normalization, aggregation, generalization.
• Data Reduction → Dimensionality reduction, compression, numerosity reduction.
• Data Mining Perspectives → Task-relevant data, kinds of knowledge to be mined, discretization,
concept hierarchies.