Data Mining: Concepts & Techniques
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Questions:
What is the profile of a reader of a car magazine?
Is there any correlation between an interest in cars and an interest in comics?
Data Selection
Select the information about people who have subscribed to a magazine
Cleaning
Pollution: typing errors, clients moving from one place to another without notifying a change of address, people giving incorrect information about themselves
Pattern Recognition Algorithms
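One way pattern recognition can help with cleaning is by flagging records that probably describe the same person despite a typing error. The following is a minimal sketch using fuzzy string matching from the Python standard library. The record layout, the sample names, and the 0.85 threshold are assumptions made for illustration, not part of the original material.

```python
from difflib import SequenceMatcher

# Hypothetical client records; names and addresses are invented for illustration.
clients = [
    {"id": 1, "name": "Mr. Johnson", "address": "1 Downing Street"},
    {"id": 2, "name": "Mr. Jonson", "address": "1 Downing Street"},
    {"id": 3, "name": "Ms. Clinton", "address": "2 Boulevard"},
]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(records, threshold=0.85):
    """Return pairs of record ids whose name and address are both very similar."""
    pairs = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            same_name = similarity(left["name"], right["name"]) >= threshold
            same_address = similarity(left["address"], right["address"]) >= threshold
            if same_name and same_address:
                pairs.append((left["id"], right["id"]))
    return pairs


duplicates = likely_duplicates(clients)
print(duplicates)
```

The pairs returned are only candidates; whether two records really belong to one client is a decision for someone who knows the data.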
Lack of domain consistency
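A domain-consistency check can be sketched as a function that replaces implausible values with an explicit "unknown". The placeholder date, the age limits, and the record layout below are hypothetical; real rules would come from knowledge of the source system.

```python
from datetime import date

# Assumed rule (hypothetical): this date was entered whenever the real one was unknown.
PLACEHOLDER_DATES = {date(1901, 1, 1)}


def clean_birth_date(value, today=date(2000, 1, 1)):
    """Return the birth date if it is plausible, otherwise None (unknown)."""
    if value is None or value in PLACEHOLDER_DATES:
        return None
    age = today.year - value.year
    if not 0 <= age <= 120:  # assumed plausible age range
        return None
    return value


records = [
    {"client": 23003, "birth_date": date(1964, 4, 13)},
    {"client": 23009, "birth_date": date(1901, 1, 1)},   # placeholder value
    {"client": 23013, "birth_date": date(2030, 6, 20)},  # lies in the future
]

# Copy each record, keeping the birth date only when it passes the check.
cleaned = [{**r, "birth_date": clean_birth_date(r["birth_date"])} for r in records]
```

Marking a value as unknown rather than guessing a replacement keeps the mining step from discovering patterns that only reflect the placeholder.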
Enrichment
Need extra information about the clients: date of birth, income, amount of credit, and whether or not an individual owns a car or a house
The new information needs to be easily joined to the existing client records
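Joining the extra information to the client records can be sketched as a lookup on a shared key. The field names and the `client_id` key below are assumptions for illustration; the essential point is that both data sets must identify a client in the same way.

```python
# Hypothetical existing client records.
clients = [
    {"client_id": 1, "name": "Johnson"},
    {"client_id": 2, "name": "Clinton"},
]

# Hypothetical purchased information, keyed by the same client id.
extra = {
    1: {"birth_date": "1964-04-13", "income": 18500, "credit": 17800,
        "car_owner": True, "house_owner": False},
}

ENRICHMENT_FIELDS = ("birth_date", "income", "credit", "car_owner", "house_owner")


def enrich(client_rows, extra_by_id):
    """Left-join extra fields onto each client; missing values become None."""
    enriched_rows = []
    for client in client_rows:
        info = extra_by_id.get(client["client_id"], {})
        row = dict(client)  # copy so the original record is not modified
        for field in ENRICHMENT_FIELDS:
            row[field] = info.get(field)
        enriched_rows.append(row)
    return enriched_rows


enriched = enrich(clients, extra)
```

Clients without extra information are kept with empty fields, so the decision about whether they carry enough information is left to a later step.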
Extract more knowledge
Coding
We select only those records that have enough information to be of value (row)
Project the fields in which we are interested (column)
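The row selection and column projection just described can be sketched as a single filter. Which fields count as "enough information" is an assumption here; the function and field names are illustrative only.

```python
def select_and_project(records, required, wanted):
    """Keep rows where every required field is known, then keep only wanted columns."""
    return [
        {field: row.get(field) for field in wanted}
        for row in records
        if all(row.get(field) is not None for field in required)
    ]


# Hypothetical enriched client records.
records = [
    {"client_id": 1, "name": "Johnson", "age": 32, "income": 18500, "phone": "555-0101"},
    {"client_id": 2, "name": "Clinton", "age": None, "income": 36000, "phone": None},
    {"client_id": 3, "name": "King", "age": 45, "income": None, "phone": "555-0103"},
]

usable = select_and_project(
    records,
    required=("age", "income"),             # rows need both to be of value
    wanted=("client_id", "age", "income"),  # columns of interest
)
```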
Code the information which is too detailed
Address to region
Birth date to age
Divide income by 1000
Divide credit by 1000
Convert cars yes-no to 1-0
Convert purchase date to month numbers starting from 1990
The way in which we code the information will determine the type of patterns we find.
Coding has to be performed repeatedly in order to get the best results.
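The coding steps listed above can be sketched as one function applied to each record. The city-to-region lookup, the reference date used to compute age, and the field names are assumptions for illustration; month numbering follows the slide, with January 1990 counted as month 1.

```python
from datetime import date

# Hypothetical lookup from city to region code.
REGION_BY_CITY = {"London": 1, "Leeds": 2}


def code_record(rec, today=date(1997, 1, 1)):
    """Replace overly detailed fields by coarser codes."""
    birth = rec["birth_date"]
    # Subtract one year if the birthday has not yet occurred this year.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    purchase = rec["purchase_date"]
    return {
        "region": REGION_BY_CITY.get(rec["city"]),       # address to region
        "age": age,                                      # birth date to age
        "income": rec["income"] / 1000,                  # income in thousands
        "credit": rec["credit"] / 1000,                  # credit in thousands
        "car_owner": 1 if rec["car_owner"] == "yes" else 0,
        "month_nr": (purchase.year - 1990) * 12 + purchase.month,
    }


coded = code_record({
    "city": "London",
    "birth_date": date(1964, 4, 13),
    "income": 18500,
    "credit": 17800,
    "car_owner": "yes",
    "purchase_date": date(1994, 4, 15),
})
```

Because the coding determines which patterns can be found, each choice here (region granularity, units, month numbering) is a parameter worth revisiting between mining runs.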
We are interested in the relationships between readers of different magazines
Perform flattening operation
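Flattening can be sketched as turning one record per subscription into one record per client, with a 0/1 indicator for each magazine. The magazine list and the input layout are assumptions for illustration.

```python
# Magazine types mentioned in the example; the exact list is an assumption.
MAGAZINES = ("car", "house", "sports", "comics")

# Hypothetical subscription records: one (client_id, magazine) pair per subscription.
subscriptions = [
    (1, "car"),
    (1, "comics"),
    (2, "house"),
    (2, "sports"),
    (3, "comics"),
]


def flatten(pairs, magazines=MAGAZINES):
    """Return {client_id: {magazine: 0 or 1}} with one entry per client."""
    rows = {}
    for client_id, magazine in pairs:
        # Create an all-zero row the first time a client is seen.
        row = rows.setdefault(client_id, {m: 0 for m in magazines})
        if magazine in row:
            row[magazine] = 1
    return rows


flat = flatten(subscriptions)
```

With one row per client, relationships between readers of different magazines become comparisons between columns of the same record.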
Data mining
We may find the following rules:
A customer with credit > 13000 and aged between 22 and 31 who has subscribed to a comics magazine at time T will very likely subscribe to a car magazine five years later
The number of house magazines sold to customers with credit between 12000 and 31000 living in region 4 is increasing
A customer with credit between 5000 and 10000 who reads a comics magazine will very likely become a customer with credit between 12000 and 31000 who reads a sports and a house magazine after 12 years
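How strongly the data backs a rule of this kind can be sketched with the usual support and confidence measures. The example below is a simplification: it ignores the time dimension of the first rule, uses uncoded credit values, and runs on four invented records, so the numbers illustrate the calculation only.

```python
def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) of the rule 'antecedent implies consequent'."""
    matching = [r for r in rows if antecedent(r)]
    both = [r for r in matching if consequent(r)]
    support = len(both) / len(rows) if rows else 0.0
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical flattened client records.
rows = [
    {"credit": 17800, "age": 25, "comics": 1, "car": 1},
    {"credit": 14000, "age": 30, "comics": 1, "car": 0},
    {"credit": 9000, "age": 27, "comics": 1, "car": 1},
    {"credit": 20000, "age": 45, "comics": 0, "car": 1},
]


def young_comics_reader(r):
    """Antecedent: credit > 13000, aged 22 to 31, subscribed to a comics magazine."""
    return r["credit"] > 13000 and 22 <= r["age"] <= 31 and r["comics"] == 1


def car_reader(r):
    """Consequent: subscribed to a car magazine."""
    return r["car"] == 1


support, confidence = rule_stats(rows, young_comics_reader, car_reader)
print(support, confidence)
```

Support is the share of all clients for whom both sides of the rule hold; confidence is the share of clients matching the condition who also match the conclusion.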
Business-Question-Driven Process
End User
Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Business Analyst
Data Analyst
DBA