Data Mining Information
Define data mining. Why are there many different names and definitions for data
mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition is a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and subsequent knowledge from large databases. This includes most types of automated data analysis. A third definition: data mining is the process of finding mathematical patterns in (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models.
Data mining has many names and definitions because the term has been stretched beyond its original limits by some software vendors, who use the popularity of data mining to label most forms of data analysis as data mining in order to increase sales.
What recent factors have increased the popularity of data mining?
The following are some of the most pronounced reasons:
- More intense competition at the global scale, driven by customers' ever-changing needs and wants in an increasingly saturated marketplace.
- Significant reduction in the cost of hardware and software for data storage and processing.
Patterns have been manually extracted from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches. As datasets have grown in size and complexity, direct manual data analysis has increasingly been augmented with indirect, automatic data processing tools that use sophisticated methodologies, methods, and algorithms. The manifestation of this evolution of automated and semiautomated means of processing large datasets is now commonly referred to as data mining.
Identify at least five specific applications of data mining and list five common
characteristics of these applications.
This question expands on the prior question by asking for common characteristics.
Several such applications and their characteristics are listed in the text (pp. 145-147):
CRM, banking, retailing and logistics, manufacturing and production, brokerage,
insurance, computer hardware and software, government, travel, healthcare,
medicine, entertainment, homeland security, and sports.
What do you think is the most prominent application area for data mining? Why?
Students' answers will differ depending on which of the applications (most likely banking, retailing and logistics, manufacturing and production, government, healthcare, medicine, or homeland security) they think is most in need of greater certainty. Their reasons for selection should relate to the application area's need for better certainty and its ability to pay for the investments in data mining.
Can you think of other application areas for data mining not discussed in this section?
Explain.
Students should be able to identify an area that could benefit from better prediction or greater certainty. Answers will vary depending on their creativity.
Why do you think the early phases (understanding of the business and understanding of
the data) take the longest in data mining projects?
Students should explain that the early steps are the most unstructured phases
because they involve learning. Those phases (learning/understanding) cannot be
automated. Extra time and effort are needed upfront because any mistake in
understanding the business or data will most likely result in a failed BI project.
List and briefly define the phases in the CRISP-DM process.
CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process consists of six phases: business understanding, data understanding, data preparation, model building, testing and evaluation, and deployment. First, an understanding of the business issues to be addressed and an understanding of the data are developed concurrently. Next, the data are prepared for modeling, models are built, model results are tested and evaluated, and the accepted models are deployed for regular use.
What are the main data preprocessing steps? Briefly describe each step and provide
relevant examples.
Data preprocessing is essential to any successful data mining study. Good data
leads to good information; good information leads to good decisions. Data
preprocessing includes four main steps (listed in Table 4.4 on page 153):
- Data consolidation: access, collect, select, and filter data.
- Data cleaning: handle missing data, reduce noise, fix errors.
- Data transformation: normalize the data, aggregate data, construct new attributes.
- Data reduction: reduce the number of attributes and records; balance skewed data.
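For illustration only (not part of the text), a minimal pandas sketch of these four steps might look like the following; the table, column names, and thresholds are hypothetical assumptions:

```python
# Hypothetical sketch of the four preprocessing steps using pandas.
# The column names and values are illustrative assumptions, not from the text.
import pandas as pd

# 1. Data consolidation: access, collect, select, and filter the relevant data
#    (in practice this would pull from databases/files; here a small in-memory table)
raw = pd.DataFrame({
    "age":     [34, 51, 150, 29, 42, 29],
    "income":  [48000, None, 72000, 39000, 61000, 39000],
    "region":  ["N", "S", "S", "W", "N", "W"],
    "churned": [0, 1, 0, 0, 1, 0],
})
df = raw[["age", "income", "churned"]]            # select relevant attributes
df = df[df["age"].between(18, 99)].copy()         # filter implausible records

# 2. Data cleaning: handle missing values, remove duplicate/erroneous records
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# 3. Data transformation: normalize values and construct new attributes
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
df["is_senior"] = (df["age"] >= 65).astype(int)

# 4. Data reduction: drop redundant attributes and balance the skewed class
df = df.drop(columns=["income"])
n = df["churned"].value_counts().min()
balanced = pd.concat([g.sample(n, random_state=0) for _, g in df.groupby("churned")])
print(balanced)
```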
How does CRISP-DM differ from SEMMA?
The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive approach to data mining projects, including understanding of the business and the relevant data, whereas SEMMA implicitly assumes that the data mining project's goals and objectives, along with the appropriate data sources, have been identified and understood.
Rough sets. This method takes into account the partial membership of
class labels to predefined categories in building models (collection of
rules) for classification problems.
What are some of the criteria for comparing and selecting the best classification technique?
Commonly used criteria include predictive accuracy, speed (the computational cost of building and using the model), robustness (the ability to cope with noisy or missing data), scalability (the ability to work efficiently with very large datasets), and interpretability (how easily decision makers can understand the model's reasoning).
What are the commonalities and differences between biological and artificial neural
networks?
Biological neural networks are composed of many massively interconnected
neurons. Each neuron possesses axons and dendrites, fingerlike projections that
enable the neuron to communicate with its neighboring neurons by transmitting
and receiving electrical and chemical signals. More or less resembling the structure of their biological counterparts, artificial neural networks (ANNs) are composed of interconnected, simple processing elements (PEs) called artificial neurons.
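To make the analogy concrete, a minimal illustrative sketch of a single processing element follows (an assumed example, not from the text): the artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function, loosely mirroring how a biological neuron integrates incoming signals before firing.

```python
import numpy as np

def artificial_neuron(inputs, weights, bias):
    """One processing element (PE): weighted sum of the inputs plus a bias,
    passed through a sigmoid activation function."""
    net = np.dot(weights, inputs) + bias       # combine the incoming "signals"
    return 1.0 / (1.0 + np.exp(-net))          # activation (firing strength)

# Example with three hypothetical inputs and weights
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
print(artificial_neuron(x, w, bias=0.1))       # a value between 0 and 1
```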
What is a neural network architecture? What are the most common neural network
architectures?
Each ANN is composed of a collection of neurons (or PEs) that are grouped into layers. Several hidden layers can be placed between the input and output layers, although it is common to use only one hidden layer. This layered structure of an ANN is commonly called a multilayer perceptron (MLP). The MLP architecture is known to produce highly accurate prediction models for both classification and regression type prediction problems. In addition to the MLP, other common ANN architectures include Kohonen's self-organizing feature maps (commonly used for clustering type problems), the Hopfield network (used to solve complex computational problems), recurrent networks (which, as opposed to feedforward networks, allow backward connections as well), and probabilistic networks (where the weights are adjusted based on statistical measures obtained from the training data).
How does an MLP type neural network learn?
Backpropagation is the learning mechanism for feedforward MLP networks. It follows an iterative process in which the difference between the network output and the desired output is fed back to the network so that the network weights are gradually adjusted to produce outputs closer to the desired values.
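For illustration only, a minimal NumPy sketch of this feedforward/backpropagation loop is given below; the one-hidden-layer architecture, sigmoid activations, learning rate, and toy XOR data are all simplifying assumptions rather than anything prescribed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs (XOR)
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 neurons and one output neuron
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for epoch in range(10000):
    # Forward pass: compute the network's current output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # The difference between network output and desired output is fed back...
    err = out - y
    d_out = err * out * (1 - out)            # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # ...and propagated to the hidden layer

    # ...and the weights are gradually adjusted
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))   # outputs should approach the desired values 0, 1, 1, 0
```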
Why do you think the most popular tools are developed by statistics companies?
Data mining techniques involve the use of statistical analysis and modeling, so data mining is a natural extension of these companies' existing business offerings.
What are the most popular free data mining tools?
Probably the most popular free and open-source data mining tool is Weka. Others include RapidMiner and Microsoft's SQL Server.
What are the main differences between commercial and free data mining software tools?
The main difference between commercial tools, such as Enterprise Miner, PASW,
and Statistica, and free tools, such as Weka and RapidMiner, is computational
efficiency. The same data mining task involving a rather large dataset may take considerably longer to complete with the free software, and in some cases it may not even be feasible (i.e., the tool may crash due to inefficient use of computer memory).
What would be your top five selection criteria for a data mining tool? Explain.
Students' answers will differ. Criteria they are likely to mention include cost, user interface, ease of use, computational efficiency, hardware compatibility, type of business problem, vendor support, and vendor reputation.
Data mining is only for large firms that have lots of customer data.
What do you think are the reasons for these myths about data mining?
Students' answers will differ. Some answers might relate to fear of analytics, fear of the unknown, or fear of looking dumb.
What are the most common data mining mistakes? How can they be minimized and/or
eliminated?
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can
and cannot do
3. Leaving insufficient time for data preparation. It takes more effort than
one often expects
4. Looking only at aggregated results and not at individual records
5. Being sloppy about keeping track of the mining procedure and results
6. Ignoring suspicious findings and quickly moving on
7. Running mining algorithms repeatedly and blindly (It is important to think
hard enough about the next stage of data analysis. Data mining is a very
hands-on activity.)
8. Believing everything you are told about data
9. Believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures
them
Ways to minimize these risks are basically the reverse of these items.
Define data mining. Why are there many names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition is a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and subsequent knowledge from large databases. This includes most types of automated data analysis. A third definition: data mining is the process of finding mathematical patterns in (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models.
Data mining has many names and definitions because the term has been stretched beyond its original limits by some software vendors, who use the popularity of data mining to label most forms of data analysis as data mining in order to increase sales.
What are the main reasons for the recent popularity of data mining?
The following are some of the most pronounced reasons:
- More intense competition at the global scale, driven by customers' ever-changing needs and wants in an increasingly saturated marketplace.
- Significant reduction in the cost of hardware and software for data storage and processing.
Discuss the main data mining methods. What are the fundamental differences
among them?
Three broad categories of data mining methods are prediction (classification or
regression), clustering, and association.
Prediction is the act of telling about the future. It differs from simple guessing
by taking into account the experiences, opinions, and other relevant information
in conducting the task of foretelling. A term that is commonly associated with
prediction is forecasting. Even though many believe that these two terms are
synonymous, there is a subtle but critical difference between the two. Whereas
prediction is largely experience and opinion based, forecasting is data and model
based.
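To contrast the three categories side by side, a short hedged sketch follows (not drawn from the text; the synthetic data and the particular algorithm choices, a decision tree and k-means, are illustrative assumptions):

```python
# Hypothetical sketch contrasting prediction, clustering, and association.
# Data, feature values, and algorithm choices are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # prediction (classification)
from sklearn.cluster import KMeans                # clustering
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(0)

# Prediction (classification): learn a model that maps features to a known label
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier().fit(X, y)
print("predicted class:", clf.predict([[0.5, 0.5]]))

# Clustering: group records without any label, based on similarity alone
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))

# Association: find items that frequently occur together in transactions
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "bread"}]
pair_counts = Counter(frozenset(p) for b in baskets
                      for p in combinations(sorted(b), 2))
print("most common pair:", pair_counts.most_common(1))
```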
What are the main data mining application areas? Discuss the commonalities of
these areas that make them a prospect for data mining studies.
Applications are listed near the beginning of this section (pp. 145-147): CRM,
banking, retailing and logistics, manufacturing and production, brokerage,
insurance, computer hardware and software, government, travel, healthcare,
medicine, entertainment, homeland security, and sports.
The commonalities are the need for predictions and forecasting for planning
purposes and to support decision making.
Why do we need a standardized data mining process? What are the most
commonly used data mining processes?
In order to systematically carry out data mining projects, a general process is
usually followed. Similar to other information systems initiatives, a data mining
project must follow a systematic project management process to be successful.
Several data mining processes have been proposed: CRISP-DM, SEMMA, and
KDD.
Discuss the differences between the two most commonly used data mining processes.
The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive approach to data mining projects, including understanding of the business and the relevant data, whereas SEMMA implicitly assumes that the data mining project's goals and objectives, along with the appropriate data sources, have been identified and understood.
Why do you think that consulting companies are more likely to use data mining
tools and techniques? What specific value proposition do they offer?
Consulting companies use data mining tools and techniques because the results
are valuable to their clients. Consulting companies can develop data mining
expertise and invest in the hardware and software and then earn a return on those
investments by selling those services. Data mining can lead to insights that
provide a competitive advantage to their clients.
What was the problem that argonauten360 helped solve for a call-by-call
provider?
It is a very competitive business, and the success of the call-by-call
telecommunications provider depends greatly on attractive per-minute calling
rates. Rankings of those rates are widely published, and the key is to be ranked
somewhere in the top-five lowest-cost providers while maintaining the best
possible margins.
Can you think of other problems for telecommunication companies that are likely
to be solved with data mining?
Predicting the volume of calls for customer service based on time of day.
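As a hedged illustration of that idea, the toy sketch below fits a regression model to synthetic hourly call counts; the data, the choice of a random forest, and the numbers are purely hypothetical, and a real study would use the provider's own call logs.

```python
# Illustrative sketch: predict hourly call volume from the time of day.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=500)
# Synthetic pattern: a daytime peak plus random noise
calls = 100 + 80 * np.sin((hours - 6) / 24 * 2 * np.pi).clip(min=0) \
        + rng.normal(0, 10, 500)

model = RandomForestRegressor(random_state=0).fit(hours.reshape(-1, 1), calls)
print("expected calls at 2 p.m.:", round(model.predict([[14]])[0]))
```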