Data Mining FAQ

Q. What is data mining?

Data mining is the semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets.

Q. Why is data mining important?

There is more and more digital data being collected, processed, managed and archived every day. Algorithms, software tools, and systems to mine it are critical to a wide variety of problems in business, science, national defense, engineering, and health care.

Q. What are some commercial success stories in data mining?

Data mining has been applied successfully in a number of different fields, including: a) credit card fraud detection by HNC, which was recently acquired by FICO; b) credit card acquisition and risk management by American Express; and c) product recommendations by Amazon.

Q. What are the historical roots of data mining?

From a business perspective, data mining's roots are in direct marketing and financial services, which have used statistical modeling for at least the past two decades. From a technical perspective, data mining is beginning to emerge as a separate discipline with roots in a) statistics, b) machine learning, c) databases, and d) high performance computing.

Q. What are some of the different techniques used in data mining?

There are several different types of data mining, including:

  1. Predictive models. These types of models predict how likely an event is. Usually, the higher a score, the more likely the event is. For example, a predictive model might score how likely a credit card transaction is to be fraudulent, how likely an airline passenger is to be a terrorist, or how likely a company is to go bankrupt.
  2. Summary models. These models summarize data. For example, a cluster model can be used to divide credit card transactions or airline passengers into different groups depending upon their characteristics.
  3. Network models. These types of models uncover certain structures in data represented by nodes and links. For example, a credit card fraud ring may surreptitiously collect credit card numbers at a pawn shop and then use them for online computer purchases. Here the nodes are consumers and merchants and the links are credit card transactions. Similarly, a network model for a terrorist cell might use nodes representing individuals and links representing meetings.
  4. Association models. Sometimes certain events frequently occur together, for example purchases of certain items, such as beer and pretzels, or a sequence of events associated with component failure. Association models are used to find and characterize these co-occurrences.
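
The co-occurrence idea behind association models (item 4) can be sketched in a few lines of Python. This is only an illustration: the function name pair_support and the basket data are made up for the example, and real association mining tools do considerably more (for instance, pruning rare items and scoring rules).

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both}."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical purchase data, not taken from any real data set.
baskets = [
    ["beer", "pretzels", "milk"],
    ["beer", "pretzels"],
    ["milk", "bread"],
    ["beer", "bread", "pretzels"],
]

support = pair_support(baskets)
print(support[("beer", "pretzels")])  # beer and pretzels share 3 of 4 baskets
```

Pairs with high support are candidates for the "occur frequently together" patterns described above.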

Q. What are the major steps in data mining?

  1. Data cleaning. The first step is to clean and prepare the data for data mining and statistical modeling. This is usually the most challenging step.
  2. Data mart. The next step is to create a data mart containing the cleaned and prepared data.
  3. Derived attributes. It is rare for a model to be built using only the attributes present in the cleaned data; rather, additional attributes called derived attributes are usually defined. As a simple example, a stock in the S&P 500 has a price and earnings associated with it, but the ratio of the price to the earnings is more important for many applications than either attribute considered by itself. The construction of derived attributes from the raw data is sometimes called shaping the data. Standards, such as the Data Extraction and Transformation Markup Language (DXML), are beginning to emerge for defining the common data shaping operations needed in data mining.
  4. Modeling. Once the data is prepared and the data mart is created, one or more statistical or data mining models are built. Today, statistical and data mining models can be described in an application- and platform-independent XML interchange format called the Predictive Model Markup Language or PMML.
  5. Post-processing. It is common to normalize the outputs of data mining models and to apply business rules to the inputs and the outputs of the models. This is to ensure that the scores and other outputs of the models are consistent with the overall business processes the models are supporting.
  6. Deployment. Once a statistical or data mining model has been produced by the steps above, the next phase is to deploy the model in operational systems. Deployment usually consists of three different activities. First, data is scored using the model, either on a periodic basis (daily, weekly, or monthly) or on a real-time or event-driven basis. Second, these scores are deployed into operational systems and also used as the basis for various reports. Third, on a periodic basis, say monthly, a new model is built and compared to the existing model. If required, the old model is replaced by the new model.
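
The derived attribute in step 3 can be illustrated with a short Python sketch. The record layout, the field name pe_ratio, and the sample tickers and values are assumptions made for this example; the only idea taken from the text is that the price-to-earnings ratio is computed from two raw attributes.

```python
def add_price_to_earnings(records):
    """Return copies of the records with a derived 'pe_ratio' attribute."""
    shaped = []
    for rec in records:
        row = dict(rec)  # copy, so the cleaned input data is left untouched
        if row["earnings"] > 0:
            row["pe_ratio"] = row["price"] / row["earnings"]
        else:
            # Zero or negative earnings give no meaningful ratio.
            row["pe_ratio"] = None
        shaped.append(row)
    return shaped


# Hypothetical cleaned records, not real market data.
stocks = [
    {"ticker": "AAA", "price": 50.0, "earnings": 2.5},
    {"ticker": "BBB", "price": 12.0, "earnings": 0.0},
]

shaped = add_price_to_earnings(stocks)
print(shaped[0]["pe_ratio"])  # 50.0 / 2.5
```

Keeping the shaping step as a separate function means the same derivation can be applied both when the model is built and when new data is scored.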

Q. What are the differences between predictive models, business rules, and score cards?

Predictive models use historical data to predict future events, for example the likelihood that a credit card transaction is fraudulent or that an airline passenger will commit a terrorist act. Business rules ensure that business processes follow agreed-upon procedures. For example, business procedures may dictate that a predictive model can use only the first three digits of a zip code, not all five digits. Score cards check certain conditions, and if these conditions are met, points are added to an overall score. For example, a score card for a credit card fraud model might add 28 points if a $1 transaction occurs at a gas station. The higher the score, the more likely the credit card transaction is fraudulent. The best practice is to use both rules and scores. Rules ensure that business processes are being followed and predictive models ensure that historical data is being used most effectively.

Score cards are typically used for very basic systems which use just a few simple rules, or for historical reasons. For example, the credit scoring industry has used score cards for many years; these score cards, though, use statistical models to determine the conditions and corresponding scores.
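
As a rough sketch of how a business rule and a score card fit together, consider the following Python. Only the 28-point gas station condition and the three-digit zip code rule come from the text above; the second score card rule, the field names, and the sample transaction are invented for illustration.

```python
def apply_business_rules(txn):
    """Business rule: expose only the first three digits of the zip code."""
    cleaned = dict(txn)
    cleaned["zip"] = txn["zip"][:3]
    return cleaned


def score_transaction(txn, rules):
    """Score card: add the points of every rule whose condition is met."""
    return sum(points for condition, points in rules if condition(txn))


# (condition, points) pairs. The second rule is hypothetical.
RULES = [
    (lambda t: t["merchant"] == "gas station" and t["amount"] == 1.00, 28),
    (lambda t: t["amount"] > 1000.00, 15),
]

txn = {"merchant": "gas station", "amount": 1.00, "zip": "60607"}
cleaned = apply_business_rules(txn)
print(cleaned["zip"], score_transaction(cleaned, RULES))
```

Applying the business rule before scoring keeps the score card from ever seeing data it is not permitted to use.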

Q. What determines the accuracy of predictive models?

The accuracy of a predictive model is influenced most strongly by the quality of the data and the freshness of the model. Without good data, it is simply wishful thinking to expect a good model. Without updating the model frequently, the model's performance will decay over time.

Accuracy is measured in two basic ways. Models have false positive rates and false negative rates. For example, consider a model predicting credit card fraud. A false positive means that the model predicted fraud when no fraud was present. A false negative means that the model predicted that the transaction was legitimate when in fact it was fraudulent. In practice, false positive and false negative rates can be relatively high. The role of a good model is to improve a business process by a significant degree, not to make flawless predictions. Only journalists and pundits make flawless predictions.

Best practice uses separate, specialized software applications for building models (the model producer) and for scoring models (the model consumer). The Predictive Model Markup Language or PMML is the industry standard for describing a model in XML so that it can be moved easily between a model producer and a model consumer. Good accuracy requires fresh models on fresh data, which means updating the model consumer as frequently as the data demands.
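
The two error rates can be computed directly from a list of actual outcomes and a list of predictions. The Python sketch below assumes the usual definitions, which the text above does not spell out: the false positive rate is the share of legitimate transactions flagged as fraud, and the false negative rate is the share of fraudulent transactions that were missed. The function name and sample data are illustrative only.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate).

    actual and predicted are equal-length sequences of booleans,
    where True means "fraud".
    """
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if p and not a)
    false_neg = sum(1 for a, p in pairs if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    # Guard against data sets that contain only one class.
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Hypothetical outcomes for eight transactions.
actual = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, True, False, False, True, False]

fpr, fnr = error_rates(actual, predicted)
print(fpr, fnr)
```

In this made-up sample, one of five legitimate transactions is flagged and one of three fraudulent transactions is missed.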

Q. What are the major types of predictive models?

Although there are quite a large number of different types of predictive models, the majority of applications use one of the following types of models.

  1. Linear models. For many years, especially before the advent of personal computers, these were the most common types of models due to their simplicity. They divide data into two different cells using a line in two dimensions and a plane in higher dimensions. Quadratic models are similar but use a curve instead of a line to divide the data.
  2. Logistic models. Logistic models are used when the predicted variable is zero or one, for example predicting that a credit card transaction is fraudulent or not. Logistic models assume that one of the internal components of the model is linear. Computing the weights that characterize a logistic model is difficult by hand, but simple with a computer.
  3. Neural Networks. Neural networks are a type of nonlinear model broadly motivated ("inspired by" is the phrase Hollywood uses) by neurons in brains.
  4. Trees. Trees are a type of nonlinear model which uses a series of lines or planes to divide the data into different cells. Trees consist of a sequence of if...then rules. Because of this, it is easier to interpret trees than other types of nonlinear models such as neural networks.
  5. Hybrid Models. It is common to combine two or more of the four models above to produce a more powerful model.
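
To illustrate why trees (item 4) are easy to interpret, here is a hand-written tree expressed as nested if...then rules in Python. In practice the splits would be learned from data by tree-building software; the thresholds, field names, and labels below are invented for the example.

```python
def tree_label(txn):
    """A two-level tree: each path from the top to a return is one rule."""
    if txn["amount"] < 5.00:
        if txn["merchant"] == "gas station":
            return "high"  # small purchase at a gas station
        return "low"
    if txn["amount"] > 1000.00:
        return "medium"
    return "low"


# Hypothetical transactions.
print(tree_label({"merchant": "gas station", "amount": 1.00}))
print(tree_label({"merchant": "jeweler", "amount": 2500.00}))
print(tree_label({"merchant": "grocer", "amount": 42.00}))
```

Each test against a threshold corresponds to one line or plane dividing the data, and the rule that produced any given label can be read off directly.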

Q. What is the difference between a linear and nonlinear model?

A model can be thought of as a function, which takes inputs, performs a computation, and produces an output. The output is often a score, say from 1 to 1000, or a label, such as high, medium, or low. A very simple type of model, called a linear model, uses the n input features to split the space of features into two parts. This is done using an (n-1)-dimensional plane. For example, a space of 2 features can be split with a line, a space of 3 features with a plane, and so on. Most data is not so simple. Any model which is not linear is called a nonlinear model. Logistic models, tree-based models, and neural networks are common examples of nonlinear models.

Q. What are some of the differences between the various types of predictive models?
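
A linear model in this sense can be sketched as a weighted sum compared against zero. In the Python below, the weights, the bias, and the sample points are arbitrary values chosen for illustration; with two features, the boundary is the line x + y = 1.

```python
def linear_model(weights, bias):
    """Build a model that splits feature space with the plane w.x + b = 0."""

    def classify(features):
        total = bias + sum(w * x for w, x in zip(weights, features))
        # Points on or above the plane get label 1, the rest get label 0.
        return 1 if total >= 0 else 0

    return classify


# Hypothetical two-feature model whose boundary is the line x + y = 1.
classify = linear_model([1.0, 1.0], -1.0)

print(classify([0.2, 0.3]))  # below the line
print(classify([0.9, 0.8]))  # above the line
```

A nonlinear model replaces the single weighted sum with something richer, such as the nested tests of a tree, so the boundary between the two parts no longer has to be a single line or plane.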

First, there is no one best model. Different data requires different types of models. The accuracy of a model depends more on the quality of the data, how well it is prepared, and how fresh the model is than on the type of model used. On the other hand, there are some important differences between different types of models. Nonlinear models are generally more accurate than linear models. Linear models were more common in the past because they were easier to compute. Today this is no longer relevant given the proliferation of computers and good quality statistical and data mining software. Neural networks were very popular in the 80's and early 90's because they were quite successful for several different types of applications and because they had a cool name. Today, they are being replaced by tree-based methods, which are generally considered easier to build, easier to interpret, and more scalable.

Q. I hear the phrase "empirically derived and statistically valid" applied to models. What does that mean?

Decisions based upon models derived from data are usually expected to be empirically derived and statistically sound. That is, first, they must be derived from the data itself, and not the biases of the person building the model. Second, they must be based upon generally acceptable statistical procedures. For example, the arbitrary exclusion of data can result in models that are biased in some fashion.

Q. What are some of the major components in a data mining system?

Assume that the function of the data mining system is to assign scores to various profiles. For example, profiles may be maintained about companies and the scores used to indicate the likelihood that the company will go bankrupt. Alternatively, the profiles may be maintained for customer accounts and the scores indicate the likelihood that the account is being used fraudulently. A typical data mining system processes raw transactional data, consisting of what are called events, to produce the profiles. To continue the examples above, the events may consist of survey data about the companies, or purchases by the customer.

First, a data mart is used to store the event and profile data which is used to build the predictive models. For large data sets, the data mart must be designed for efficient statistics on columns rather than simple counting and summaries like a conventional data warehouse, or safe updating of rows, like a conventional database.

Second, a data mining system takes data from the data mart and applies statistical or data mining algorithms to produce a model. More precisely, the data mining system takes a learning set of profiles and produces a statistical model.

Third, an operational data store or operational database is used to store profiles. A profile is a statistical summary of the entity being modeled and typically contains dozens to hundreds of features. A relational database is generally used for the operational data store.

Fourth, the scoring software takes a model produced by the data mining system and a profile from the operational data store and produces one or more scores. These scores can either be used to produce reports or deployed into operational systems.

Fifth, the reports generated are generally made available through a reporting system.

For smaller applications, a database can be used for the data mart and operational data store, and the reports can be produced in HTML and made available through a web server.
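
The flow from events to profiles to scores described above can be sketched in Python as follows. Everything here is a simplified stand-in: the profile has three features rather than dozens, the function score_profile is a toy formula rather than a model produced by a data mining system, and the account names and amounts are made up.

```python
from collections import defaultdict


def build_profiles(events):
    """Fold raw (account, amount) events into one profile per account."""
    profiles = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for account, amount in events:
        profile = profiles[account]
        profile["count"] += 1
        profile["total"] += amount
        profile["max"] = max(profile["max"], amount)
    return dict(profiles)


def score_profile(profile):
    """Toy stand-in for a model: largest purchase relative to the mean."""
    mean = profile["total"] / profile["count"]
    return profile["max"] / mean if mean else 0.0


# Hypothetical purchase events.
events = [
    ("acct-1", 20.0),
    ("acct-1", 20.0),
    ("acct-1", 80.0),
    ("acct-2", 10.0),
]

profiles = build_profiles(events)  # what the operational data store would hold
scores = {acct: score_profile(p) for acct, p in profiles.items()}
print(scores)
```

In a full system, the dictionary of profiles would live in the operational data store and score_profile would be replaced by the scoring software applying a model built from the data mart.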

Q. Who is the author of this FAQ?

This FAQ is maintained by Robert L. Grossman.


Copyright Robert L. Grossman, 1999-2005, revised January 2, 2005.


This is from www.rgrossman.com