I am very interested in processes. I want to know good ways to do things, even the best way to do things if possible. Even if you don’t have skill or deep understanding, process can get you a long way. It can lead the way and skill and deep understanding can follow. At least, I have used this to drive much of my work.
I think it’s useful to study data mining as it is presented as a process for making discoveries from data. In this post you will explore authoritative definitions for “Data Mining” from textbooks and papers. As data mining is a process, the definition will include a number of interpretations of the process.
In this section we will look at definitions of “data mining” from two authoritative textbooks in the field.
Data Mining: Practical Machine Learning Tools and Techniques
This is a textbook by Ian Witten and Eibe Frank.
From the preface, the authors comment:
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…”
In chapter 1 of the book, the authors write:
“Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one. The data is invariably present in substantial quantities.”
I read this book early in my entry into the field and this definition of data mining and its relationship with machine learning has stuck with me. When I apply machine learning methods, I apply a process that looks like the data mining process, except I am not trying to discover patterns per se, rather I am trying to find a “good enough” solution to a well defined problem.
Data Mining: Concepts and Techniques
This is a textbook by Jiawei Han and Micheline Kamber.
In the preface the authors write:
“Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams.”
This expands KDD slightly differently than what I believe is standard in the field. I believe the preferred expansion of KDD is Knowledge Discovery in Databases.
In chapter 1, the authors summarize the KDD process (pages 7 and 8):
- Data cleaning to remove noise and inconsistent data.
- Data integration, where multiple data sources may be combined.
- Data selection, where data relevant to the analysis task are retrieved from the database.
- Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
- Data mining, which is an essential process where intelligent methods are applied to extract data patterns.
- Pattern evaluation to identify the truly interesting patterns representing knowledge based on interestingness measures.
- Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
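To make the pattern evaluation step a little more concrete, here is a minimal sketch of scoring a candidate pattern against a measure of how interesting it is. I have assumed support and confidence, two commonly used measures for association rules; the transactions, the thresholds and the function names are made up for illustration and are not taken from the book.

```python
# Hypothetical market-basket transactions (assumed for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    seen = count(antecedent, transactions)
    if seen == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / seen


def is_interesting(antecedent, consequent, transactions,
                   min_support=0.4, min_confidence=0.7):
    """Keep a rule only if it clears both (assumed) thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)
```

With this toy data, the rule bread -> milk passes both thresholds, while bread -> butter has enough support but too little confidence and is filtered out. Real projects would pick measures and thresholds that suit the problem.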
In this book, the authors comment that data mining more commonly refers to the whole Knowledge Discovery from Data process, probably because it is a shorter term.
In this section we will explore the process of Knowledge Discovery in Databases (KDD) in authoritative articles in the field. These are both articles in reputable technical magazines rather than peer-reviewed journal articles. Nevertheless, the less formal tone allows for a useful discussion of this high-level topic.
From Data Mining to Knowledge Discovery in Databases
This was an article in AI Magazine in 1996 by Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth.
They define KDD as Knowledge Discovery in Databases and this is a definition I am more familiar with:
“… the KDD field is concerned with the development of methods and techniques for making sense of data. … At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.”
“… KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data.”
The authors provide a useful summary of KDD in a picture with entities in boxes and processes that connect the boxes as transforms on entities. This depiction is summarized below. I am reluctant to reproduce the image; formal publications can be difficult in this regard.
- Step 1: Selection (data into target data)
- Step 2: Preprocessing (target data into processed data)
- Step 3: Transformation (processed data into transformed data)
- Step 4: Data Mining (transformed data into patterns)
- Step 5: Interpretation and/or Evaluation (patterns into knowledge)
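The five steps above can be sketched as a chain of small functions, each turning one entity into the next. This is only a toy illustration: the records, the field names and the choice of "mining" method (a per-group average) are my own assumptions and not something from the article.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "age": 25, "income": 30000, "bought": 0},
    {"id": 2, "age": 47, "income": 82000, "bought": 1},
    {"id": 3, "age": None, "income": 51000, "bought": 0},
    {"id": 4, "age": 52, "income": 90000, "bought": 1},
    {"id": 5, "age": 31, "income": 40000, "bought": 0},
]


def select(data, fields):
    """Step 1: data -> target data (keep only the relevant fields)."""
    return [{f: row[f] for f in fields} for row in data]


def preprocess(target):
    """Step 2: target data -> processed data (drop rows with missing values)."""
    return [row for row in target if all(v is not None for v in row.values())]


def transform(processed, field):
    """Step 3: processed data -> transformed data (min-max scale one field)."""
    values = [row[field] for row in processed]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [{**row, field: (row[field] - lo) / span} for row in processed]


def mine(transformed, field, label):
    """Step 4: transformed data -> patterns (mean of field per label value)."""
    groups = {}
    for row in transformed:
        groups.setdefault(row[label], []).append(row[field])
    return {k: mean(v) for k, v in groups.items()}


def interpret(patterns, field, label):
    """Step 5: patterns -> knowledge (a human-readable statement)."""
    return [f"{label}={k}: average scaled {field} is {v:.2f}"
            for k, v in sorted(patterns.items())]


def run_kdd(data):
    """Chain the five steps on the hypothetical records."""
    target = select(data, ["age", "income", "bought"])
    processed = preprocess(target)
    transformed = transform(processed, "income")
    patterns = mine(transformed, "income", "bought")
    return interpret(patterns, "income", "bought")
```

Calling `run_kdd(RAW)` returns two lines, "bought=0: average scaled income is 0.08" and "bought=1: average scaled income is 0.93". The point is the shape of the process rather than the result: each function has one job and hands its output to the next step.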
This process is simple and it is the model that I like to use when working on a problem.
The KDD Process for Extracting Useful Knowledge from Volumes of Data
This was an article in the Communications of the ACM in 1996 by Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth.
In this article, the authors give a more detailed summary of the KDD process. This more detailed version also appeared in the “From Data Mining…” article above, but I felt it was less clearly presented there. It is paraphrased below.
- Understand the application domain and the goal of the process
- Create target dataset as a subset of all the data that is available
- Data cleaning and preprocessing to remove noise and handle missing data and outliers
- Data reduction and projection in order to focus on the features that are relevant to the problem
- Match goals of process to a data mining method. Decide the purpose of the model such as summarization or classification.
- Choose the data mining algorithms to match the purpose of the model (from the previous step)
- Data mining, i.e. run algorithms on data.
- Interpretation of mined patterns to make them understandable by the user, such as summarization and visualization.
- Acting on the discovered knowledge, such as reporting or making decisions.
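As a small illustration of the cleaning and preprocessing step in the list above, here is a sketch that fills in missing values and then removes outliers. The specific techniques (median imputation and Tukey's 1.5 × IQR rule) and the sample readings are my own assumptions; the article does not prescribe any particular method.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # requires Python 3.8+
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]


# Hypothetical sensor readings with one gap and one implausible spike.
readings = [10, 12, None, 11, 13, 500, 12]
cleaned = drop_outliers(impute_missing(readings))
```

Here the gap is filled with the median of the observed readings and the spike of 500 is dropped. Whether to impute or discard, and how aggressively to trim, depends on the goal you identified in the first step of the process.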
I like the detail in this process. It really spells out the need to understand the objectives of the process and to ensure that the algorithm selected matches those objectives.
In this post you learned that data mining is the discovery of patterns from data. You learned that it is a process that is comprised of a number of steps that cover data preparation, the running of algorithms and the presentation of results.
You learned that machine learning provides the tools used in data mining, and that data mining is really one step in the process of Knowledge Discovery in Databases (KDD). The term has come to be used for the whole process because it is easier to say.
You learned that when you are working on a machine learning project, you are likely performing some form of the KDD process with the specific objective of solving a problem rather than making a discovery.
If you would like to dive deeper you can read more into the resources used in the research for this post below.
- Data Mining: Practical Machine Learning Tools and Techniques
- Data Mining: Concepts and Techniques
- From Data Mining to Knowledge Discovery in Databases, 1996
- The KDD Process for Extracting Useful Knowledge from Volumes of Data, 1996
How do you understand data mining and how machine learning fits in?
Leave a comment and share your experiences.