What is Data Mining and KDD

I am very interested in processes. I want to know good ways to do things, even the best way to do things if possible. Even if you don’t have skill or deep understanding, process can get you a long way. It can lead the way and skill and deep understanding can follow. At least, I have used this to drive much of my work.

I think it’s useful to study data mining as it is presented as a process for making discoveries from data. In this post you will explore authoritative definitions for “Data Mining” from textbooks and papers. As data mining is a process, the definition will include a number of interpretations of the process.

Gold Mine
Photo credit GSofV, some rights reserved

Authoritative Textbooks

In this section we will look at definitions of “data mining” from two authoritative textbooks in the field.

Data Mining: Practical Machine Learning Tools and Techniques

This is a textbook by Ian Witten and Eibe Frank.

From the preface, the authors comment:

“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…”

In chapter 1 of the book, the authors write:

“Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one. The data is invariably present in substantial quantities.”

I read this book early in my entry into the field, and this definition of data mining and its relationship with machine learning has stuck with me. When I apply machine learning methods, I apply a process that looks like the data mining process, except that I am not trying to discover patterns per se; rather, I am trying to find a “good enough” solution to a well-defined problem.

Data Mining: Concepts and Techniques

This is a textbook by Jiawei Han and Micheline Kamber.

In the preface the authors write:

“Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams.”

This expands KDD as “knowledge discovery from data”, which differs slightly from the expansion I believe is standard in the field: Knowledge Discovery in Databases.

In chapter 1, the authors summarize the KDD process (pages 7 and 8):

  1. Data cleaning to remove noise and inconsistent data.
  2. Data integration, where multiple data sources may be combined.
  3. Data selection, where data relevant to the analysis task are retrieved from the database.
  4. Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
  5. Data mining, which is an essential process where intelligent methods are applied to extract data patterns.
  6. Pattern evaluation to identify the truly interesting patterns representing knowledge based on interestingness measures.
  7. Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
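
As a rough illustration, the seven steps above can be sketched in plain Python on a toy problem. Everything here is an assumption made for the example rather than something taken from the book: the two "store" data sources, the function names, and the choice of counting item pairs with a minimum support of 0.5 as the mining and evaluation steps.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical raw sources of purchase records: (customer id, basket).
store_a = [("c1", ["bread", "milk"]), ("c2", ["bread", "milk", "eggs"]), ("c3", None)]
store_b = [("c4", ["Milk ", "bread"]), ("c5", ["eggs"]), ("c6", ["bread", "milk", "jam"])]


def clean(records):
    """Step 1: drop records with a missing basket and normalize item names."""
    return [(cid, [item.strip().lower() for item in basket])
            for cid, basket in records if basket]


def integrate(*sources):
    """Step 2: combine several data sources into one list of records."""
    return [record for source in sources for record in source]


def select(records, min_items=2):
    """Step 3: keep only baskets relevant to the task (at least two items)."""
    return [basket for _, basket in records if len(basket) >= min_items]


def transform(baskets):
    """Step 4: consolidate each basket into a sorted tuple of unique items."""
    return [tuple(sorted(set(basket))) for basket in baskets]


def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Step 6: keep pairs whose support (share of baskets) meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


def present(patterns):
    """Step 7: turn the surviving patterns into readable lines."""
    ranked = sorted(patterns.items(), key=lambda kv: -kv[1])
    return [f"{a} + {b}: support {support:.2f}" for (a, b), support in ranked]


baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), n_baskets=len(baskets))
report = present(patterns)
print("\n".join(report))
```

Only the call to `mine` is "data mining" in the narrow sense. The other six functions are the surrounding work that the list describes.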

In this book, the authors comment that data mining more commonly refers to the whole Knowledge Discovery from Data process, probably because it is a shorter term.

Authoritative Articles

In this section we will explore the process of Knowledge Discovery in Databases (KDD) in authoritative articles in the field. Both are articles in reputable technical magazines rather than peer-reviewed journal articles. Nevertheless, the less formal tone allows for a useful discussion of this high-level topic.

From Data Mining to Knowledge Discovery in Databases

This was an article in AI Magazine in 1996 by Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth.

They define KDD as Knowledge Discovery in Databases and this is a definition I am more familiar with:

“… the KDD field is concerned with the development of methods and techniques for making sense of data. … At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.”

and

“… KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data.”

The authors provide a useful summary of KDD in a figure, with entities in boxes and the processes that connect the boxes shown as transforms on those entities. This depiction is summarized below. I am reluctant to reproduce the image itself; formal publications can be difficult in this regard.

  • Step 1: Selection (data into target data)
  • Step 2: Preprocessing (target data into processed data)
  • Step 3: Transformation (processed data into transformed data)
  • Step 4: Data Mining (transformed data into patterns)
  • Step 5: Interpretation and/or Evaluation (patterns into knowledge)
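
One way to picture these five steps is as a chain of functions, each taking the output of the previous one. The sketch below is only illustrative: the study-hours records, the function names, and the use of a single threshold on hours (a one-split "decision stump") as the mined pattern are all assumptions for the example, not something from the article.

```python
# Hypothetical raw data: study hours and pass/fail results for two courses.
raw = [
    {"course": "ml", "hours": 1.0, "passed": False},
    {"course": "ml", "hours": 2.0, "passed": False},
    {"course": "ml", "hours": None, "passed": True},  # missing value
    {"course": "ml", "hours": 6.0, "passed": True},
    {"course": "ml", "hours": 8.0, "passed": True},
    {"course": "db", "hours": 3.0, "passed": True},   # not part of the target
]


def selection(data):
    """Step 1: data into target data (only the course being studied)."""
    return [row for row in data if row["course"] == "ml"]


def preprocessing(target):
    """Step 2: target data into processed data (drop rows with missing hours)."""
    return [row for row in target if row["hours"] is not None]


def transformation(processed):
    """Step 3: processed data into transformed data (sorted (hours, passed) pairs)."""
    return sorted((row["hours"], row["passed"]) for row in processed)


def data_mining(pairs):
    """Step 4: transformed data into a pattern.

    The pattern is the threshold on hours that best separates pass from fail,
    chosen from the midpoints between neighbouring values.
    """
    hours = [h for h, _ in pairs]
    candidates = [(a + b) / 2 for a, b in zip(hours, hours[1:])]

    def accuracy(threshold):
        return sum((h > threshold) == passed for h, passed in pairs) / len(pairs)

    best = max(candidates, key=accuracy)
    return best, accuracy(best)


def interpretation(pattern):
    """Step 5: pattern into knowledge (a statement a person can act on)."""
    threshold, acc = pattern
    return (f"In this sample, studying more than {threshold:.1f} hours "
            f"predicts a pass with {acc:.0%} accuracy.")


pattern = data_mining(transformation(preprocessing(selection(raw))))
knowledge = interpretation(pattern)
print(knowledge)
```

Writing each step as its own function makes the intermediate entities from the figure (target data, processed data, transformed data, patterns) easy to inspect one at a time.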

This process is simple and it is the model that I like to use when working on a problem.

The KDD Process for Extracting Useful Knowledge from Volumes of Data

This was an article in the Communications of the ACM in 1996 by Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth.

In this article, the authors give a more detailed summary of the KDD process. This more detailed version also appeared in the “From Data Mining…” article above, but I felt it was less clearly presented there. It is paraphrased below.

  1. Understand the application domain and the goal of the process
  2. Create target dataset as a subset of all the data that is available
  3. Data cleaning and preprocessing to remove noise and handle missing data and outliers
  4. Data reduction and projection in order to focus on the features that are relevant to the problem
  5. Match goals of process to a data mining method. Decide the purpose of the model such as summarization or classification.
  6. Choose the data mining algorithms to match the purpose of the model (from step 5)
  7. Data mining, i.e. run algorithms on data.
  8. Interpretation of mined patterns to make them understandable by the user, such as summarization and visualization.
  9. Acting on the discovered knowledge, such as reporting or making decisions.
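
Steps 5 to 7 can be read as "decide what kind of answer you want, pick an algorithm that produces it, then run it". A minimal sketch of that idea follows. The `METHODS` table, the two goals it covers, and the specific algorithms chosen (per-feature means for summarization, 1-nearest-neighbour for classification) are assumptions for illustration, not recommendations from the article.

```python
from statistics import mean

# Hypothetical labelled data: (feature vector, label).
rows = [
    ((1.0, 1.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.0), "high"),
]


def summarize(rows):
    """Summarization: describe the data compactly (mean of each feature)."""
    features = [x for x, _ in rows]
    return [mean(column) for column in zip(*features)]


def classify_1nn(rows, query):
    """Classification: return the label of the closest labelled example."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, label = min(rows, key=lambda row: squared_distance(row[0], query))
    return label


# Steps 5 and 6: map the goal of the process to an algorithm that serves it.
METHODS = {
    "summarization": summarize,
    "classification": classify_1nn,
}


def run(goal, rows, **kwargs):
    """Step 7: run the algorithm chosen for the stated goal."""
    if goal not in METHODS:
        raise ValueError(f"No method registered for goal: {goal!r}")
    return METHODS[goal](rows, **kwargs)


print(run("summarization", rows))
print(run("classification", rows, query=(7.5, 8.0)))
```

The same data gives very different outputs depending on the goal, which is why the goal is fixed before an algorithm is chosen.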

I like the detail in this process. It really spells out the need to understand the objectives of the process and to ensure that the algorithm selected matches those objectives.

Summary

In this post you learned that data mining is the discovery of patterns from data. You learned that it is a process that is comprised of a number of steps that cover data preparation, the running of algorithms and the presentation of results.

You learned that machine learning provides the tools used in data mining. You also learned that data mining is really one step in the process of Knowledge Discovery in Databases (KDD), and that it has come to be used as a synonym for KDD because it is easier to say.

You learned that when you are working on a machine learning project, you are likely performing some form of the KDD process with the specific objective of solving a problem rather than making a discovery.

Resources

If you would like to dive deeper, you can read more in the resources used in the research for this post below.

How do you understand data mining and how machine learning fits in? Leave a comment and share your experiences.

12 Responses to What is Data Mining and KDD

  1. Solange February 20, 2016 at 2:48 am #

    As a PhD scientist with food technology background, I now need to learn alone how to classify multimodal complex signals into texture quality groups. I thank you for making the terms I look for very clear so that I will avoid making errors. I used to look for data mining but KDD is rather what I am doing.

    But I wonder if I should use data mining as you tell us that many people use it for KDD because of practicality. On my poster, “knowledge discovery” seems clear but strangely formulated, and data mining is not easy to understand for everyone. But I think we should use the correct definitions to avoid confusion.

  2. Jayesh September 26, 2016 at 5:57 pm #

    Thanks for clear definitions. This really helped me to understand the process and flow to be followed.

    • Jason Brownlee September 27, 2016 at 7:40 am #

      I’m glad to hear it Jayesh.

  3. Basant kumar March 11, 2017 at 1:54 am #

    I like your mining concept and process very much. Thank you to help us.

  4. Arturo June 18, 2017 at 6:10 pm #

    Totally agreed with the definition and clarification.

    Try not to use the concepts interchangeably, KDD is not the same as Data Mining.

  5. Ankit Giri January 23, 2018 at 2:53 pm #

    Thank you sir for brief definitions

  6. Jesús Martínez February 14, 2018 at 11:43 am #

    Thank you very much for shedding more light on KDD and data mining. Inadvertently, the whole data cleaning, preprocessing and summarization I use resembles KDD very closely. It is good to know that it is part of another more formal, well-defined process.
