Data Leakage in Machine Learning

Data leakage is a big problem in machine learning when developing predictive models.

Data leakage is when information from outside the training dataset is used to create the model.

In this post you will discover the problem of data leakage in predictive modeling.

After reading this post you will know:

  • What data leakage is in predictive modeling.
  • Signs of data leakage and why it is a problem.
  • Tips and tricks that you can use to minimize data leakage on your predictive modeling problems.

Let’s get started.

Photo by DaveBleasdale, some rights reserved.

Goal of Predictive Modeling

The goal of predictive modeling is to develop a model that makes accurate predictions on new data, unseen during training.

This is a hard problem.

It’s hard because we cannot evaluate the model on data we do not yet have.

Therefore, we must estimate the performance of the model on unseen data by training it on only some of the data we have and evaluating it on the rest of the data.

This is the principle that underlies cross validation and more sophisticated techniques that try to reduce the variance in this estimate.

What is Data Leakage in Machine Learning?

Data leakage can cause you to create overly optimistic if not completely invalid predictive models.

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the model being constructed.

any other feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction, is a feature that can introduce leakage to your model

— Data Skeptic

when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict

— Daniel Gutierrez, Ask a Data Scientist: Data Leakage

There is a related topic in computer security called data leakage (and data loss prevention), but that is not what we are talking about here.

Data Leakage is a Problem

It is a serious problem for at least 3 reasons:

  1. It is a problem if you are running a machine learning competition. Top models will exploit the leaky data rather than be good general models of the underlying problem.
  2. It is a problem when you are a company providing your data. Reversing anonymization and obfuscation can result in a privacy breach that you did not expect.
  3. It is a problem when you are developing your own predictive models. You may be creating overly optimistic models that are practically useless and cannot be used in production.

As machine learning practitioners, we are primarily concerned with this last case.

Do I have Data Leakage?

An easy way to know you have data leakage is if you are achieving performance that seems a little too good to be true.

For example, performance that suggests you can predict lottery numbers or pick stocks with high accuracy.

“too good to be true” performance is “a dead giveaway” of its existence

— Chapter 13, Doing Data Science: Straight Talk from the Frontline

Data leakage is generally more of a problem with complex datasets, for example:

  • Time series datasets, where creating training and test sets can be difficult.
  • Graph problems where random sampling methods can be difficult to construct.
  • Analog observations like sound and images where samples are stored in separate files that have a size and a time stamp.

Techniques To Minimize Data Leakage When Building Models

Two good techniques that you can use to minimize data leakage when developing predictive models are as follows:

  1. Perform data preparation within your cross validation folds.
  2. Hold back a validation dataset for final sanity check of your developed models.

Generally, it is good practice to use both of these techniques.

1. Perform Data Preparation Within Cross Validation Folds

You can easily leak information when preparing your data for machine learning.

The effect is overfitting your training data and having an overly optimistic evaluation of your model’s performance on unseen data.

For example, if you normalize or standardize your entire dataset, then estimate the performance of your model using cross validation, you have committed the sin of data leakage.

The data rescaling process that you performed had knowledge of the full distribution of data in the training dataset when calculating the scaling factors (like min and max or mean and standard deviation). This knowledge was stamped into the rescaled values and exploited by all algorithms in your cross validation test harness.

A non-leaky evaluation of machine learning algorithms in this situation would calculate the parameters for rescaling data within each fold of the cross validation and use those parameters to prepare the data on the held out test fold on each cycle.
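For example, here is a minimal sketch of the non-leaky procedure in Python with scikit-learn (the synthetic dataset, logistic regression model and fold count are illustrative choices, not taken from this post):

```python
# A minimal sketch of non-leaky scaling: the scaler is fit on the training
# portion of each fold only, then applied to the held-out fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=7)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=7).split(X):
    scaler = StandardScaler().fit(X[train_idx])            # scaling parameters from the training fold only
    model = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    predictions = model.predict(scaler.transform(X[test_idx]))  # the held-out fold reuses those parameters
    scores.append(accuracy_score(y[test_idx], predictions))

print("estimated accuracy: %.3f" % np.mean(scores))
```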

The reality is that as a data scientist, you’re at risk of producing a data leakage situation any time you prepare, clean your data, impute missing values, remove outliers, etc. You might be distorting the data in the process of preparing it to the point that you’ll build a model that works well on your “clean” dataset, but will totally suck when applied in the real-world situation where you actually want to apply it.

— Page 313, Doing Data Science: Straight Talk from the Frontline

More generally, non-leaky data preparation must happen within each fold of your cross validation cycle.

You may be able to relax this constraint for some problems, for example if you can confidently estimate the distribution of your data because you have some other domain knowledge.

In general though, it is a good idea to re-prepare or re-calculate any required data preparation within your cross validation folds including tasks like feature selection, outlier removal, encoding, feature scaling and projection methods for dimensionality reduction, and more.

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.

— Dikran Marsupial, in answer to the question “Feature selection and cross-validation” on Cross Validated.

Platforms like R and Python help automate this good practice, with the caret package in R and Pipelines in scikit-learn.
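For example, a rough sketch with a scikit-learn Pipeline (the specific preparation steps and model are illustrative assumptions), where cross validation re-fits every step on the training portion of each fold:

```python
# A minimal sketch using a scikit-learn Pipeline: cross_val_score re-fits
# every preparation step (scaling, feature selection) on the training
# portion of each fold before evaluating on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

pipeline = Pipeline([
    ("scale", StandardScaler()),               # fit on the training fold only
    ("select", SelectKBest(f_classif, k=10)),  # feature selection repeated within each fold
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("estimated accuracy: %.3f" % scores.mean())
```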

2. Hold Back a Validation Dataset

Another, perhaps simpler approach is to split your training dataset into train and validation sets, and store away the validation dataset.

Once you have completed your modeling process and actually created your final model, evaluate it on the validation dataset.

This gives you a sanity check to see whether your estimate of performance has been overly optimistic because of leakage.

Essentially the only way to really solve this problem is to retain an independent test set and keep it held out until the study is complete and use it for final validation.

— Dikran Marsupial in answer to the question “How can I help ensure testing data does not leak into training data?” on Cross Validated
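Here is a rough sketch of that workflow in scikit-learn (the split size, preparation steps and model are arbitrary illustrations):

```python
# A minimal sketch of holding back a validation dataset for a final sanity check.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)

# Split off a validation set up front and do not touch it during model development.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)

pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])

# Develop and tune using cross validation on the training data only.
cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=5).mean()

# Final sanity check: fit on all the training data, evaluate once on the held-out set.
pipeline.fit(X_train, y_train)
holdout_accuracy = pipeline.score(X_val, y_val)

print("cross validation: %.3f, holdout: %.3f" % (cv_accuracy, holdout_accuracy))
```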

5 Tips to Combat Data Leakage

  • Temporal Cutoff. Remove all data just prior to the event of interest, focusing on the time you learned about a fact or observation rather than the time the observation occurred.
  • Add Noise. Add random noise to input data to try and smooth out the effects of possibly leaking variables.
  • Remove Leaky Variables. Evaluate simple rule-based models like OneR using variables such as account numbers and IDs to see whether these variables are leaky, and if so, remove them (see the sketch after this list). More generally, if you suspect a variable is leaky, consider removing it.
  • Use Pipelines. Heavily use pipeline architectures that allow a sequence of data preparation steps to be performed within cross validation folds, such as the caret package in R and Pipelines in scikit-learn.
  • Use a Holdout Dataset. Hold back an unseen validation dataset as a final sanity check of your model before you use it.
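As a rough sketch of the leaky-variable check above (OneR itself is not available in scikit-learn, so a depth-1 decision tree stands in for a single-rule model, and the ID column and its relationship to the target are contrived for illustration):

```python
# A minimal sketch of checking a suspect variable for leakage. A depth-1
# decision tree stands in for a OneR-style single-rule model; the ID-like
# column here is a contrived example where rows were sorted by outcome
# before IDs were assigned, so the ID alone predicts the target.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

y = np.array([0] * 250 + [1] * 250)        # imagine the file was sorted by outcome
record_id = np.arange(500).reshape(-1, 1)  # then sequential record IDs were assigned

stump = DecisionTreeClassifier(max_depth=1)
score = cross_val_score(stump, record_id, y, cv=5).mean()
print("accuracy from the suspect ID alone: %.3f" % score)
# A score far above the 0.5 base rate flags the variable as leaky; remove it.
```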

Further Reading on Data Leakage

There is not a lot of material on data leakage, but those few precious papers and blog posts that do exist are gold.

Below are some of the better resources that you can use to learn more about data leakage in applied machine learning.

Summary

In this post you discovered data leakage in machine learning when developing predictive models.

You learned:

  • What data leakage is when developing predictive models.
  • Why data leakage is a problem and how to detect it.
  • Techniques that you can use to overcome data leakage when developing predictive models.

Do you have any questions about data leakage or about this post? Ask your questions in the comments and I will do my best to answer.

41 Responses to Data Leakage in Machine Learning

  1. Calvin August 8, 2016 at 11:42 pm #

    So if I split for instance 60% of data for training, and keep the other 40% apart for test, scaling both after splitting, I’m avoiding data leakage, right?

    • Jason Brownlee August 15, 2016 at 9:39 am #

      Hi Calvin, the best practice would be to learn how to scale from the training dataset and use that knowledge to then scale the test dataset.
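      A minimal sketch of that practice with scikit-learn, keeping Calvin’s 60/40 split (the dataset is a synthetic stand-in):

```python
# A minimal sketch: learn the scaling from the 60% training split only,
# then apply that same scaling to the 40% test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=7)

scaler = MinMaxScaler().fit(X_train)       # scaling parameters come from the training split only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # the test split is scaled with the training parameters
```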

  2. sandeep September 2, 2016 at 3:20 am #

    Hello Jason
    great article.

    typically, once I have done all the munging, feature creation, pca, removing outliers, binary encoding, I split the data into 3 sets (85% train, 10% val, 5% test).
    then I do all the cross validation, grid/random search etc. on the train set, possibly make some tweaks to the features (e.g. add interactions or power terms of the top 2/3 predictors) and verify the results on the validation set. This is done across different modeling techniques.
    and once I have found the best model, I gauge its performance ONLY ONCE on the test set.

    few questions.

    1. I haven’t seen it being done much but does one store the imputed values of the training set somewhere and impute the unknown test set with those. The unknown test set would for e.g. be the one on which your model performance will be gauged in a kaggle competition.

    how would this work, when I use some R packages for imputing data (e.g. amelia etc) which use more sophisticated approaches to imputation than vanilla mean/ mode / median.
    Should one run the package on the unknown set too?

    I haven’t found much difference in results in the models I’ve built when I imputed the unknown test set with training data vs imputing the unknown set with its own mean / median etc.

    2. similar to the imputation, what is your advice for outlier methods? perhaps some univariate or multi-variate methods were used to detect outliers in the training data set and then (let’s say) the outlier values were replaced with mean /median etc.
    How does one carry out the same steps and mitigate leakage into the unknown test set?

    3. Would it be possible to see an example of how one can perform outlier removal, encoding, feature scaling, imputation with in each cross validation fold?
    I know caret allows you to perform PCA, for example, when training the model (but then how would I know how many principal components it selects) or more recently, using H2O, one is not recommended to create binary features as H2O takes care of it automatically.

    • Jason Brownlee September 2, 2016 at 8:15 am #

      Thanks sandeep.

      Nice approach, although I would suggest changing your percentages based on the specifics of the problem.

      Any models used to transform training data (like imputing) must be stored, to then be used on any new data in the future like tests and validation sets, and even new data. It is now a part of the model.

      Outliers are tricky. Store the outlier model as above. You often want to filter data in the same way after the model is built and perhaps return an “i don’t know” prediction for data out of bounds of the training dataset. I talk about outliers a little here:
      http://machinelearningmastery.com/how-to-identify-outliers-in-your-data/

      I don’t have a worked example with outlier removal, I don’t think. Sorry.

  3. Seo young jae April 16, 2017 at 2:55 am #

    Nice information!

    But I’m confused.

    In that post by Dikran Marsupial, in answer to the question “Feature selection and cross-validation”, does feature selection mean variable selection?

    And is data leakage not a problem in R, because there is a function that solves this problem (in the train() function, we can use the “CV” or “repeatedCV” method)? Is that right?

    Thank you!

    • Jason Brownlee April 16, 2017 at 9:30 am #

      Data leakage can be a problem in R or any platform if we use information from the test set when fitting the model.

      This can be as simple as including test data when scaling training data.

      You are right, tools like caret make this much less of a risk, if the tools are used correctly (e.g. scaling is performed within each cross validation fold).

      • Crazy July 5, 2018 at 3:12 pm #

        Suppose we divide the data into 5 folds, scale the first 4 and predict the 5th. After this, we use another 4 and scale them. What about the first fold, which got scaled along with the previous group? How do we unscale that? Also, do we have to scale the test fold too?

        • Jason Brownlee July 6, 2018 at 6:38 am #

          Yes, any transforms performed are discarded or reversed.

  4. blurlearner June 7, 2017 at 1:32 am #

    Your site has been really helpful, though I’m still kinda blurry on why preparing data using normalization or standardization on the entire training dataset (maybe 70% of all data) before learning can cause leakage. Why does knowing the distribution of the majority of the data cause an issue?

    • Jason Brownlee June 7, 2017 at 7:25 am #

      To scale data you need coefficients (min/max or avg/stdev).

      These must be estimated from the data (unless you know the exact values from the domain).

      If they are estimated from the entire dataset then leakage occurs because knowledge from the test set was used to scale the training set.

      • Helder March 21, 2018 at 8:50 am #

        So, Jason, first of all: thank you for your great blog!

        Back to the question: so what do you suggest? Perform the normalization (min/max) after splitting (train_test_split) the data set?

        thank you.

        • Jason Brownlee March 21, 2018 at 3:07 pm #

          Yes, after the split; estimate the coefficients from the training data, then apply them to train, test, other…

          • Guille May 18, 2018 at 8:51 pm #

            Hi! For example, if I’m using the mean of the training set for NaN imputation, must this mean be the parameter to apply to the test set, or do I need to recalculate the mean inside the test set?

            Ex.

            Training set:

            10
            Nan
            5
            10
            5

            Nan – >5(mean)

            Test set:

            1
            1
            2
            Nan
            2

            Nan – >10 from the training set or 1.5 (mean of the test set)???

          • Jason Brownlee May 19, 2018 at 7:38 am #

            The mean would be estimated from the training dataset only.
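            A minimal sketch of this with scikit-learn, using the numbers from the example above (the mean of the observed training values 10, 5, 10, 5 works out to 7.5):

```python
# A minimal sketch: the imputation value is learned from the training set
# only and re-used, unchanged, on the test set.
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[10.0], [np.nan], [5.0], [10.0], [5.0]])
test = np.array([[1.0], [1.0], [2.0], [np.nan], [2.0]])

imputer = SimpleImputer(strategy="mean").fit(train)  # training mean of the observed values = 7.5
print(imputer.transform(test))  # the NaN in the test set becomes 7.5, not the test-set mean of 1.5
```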

  5. RK October 16, 2017 at 11:34 pm #

    Excellent blog on machine learning with the detailed information. Appreciate your efforts Jason.

  6. Og Gue October 24, 2017 at 4:56 pm #

    Thank you for creating such an incredibly useful blog that I find I come back to often for reference! 1) What is the general methodology to engineer features by target encoding the mean of the target variable, as grouped by feature X, and measured within the out-of-fold CV sets? I’ve seen this before as an important engineered feature. 2) Isn’t this very prone to data leakage? If so, how?

    Thank you again!

    • Jason Brownlee October 25, 2017 at 6:39 am #

      As long as the mean or other stats were based on training data alone, it would not be leakage.
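      A rough sketch of non-leaky, out-of-fold target encoding with pandas and scikit-learn (the column names and data are hypothetical illustrations):

```python
# A minimal sketch of out-of-fold target encoding: per-category target means
# are computed on the training rows of each fold only, then mapped onto the
# held-out rows. The "city" and "target" column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.RandomState(7)
df = pd.DataFrame({
    "city": rng.choice(["a", "b", "c"], size=100),
    "target": rng.randint(0, 2, size=100),
})

df["city_encoded"] = np.nan
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=7).split(df):
    means = df.iloc[train_idx].groupby("city")["target"].mean()  # statistics from the training rows only
    # categories unseen in the training rows would map to NaN and need a fallback in practice
    df.loc[df.index[val_idx], "city_encoded"] = df.iloc[val_idx]["city"].map(means).values
```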

  7. Aniket Saxena October 25, 2017 at 11:09 pm #

    Hi Jason,

    What is a validation dataset and how can I prepare it to hold back for a sanity check of the model being constructed?
    Do I need to further break up the training set to make a validation dataset? Correct me if I am wrong.

  8. Mohammad Ehtasham Billah February 8, 2018 at 12:19 pm #

    hi,
    So my understanding is that the data preprocessing should be done at each fold during the cross-validation, and we can do that with the caret package. Can caret deal with multicollinearity within each fold as well?

    Thanks.

    • Jason Brownlee February 9, 2018 at 8:58 am #

      Yes, data prep must be performed for each fold.

      Caret may support multicollinearity, I’m not sure off hand.

      • Mohammad Ehtasham Billah February 12, 2018 at 5:52 am #

        Hi,
        So, if we perform the data preprocessing for each fold (like Box-Cox, centering) during cross-validation, we don’t need to do the same preprocessing at the initial stage, like during exploratory data analysis?
        The reason I am asking is that most of the time I see people preprocessing data (e.g. missing value imputation, outlier removal) during the exploratory data analysis. Just wanted to have a crystal clear idea!!
        Thanks

        • Jason Brownlee February 12, 2018 at 8:34 am #

          Ideally we would do this within the CV fold to avoid leakage. Sometimes, it does not impact the results very much.

  9. Mohammad Ehtasham Billah February 11, 2018 at 7:54 am #

    Hi,
    When we are adding new attributes, removing attributes or creating dummy variables before fitting models, can these actions cause data leakage as well?

    • Jason Brownlee February 11, 2018 at 7:58 am #

      It can if the procedure uses information outside of the data sample.

  10. Tudor Lapusan March 11, 2018 at 11:26 pm #

    Hi Jason, indeed cool article.

    I cannot understand this phrase in the context of data leakage “if any other feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction, is a feature that can introduce leakage to your model”.

    What I understand from the above is that we have a feature whose values cannot be calculated during the production stage. If so, it means that we made a mistake during the feature selection step. To solve it, we need to remove that feature and train our model again. Does this mistake have something to do with data leakage?

    If you can point me to a concrete example, that would be helpful 😉

    Thanks

    • Jason Brownlee March 12, 2018 at 6:32 am #

      It means if you are using information about data you don’t have, e.g. data from the future or data out of sample, then you have data leakage.

      Does that help?

      • Tudor-Lucian Lapusan March 14, 2018 at 10:14 pm #

        I feel I’m on the right way to understand it 🙂

  11. Josiah Yoder August 1, 2018 at 2:53 am #

    The link “Leakage in Data Mining: Formulation, Detection, and Avoidance [pdf], 2011. (recommended!)” to http://dstillery.com/wp-content/uploads/2014/05/Leakage-in-Data-Mining-Formulation-Detection-and-Avoidance.pdf has died.

  12. Josiah Yoder August 2, 2018 at 1:06 am #

    Hi Jason,

    Thank you for making these articles free online. They have been very helpful as I learn Keras this summer.

    I feel that this article would benefit from some practical — if contrived — examples of leakage. I think it covers the theory pretty well, but examples showing leakage actually occurring and how to fix it would really drive the point home.

    As I search for “example” in the comments, I suppose I’m repeating what Sandeep and Tudor said.

  13. Syed Zeeshan October 25, 2018 at 12:18 pm #

    Sir, can you provide code for data leakage prediction?

    • Jason Brownlee October 25, 2018 at 2:01 pm #

      What do you mean exactly? Do you have an example?

  14. Milley November 21, 2018 at 5:42 am #

    Thanks for your wonderful site and article. I am a bit confused on one part: if feature selection happens in the folds, then what is the final model that is chosen? I had thought that repeated k-folds, for example, would average out the models to give a final accuracy. (The repeated k-folds giving some idea of the uncertainty around the predicted accuracy.)

    But if there are different features used for each fold then what is the suggested final set of features?

    thanks!

    • Jason Brownlee November 21, 2018 at 7:54 am #

      Good question, we train a final model on all available data.

      This post explains in more detail:
      https://machinelearningmastery.com/train-final-machine-learning-model/

      • Joey December 12, 2018 at 8:14 am #

        If I’m understanding the question correctly, Milley is asking: if feature selection is applied to each fold, then you might have a different set of features in each fold. So how do you choose?

        Do you select the final set of features based on the features selected in the fold with the best performance?

        I didn’t see any reference to feature selection in the linked article.

        Please keep up the good work! 🙂

        • Jason Brownlee December 12, 2018 at 2:12 pm #

          I see, excellent question.

          There’s no best way. One approach would be to use the average or superset of top features selected across all folds.

          Or, use a holdout validation set prior to training.
