How to Save and Reuse Data Preparation Objects in Scikit-Learn

It is critical that any data preparation performed on a training dataset is also performed on a new dataset in the future.

This may include a test dataset when evaluating a model or new data from the domain when using a model to make predictions.

Typically, the model fit on the training dataset is saved for later use. The correct solution to preparing new data for the model in the future is to also save any data preparation objects, like data scaling methods, to file along with the model.

In this tutorial, you will discover how to save a model and data preparation object to file for later use.

After completing this tutorial, you will know:

  • The challenge of correctly preparing test data and new data for a machine learning model.
  • The solution of saving the model and data preparation objects to file for later use.
  • How to save and later load and use a machine learning model and data preparation object on new data.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Jan/2020: Updated for changes in scikit-learn v0.22 API.
  • Update May/2020: Improved code examples and printed output.

How to Save and Load Models and Data Preparation in Scikit-Learn for Later Use
Photo by Dennis Jarvis, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. The Challenge of Preparing New Data for a Model
  2. Save Data Preparation Objects
  3. How to Save and Later Use a Data Preparation Object

The Challenge of Preparing New Data for a Model

Each input variable in a dataset may have different units.

For example, one variable may be in inches, another in miles, another in days, and so on.

As such, it is often important to scale data prior to fitting a model.

This is particularly important for models that use a weighted sum of the inputs, like logistic regression and neural networks, or distance measures, like k-nearest neighbors. This is because variables with larger values or ranges may dominate or wash out the effects of variables with smaller values or ranges.

Scaling techniques, such as normalization or standardization, have the effect of transforming the distribution of each input variable to be the same, such as the same minimum and maximum in the case of normalization or the same mean and standard deviation in the case of standardization.
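
For reference, normalization rescales each value as x' = (x - min) / (max - min), while standardization rescales each value as x' = (x - mean) / standard deviation.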

A scaling technique must be fit, which just means it needs to calculate coefficients from data, such as the observed min and max, or the observed mean and standard deviation. These values can also be set by domain experts.

The best practice when using scaling techniques for evaluating models is to fit them on the training dataset, then apply them to the training and test datasets.

Or, when working with a final model, to fit the scaling method on the training dataset and apply the transform to the training dataset and any new dataset in the future.

It is critical that any data preparation or transformation applied to the training dataset is also applied to the test or other dataset in the future.

This is straightforward when all of the data and the model are in memory.

This is challenging when a model is saved and used later.

What is the best practice to scale data when saving a fit model for later use, such as a final model?

Save Data Preparation Objects

The solution is to save the data preparation object to file along with the model.

For example, it is common to use the pickle framework (built into Python) for saving machine learning models for later use, such as saving a final model.

This same framework can be used to save the object that was used for data preparation.

Later, the model and the data preparation object can be loaded and used.

It is convenient to save entire objects to file, such as the model object and the data preparation object. Nevertheless, experts may prefer to save just the model parameters to file, then load them later and set them into a new model object. This approach can also be used with the coefficients used for scaling the data, such as the min and max values for each variable, or the mean and standard deviation for each variable.
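
For example, a minimal sketch of this coefficients-only approach for a MinMaxScaler might look like the following; the training data here is an illustrative placeholder.

# sketch: save only the scaler coefficients, then rebuild an equivalent scaler later
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# illustrative training data: two variables on different scales
X_train = asarray([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
# fit the scaler on the training data as usual
scaler = MinMaxScaler()
scaler.fit(X_train)
# extract just the per-variable coefficients: the observed min and max
coefficients = {'min': scaler.data_min_.tolist(), 'max': scaler.data_max_.tolist()}
# ... the coefficients could now be stored in a config file or database ...
# later, rebuild an equivalent scaler from the saved coefficients; fitting on
# two rows (the mins and the maxes) reproduces the same transform
restored = MinMaxScaler()
restored.fit(asarray([coefficients['min'], coefficients['max']]))
print(restored.transform(X_train))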

The choice of which approach is appropriate for your project is up to you, but I recommend saving the model and data preparation object (or objects) to file directly for later use.

To make the idea of saving the model object and data transform object to file concrete, let’s look at a worked example.

How to Save and Later Use a Data Preparation Object

In this section, we will demonstrate preparing a dataset, fitting a model on the dataset, saving the model and data transform object to file, and later loading the model and transform and using them on new data.

1. Define a Dataset

First, we need a dataset.

We will use a test dataset from the scikit-learn library, specifically a binary classification problem with two input variables created randomly via the make_blobs() function.

The example below creates a test dataset with 100 examples, two input features, and two class labels (0 and 1). The dataset is then split into training and test sets, and the min and max values of each variable are reported.

Importantly, the random_state is set when creating the dataset and when splitting the data so that the same dataset is created and the same split of data is performed each time that the code is run.
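
A minimal sketch of this example is below; the one-third test split is an assumption, but the remaining steps follow the description above.

# example of creating the test dataset and splitting it into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# prepare the dataset: 100 examples, two input features, two classes
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# split the data into train and test sets, with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# report the min and max values of each variable in both datasets
for i in range(X_train.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' %
        (i, X_train[:, i].min(), X_train[:, i].max(), X_test[:, i].min(), X_test[:, i].max()))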

Running the example reports the min and max values for each variable in both the train and test datasets.

We can see that each variable has a different scale, and that the scales differ between the train and test datasets. This is a realistic scenario that we may encounter with a real dataset.

2. Scale the Dataset

Next, we can scale the dataset.

We will use the MinMaxScaler to scale each input variable to the range [0, 1]. The best practice use of this scaler is to fit it on the training dataset and then apply the transform to the training dataset and any other datasets: in this case, the test dataset.

The complete example of scaling the data and summarizing the effects is listed below.
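
A minimal sketch of this example, building on the dataset code above, might look like this:

# example of scaling the dataset with a MinMaxScaler fit on the training set only
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# prepare and split the dataset as before
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the scaler on the training dataset only
scaler = MinMaxScaler()
scaler.fit(X_train)
# apply the transform to both the train and test datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# summarize the effect of the transform
for i in range(X_train_scaled.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' %
        (i, X_train_scaled[:, i].min(), X_train_scaled[:, i].max(),
        X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))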

Running the example prints the effect of the scaling, showing the min and max values for each variable in the train and test datasets.

We can see that all variables in both datasets now have values in the desired range of 0 to 1.

3. Save Model and Data Scaler

Next, we can fit a model on the training dataset and save both the model and the scaler object to file.

We will use a LogisticRegression model because the problem is a simple binary classification task.

The training dataset is scaled as before, and in this case, we will assume the test dataset is currently not available. Once scaled, the dataset is used to fit a logistic regression model.

We will use the pickle framework to save the LogisticRegression model to one file, and the MinMaxScaler to another file.

The complete example is listed below.
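
A minimal sketch of this example might look like the following; it saves to the two file names listed further down.

# example of fitting a model on the scaled training data and saving both objects with pickle
from pickle import dump
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# prepare the dataset; we assume the test set is not available at this point
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the scaler on the training dataset and scale it
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
# fit the model on the scaled training dataset
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)
# save the model and the scaler to separate files
dump(model, open('model.pkl', 'wb'))
dump(scaler, open('scaler.pkl', 'wb'))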

Running the example scales the data, fits the model, and saves the model and scaler to files using pickle.

You should have two files in your current working directory:

  • model.pkl
  • scaler.pkl

4. Load Model and Data Scaler

Finally, we can load the model and the scaler object and make use of them.

In this case, we will assume that the training dataset is not available, and that only new data or the test dataset is available.

We will load the model and the scaler, then use the scaler to prepare the new data and use the model to make predictions. Because it is a test dataset, we have the expected target values, so we will compare the predictions to the expected target values and calculate the accuracy of the model.

The complete example is listed below.
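
A minimal sketch of this example, consistent with the sketches above, might be:

# example of loading the model and scaler and using them on the test dataset
from pickle import load
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# re-create the dataset; we assume only the test set is available at this point
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
_, X_test, _, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# load the model and the scaler from file
model = load(open('model.pkl', 'rb'))
scaler = load(open('scaler.pkl', 'rb'))
# check the scale of the test data before and after the transform
for i in range(X_test.shape[1]):
    print('>%d, raw: min=%.3f, max=%.3f' % (i, X_test[:, i].min(), X_test[:, i].max()))
X_test_scaled = scaler.transform(X_test)
for i in range(X_test_scaled.shape[1]):
    print('>%d, scaled: min=%.3f, max=%.3f' % (i, X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))
# make predictions on the prepared test data and evaluate them
yhat = model.predict(X_test_scaled)
acc = accuracy_score(y_test, yhat)
print('Test Accuracy: %.3f' % (acc * 100))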

Running the example loads the model and scaler, then uses the scaler to prepare the test dataset correctly for the model, meeting the expectations of the model when it was trained.

To confirm the scaler is having the desired effect, we report the min and max value for each input feature both before and after applying the scaling. The model then makes a prediction for the examples in the test set and the classification accuracy is calculated.

In this case, as expected, with the dataset correctly normalized, the model achieved 100 percent accuracy on the test set because the test problem is trivial.

This provides a template that you can use to save both your model and scaler object (or objects) to file on your own projects.

Summary

In this tutorial, you discovered how to save a model and data preparation object to file for later use.

Specifically, you learned:

  • The challenge of correctly preparing test data and new data for a machine learning model.
  • The solution of saving the model and data preparation objects to file for later use.
  • How to save and later load and use a machine learning model and data preparation object on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

48 Responses to How to Save and Reuse Data Preparation Objects in Scikit-Learn

  1. Venkat November 21, 2019 at 3:59 pm #

    Thank You for providing valuable information about Data Preparation and explanation about data set.

  2. Elie Kawerk November 21, 2019 at 8:42 pm #

    Hi Jason,

    Wouldn’t it be more suitable to wrap scaler and model in a sklearn pipeline object and pickle the pipeline?

    Best,
    Elie

    • Jason Brownlee November 22, 2019 at 6:02 am #

      Yes, agreed!

      In this case, I was trying to get across the idea of saving the scaler objects. A pipeline would be a simpler implementation.

  3. Alberto Morales November 26, 2019 at 6:40 pm #

    Hi, guys.

    It’s an interesting post, but… how about running these processes in production?

    load(open.. load? dump(model, open…? to “local” filesystems? It’s not enough. Do you agree?

    A lot of organizations fail during the last step (going into production); it’s necessary to apply software engineering techniques to succeed.

    Just pickling a bunch of objects doesn’t help you.

    Kind regards.

    • Jason Brownlee November 27, 2019 at 6:02 am #

      Nice point.

      Yes, saving/loading is good as a first step. A better approach might be to store the coefficients used for scaling in a config and load them into the app and apply them to new data as needed, either with custom data prep code or plugging the coefficients into sklearn objects.

      Does that help?

      • Nadia Chaabouni May 8, 2020 at 1:59 am #

        Thank you Jason for this interesting post!
        Could you please give a small example or point out a useful link for the config method?
        Thank you in advance

      • Stephen Fickas June 18, 2021 at 4:26 am #

        Hi Jason

        You noted “A better approach might be to store the coefficients used for scaling in a config and load them into the app.”

        I’ve run into this problem with KNNImputer from sklearn.impute. I fitted it with the full dataset and saved the imputer. At production time I load it back in. But when I get a new sample, the imputed values are not correct. You mention saving coefficients. I don’t see how to save coefficients from the imputer. Am I missing something?

        Thanks.

        • Jason Brownlee June 18, 2021 at 5:48 am #

          Perhaps you can try a different imputer that either requires less overhead or works better on new data.

  4. Cristian Lazaro February 21, 2020 at 4:30 pm #

    Thank you, what a great post, much appreciated. It helped me quite a bit. Very informative.

  5. Emily May 11, 2020 at 4:11 pm #

    Hi Jason,

    I built a clustering model to segment customer accounts. It had the following steps:

    – Min-Max scaler
    – PCA for dimensionality reduction and then
    – K means model to get ~ 5 clusters

    I saved the entire pipeline using pickle, and now I have to refresh the model with data for the latest month. When I refreshed the model, the clusters drastically changed compared to last month. I know they are stochastic in nature, but I didn’t expect ~40% movement of accounts.

    Is it because of the choice of algorithm, I mean k-means? What would be the best model refresh strategy, given that I’ll have to refresh every month?

    I know for sure that the Business owner won’t be happy if there’s that much movement.

    • Jason Brownlee May 12, 2020 at 6:36 am #

      Good question!

      You may have to re-fit the model with all of the old and the new data. I don’t think the sklearn models support incremental updates.

  6. VLADIMIR KIM May 25, 2020 at 4:41 pm #

    Thank you very, very much! Exactly what I needed.

    • Jason Brownlee May 26, 2020 at 6:15 am #

      You’re welcome.

  7. Mukul Verma May 27, 2020 at 5:13 am #

    Hi there, thank you so much for such a great tutorial. You saved a lot of my time. I had been worrying about how to give scaled input to a model (loaded from the .h5 file of a model I trained earlier) so as to get the same scaled input data.
    Thank you so much.

    • Jason Brownlee May 27, 2020 at 8:03 am #

      Load the data, scale it, load the model, pass the data to the loaded model.

  8. sandip pani July 7, 2020 at 7:00 pm #

    HI Jason,
    This is helpful. Thanks for this article.

    Can you also explain: suppose I have preprocessing steps where I use custom logic to fill missing values. For example, if Age is missing, I find the mode of age for each group and assign it accordingly. At prediction time, the user enters all details except Age in the form, so would I call the custom preprocessing here?

    • Jason Brownlee July 8, 2020 at 6:29 am #

      Thank you.

      Yes, you would need to save any statistics used by custom data prep.

  9. dan August 4, 2020 at 10:21 pm #

    Thanks for your post, that’s very useful. I have a question. I have an unlabeled dataset on which I already performed a cluster analysis, in order to use the cluster labels as a target variable in a supervised classification task. So I have trained my model and saved it using pickle. If I want to use my model with new data, do I need to have labeled data? In other words, does the pickle object not keep the information about the labels?

    Thanks in advance

    • Jason Brownlee August 5, 2020 at 6:13 am #

      A supervised learning model can be used to make predictions on new data, input only.

      This usage of the model is the goal of supervised learning – to predict labels for new data.

  10. John October 8, 2020 at 11:59 pm #

    Thank you, this is very helpful!

    There may be a problem if the new (or test) dataset includes features with min/max values lower/higher than in the training dataset. I assume the scaler could still handle it, but the transformed high values would be above 1? It could be worse in the case of categorical features which are not represented in the training set. Do you know a solution for this case?

    Thank you in advance

    • Jason Brownlee October 9, 2020 at 6:47 am #

      You might need to manually clip new values to the known range first.

      For categorical variables, you can set an argument to ignore new labels not seen during training, e.g. map them to all zeros.

      • John October 9, 2020 at 5:59 pm #

        Great, thank you for the advice! I didn’t know that there is an argument to ignore new labels.

        • Jason Brownlee October 10, 2020 at 7:01 am #

          Yes, set handle_unknown='ignore' on the OneHotEncoder.

  11. LMannik November 12, 2020 at 5:36 am #

    Hi, thanks for this blog. It’s my go-to for machine learning questions and answers. I have a question about assigning the fit for the model in these steps from your code:

    # define model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_scaled, y_train)

    For example, in the last line, model.fit(X_train_scaled, y_train), I expected to see something like fitted = model.fit(X_train_scaled, y_train), and then you’d pickle “fitted”. How are you saving the trained model without assigning it to a new attribute? Or are you really just pickling the untrained model from the previous line of code: model = LogisticRegression(solver=’lbfgs’)?

    • Jason Brownlee November 12, 2020 at 6:43 am #

      You’re welcome.

      We fit the model with a call to fit(), then save it. Recall that model is an object; after fitting, it contains the coefficients required by the model.

  12. Hesham Abdelghany January 1, 2021 at 4:38 pm #

    Hi Jason,

    Thanks for the great post.
    I have a question regarding scaling the output target variable:

    1) Does the scaling method for the target variable need to be the same as for the input features?

    2) Do I need to save two scaling objects, one for the features and a different one for the target variable?

    3) If I pass the scaling object for the target variable from the training stage to the prediction stage, wouldn’t that be a cause of data leaking from training to production?

    • Jason Brownlee January 2, 2021 at 6:21 am #

      You’re welcome.

      No, you can use different scaling on different variables.

      Yes, each object would need to be saved.

      No, data leakage refers to using information from the test set during the training of the model.

  13. vian January 2, 2021 at 7:16 am #

    Thanks a lot, you always help me so much. God bless you.

    • Jason Brownlee January 2, 2021 at 7:56 am #

      You’re welcome!

      • vian January 3, 2021 at 12:46 am #

        Please, I have a question.
        If I have 30 features and want to reduce 6 of them to one feature (30 becomes 25), and I use PCA on my training data and save the model, can I use it later on other test data?

  14. Mark Littlewood February 16, 2021 at 11:11 pm #

    Also, how about, after fitting the scaler:

    model.scaler = scaler

    and then pickling the model?

    When loading, just use model.scaler to scale the test file.

    • Jason Brownlee February 17, 2021 at 5:28 am #

      You can save the data prep + model in a pipeline (save the pipeline object). That would be my recommendation.

  15. anil December 14, 2021 at 9:13 pm #

    Hi. Why are we assuming the test or train datasets are unavailable? When would this happen?

    • Adrian Tam December 15, 2021 at 7:22 am #

      If you saved the model, why would you need to access the training dataset again?

    • James Carmichael December 21, 2021 at 11:32 pm #

      Hi Anil…Once a model is trained and tested it may then be ready for validation. Validation data is data never seen by the network and in fact could represent what the model would encounter in practical use or “the real world”. Thus, when being used in the real world, the model will no longer be trained and tested unless it undergoes a fundamental redesign or major modification.

      -Regards,

  16. Daniel April 14, 2022 at 10:28 am #

    Hi, thanks for the post. I have a question: I already preprocessed the data (reshaped it to a vector for KNN classification) and I am not scaling it. How do I save the data without scaling it? In your post you scaled the data and saved the scaler, but I am not scaling my data, as it is not required.
    Thanks.

    • James Carmichael April 15, 2022 at 7:36 am #

      Hi Daniel…You may simply leave out the steps that scale the data.

  17. Ali Raza May 8, 2022 at 8:22 pm #

    I have a deep learning LSTM model, and I have two questions:

    1. Do I need to save the scaler in .h5 format?
    2. How do I save the scaler in .h5 format?

  18. gsamaras June 8, 2022 at 12:15 am #

    Thanks for the good post. Question: in Section 4, you get y_test (line 9) and then compare it against yhat in acc = accuracy_score(y_test, yhat), where yhat is the prediction performed on scaled data. Is that correct? I mean, won’t yhat belong to the scaled space, while y_test belongs to the original (unscaled) space?

    For example, according to my intuition: I have a dataset in [-100, 100]. I scale it to [0, 1]. A new data point comes in, ready for prediction, e.g. 99. The ground truth is 100.
    I scale it; let’s say it becomes 0.98.
    I pass it to the model, which I’ll assume will predict something close to [0, 1], probably something a bit greater than 1, let’s say 1.05.

    Now I thought that I should unscale (somehow with inverse_transform) 1.05, so that it maps back to the original space and becomes 105, for example.

    So I would say that my predicted value is 105 and the ground truth value 100. Can you help me with this, please?

    • James Carmichael June 8, 2022 at 4:03 am #

      Hi gsamaras…Please rephrase or clarify your question so that we may better assist you.

      • gsamaras June 8, 2022 at 5:13 am #

        In section “4. Load Model and Data Scaler”, in code line 26: acc = accuracy_score(y_test, yhat), where y_test is unscaled data, whereas yhat is scaled data.

        I would expect that one would compare the two vectors in the same space, either both scaled, or both unscaled. In the example this doesn’t happen, which confuses me.

        Is it clearer now Jason?

        PS: Sorry for replying now, but I didn’t receive any email notification about your reply.

  19. Alfons February 24, 2023 at 1:55 am #

    gsamaras, a little bit late, but you are right – one needs the yhat_scaler too, to do the inverse to get back to the original space.
