Last Updated on August 22, 2019
Finding an accurate machine learning model is not the end of the project.
In this post you will discover how to finalize your machine learning model in R, including: making predictions on unseen data, re-building the model from scratch, and saving your model for later use.
Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.
Let’s get started.

Finalize Your Machine Learning Model in R.
Photo by Christian Schnettelker, some rights reserved.
Finalize Your Machine Learning Model
Once you have an accurate model on your test harness you are nearly done, but not quite.
There are still a number of tasks to do to finalize your model. The whole idea of creating an accurate model for your dataset was to make predictions on unseen data.
There are three tasks you may be concerned with:
- Making new predictions on unseen data.
- Creating a standalone model using all training data.
- Saving your model to file for later loading and making predictions on new data.
Once you have finalized your model you are ready to make use of it. You could use the R model directly. You could also discover the key internal representation found by the learning algorithm (like the coefficients in a linear model) and use them in a new implementation of the prediction algorithm on another platform.
In the next section, you will look at how you can finalize your machine learning model in R.
Need more Help with R for Machine Learning?
Take my free 14-day email course and discover how to use R on your project (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Finalize Predictive Model in R
Caret is an excellent tool that you can use to find a good, or even the best, machine learning algorithm and parameters for your problem.
But what do you do after you have discovered a model that is accurate enough to use?
Once you have found a good model in R, you have three main concerns:
- Making new predictions using your tuned caret model.
- Creating a standalone model using the entire training dataset.
- Saving/Loading a standalone model to file.
This section will step you through how to achieve each of these tasks in R.
1. Make Predictions On New Data
You can make new predictions with a model you have tuned using caret by calling the predict.train() function.
In the recipe below, the dataset is split into a validation dataset and a training dataset. The validation dataset could just as easily be a new dataset stored in a separate file and loaded as a data frame.
A good model of the data is found using LDA. We can see that caret provides access to the best model from a training run in the finalModel variable.
We can use that model to make predictions by calling predict() on the fit from train(), which will automatically use the final model. We must specify the data on which to make predictions via the newdata argument.
# load libraries
library(caret)
library(mlbench)
# load dataset
data(PimaIndiansDiabetes)
# create 80%/20% for training and validation datasets
set.seed(9)
validation_index <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
validation <- PimaIndiansDiabetes[-validation_index,]
training <- PimaIndiansDiabetes[validation_index,]
# train a model and summarize model
set.seed(9)
control <- trainControl(method="cv", number=10)
fit.lda <- train(diabetes~., data=training, method="lda", metric="Accuracy", trControl=control)
print(fit.lda)
print(fit.lda$finalModel)
# estimate skill on validation dataset
set.seed(9)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$diabetes)
Running the example, we can see that the estimated accuracy on the training dataset was 76.91%. Using the final model to predict on the hold-out validation dataset, the accuracy was 77.78%, very similar to our estimate.
Resampling results

  Accuracy   Kappa    Accuracy SD  Kappa SD
  0.7691169  0.45993  0.06210884   0.1537133

...

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg  85  19
       pos  15  34

               Accuracy : 0.7778
                 95% CI : (0.7036, 0.8409)
    No Information Rate : 0.6536
    P-Value [Acc > NIR] : 0.000586

                  Kappa : 0.5004
 Mcnemar's Test P-Value : 0.606905

            Sensitivity : 0.8500
            Specificity : 0.6415
         Pos Pred Value : 0.8173
         Neg Pred Value : 0.6939
             Prevalence : 0.6536
         Detection Rate : 0.5556
   Detection Prevalence : 0.6797
      Balanced Accuracy : 0.7458

       'Positive' Class : neg
2. Create A Standalone Model
In this example, we have tuned a random forest with 3 different values for mtry and ntree set to 2000. By printing the fit and the finalModel, we can see that the most accurate value for mtry was 2.
Now that we know a good algorithm (random forest) and a good configuration (mtry=2, ntree=2000), we can create the final model directly using all of the training data. We can look up the “rf” random forest implementation used by caret in the Caret List of Models and note that it is using the randomForest package and in turn the randomForest() function.
The example creates a new model directly and uses it to make predictions on new data, in this case simulated by the validation dataset.
# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# train a model and summarize model
set.seed(7)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
fit.rf <- train(Class~., data=training, method="rf", metric="Accuracy", trControl=control, ntree=2000)
print(fit.rf)
print(fit.rf$finalModel)
# create standalone model using all training data
set.seed(7)
finalModel <- randomForest(Class~., training, mtry=2, ntree=2000)
# make predictions on "new data" using the final model
final_predictions <- predict(finalModel, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
We can see that the estimated accuracy of the optimal configuration was 85.07%, and that the accuracy of the final standalone model, trained on all of the training dataset and predicting on the validation dataset, was 82.93%.
Random Forest

167 samples
 60 predictor
  2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 151, 150, 150, 150, 151, 150, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
   2    0.8507353  0.6968343  0.07745360   0.1579125
  31    0.8064951  0.6085348  0.09373438   0.1904946
  60    0.7927696  0.5813335  0.08768147   0.1780100

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

...

Call:
 randomForest(x = x, y = y, ntree = 2000, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of error rate: 14.37%
Confusion matrix:
   M  R class.error
M 83  6  0.06741573
R 18 60  0.23076923

...

Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
Some simpler models, like linear models, can output their coefficients. This is useful because, from these, you can implement the simple prediction procedure in your language of choice and use the coefficients to get the same accuracy. This gets more difficult as the complexity of the representation increases.
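For example, here is a minimal sketch (on the built-in mtcars data, unrelated to the models above) showing how the coefficients of a linear regression fit with lm() can reproduce the model's predictions by hand, the same arithmetic you would port to another platform:

# fit a simple linear model on built-in data
fit <- lm(mpg ~ wt + hp, data=mtcars)
# extract the learned coefficients
coefs <- coef(fit)
print(coefs)
# reproduce the model's prediction for one new observation by hand
new_obs <- data.frame(wt=3.0, hp=120)
manual <- coefs["(Intercept)"] + coefs["wt"] * new_obs$wt + coefs["hp"] * new_obs$hp
built_in <- predict(fit, newdata=new_obs)
# the two values should match
print(c(manual=unname(manual), built_in=unname(built_in)))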
3. Save and Load Your Model
You can save your best models to a file so that you can load them up later and make predictions.
In this example we split the Sonar dataset into a training dataset and a validation dataset. We take our validation dataset as new data to test our final model. We train the final model using the training dataset and our optimal parameters, then save it to a file called final_model.rds in the local working directory.
The model is serialized. It can be loaded at a later time by calling readRDS() and assigning the object that is loaded (in this case a random forest fit) to a variable name. The loaded random forest is then used to make predictions on new data, in this case the validation dataset.
# load libraries
library(caret)
library(mlbench)
library(randomForest)
library(doMC)
registerDoMC(cores=8)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# create final standalone model using all training data
set.seed(7)
final_model <- randomForest(Class~., training, mtry=2, ntree=2000)
# save the model to disk
saveRDS(final_model, "./final_model.rds")

# later...

# load the model
super_model <- readRDS("./final_model.rds")
print(super_model)
# make predictions on "new data" using the final model
final_predictions <- predict(super_model, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
We can see that the accuracy on the validation dataset was 82.93%.
Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
Summary
In this post you discovered three recipes for working with final predictive models:
- How to make predictions using the best model from caret tuning.
- How to create a standalone model using the parameters found during caret tuning.
- How to save and later load a standalone model and use it to make predictions.
You can work through these recipes to understand them better. You can also use them as a template and copy-and-paste them into your current or next machine learning project.
Next Step
Did you try out these recipes?
- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.
Do you have a question? Ask it in the comments and I will do my best to answer it.
Hi. I’ve been working through the examples, and some models of my own, and I have a question on preprocessing my data.
When training my models, I used the preProcess argument to center and scale the data.
When I have a new data set to run through my model after training, does the model know to preprocess that data as well? Or do I have to manually scale and center the new data set before applying the model to it? What about after saving and reloading the final model?
This is one of the less-clear points in the docs.
Hi John, great question.
The safest way is to manage data scaling separately and to save scaled versions of your data, as well as the coefficients needed to scale new data in the future.
According to the caret documentation, any preprocessing applied during training will be applied to later calls to predict() on new data:
http://topepo.github.io/caret/model-training-and-tuning.html#preproc
I would expect this to be preserved if you saved the trained model, but I would suggest testing this out (check whether pre-save and post-save results match for a model with preprocessing).
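For reference, a minimal sketch of the pattern in question: preprocessing is requested via the preProcess argument to train(), the same centering and scaling are then applied automatically by predict() on new raw data, and the suggested pre-save/post-save check looks like this (the file name is illustrative):

# load libraries
library(caret)
library(mlbench)
# load dataset
data(PimaIndiansDiabetes)
# center and scale the inputs as part of training
set.seed(9)
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="lda",
    preProcess=c("center", "scale"), trControl=trainControl(method="cv", number=10))
# new raw (unscaled) data is transformed automatically before prediction
predictions <- predict(fit, newdata=PimaIndiansDiabetes[1:5, 1:8])
# check that a saved and reloaded model gives the same predictions
saveRDS(fit, "./lda_preproc.rds")
reloaded <- readRDS("./lda_preproc.rds")
print(identical(predictions, predict(reloaded, newdata=PimaIndiansDiabetes[1:5, 1:8])))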
Are there concerns about applying this to non-classification-problems?
What would be the differences?
No difference of note.
Hm…
Error: Metric Accuracy not applicable for regression models
Error in `[.data.frame`(validation, , 1:60) : undefined columns selected
1. & 2. fixed. Still getting errors:
Error in confusionMatrix.default(final_predictions, validation$n5) :
the data cannot have more levels than the reference
If I do a print(final_predictions), I get:
3 6 9 11 18 19 21 26 29 39 47 50 61 63 74 92 94 95 97 99 101 104 111 114
M R M R R R M R M R M R R R R R R R R M M R M M
117 120 126 129 137 138 145 146 148 155 160 164 170 171 173 191 206
M M M M M M M M M M M M M R M M M
The last data point has a row index of 206. The original data from Sonar has an end index of 208.
So where is the unseen data here? Are the row indexes independent of each other?
How do I predict one step of unseen data into the future with example 2?
Dear Jason, thanks for the precious post.
I’m going a bit too far maybe, but I’ve been wondering recently if it is possible to save a trained model in some kind of standard format so that it can be (1) sent over a network and (2) parsed even by a different language (I dunno, Java for example).
Are you aware of anything similar, for R or (possibly) other machine learning / statistics suites?
Apologies if the question is considered off-topic.
I believe there are standard model formats, but I am not an expert in them.
See the Predictive Model Markup Language
https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
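For example, the pmml package in R can export some fitted models (including randomForest fits) to the PMML XML standard; a rough sketch, assuming the pmml package supports your model type and version:

# load libraries (pmml converts supported R models to the PMML XML standard)
library(randomForest)
library(pmml)
library(XML)
# fit a small model on built-in data
data(iris)
set.seed(7)
fit <- randomForest(Species~., data=iris, ntree=100)
# convert the fitted model to a PMML document and write it to disk
fit_pmml <- pmml(fit)
saveXML(fit_pmml, file="./iris_rf.pmml")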
Hi Jason – Thanks for the example! This came in very handy and just in time for me.
I’m glad to hear that!
Hi Jason
Great website! Thank you for the very valuable information.
I have a short question concerning the “standalone model”. Why is it necessary to use it after using the caret one? I thought both would be equivalent (i.e. caret uses the randomForest one in this example with the “rf” method) and am now a bit confused.
Cheers,
Chris
Caret is a wrapper for the models that can help us find which model to use.
Once we know which model to use, we can use it directly without the caret wrapper – if we want.
Thanks!
Hi Jason,
Excellent work and quite helpful for those of us who are in learning mode. Just one question here. Suppose I have 100 data points and I divide them into train and test (80 and 20). I fit the model and see accuracy of 75%. My question is: how do I implement this model so that when the 101st and 102nd data points arrive, the model runs and provides me some classification?
I believe this post shows you how to save the model and load it to make predictions on new data.
Perhaps this post will make things clearer:
https://machinelearningmastery.com/train-final-machine-learning-model/
Hi Jason.
The “finalize machine learning models” article above is helpful, and I can’t wait to try it out. However, I notice that it’s making one prediction. Is there a way to get multiple predictions at once, based on criteria, and give them as a list? Like this:
Item 1…result
Item 2…result
Yes, the predict function will support this by default.
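For example, reusing the LDA fit and validation data from the first recipe above, predict() on a multi-row data frame returns one prediction per row, which you can pair back with the inputs as a simple table (a sketch):

# predict() on a multi-row data frame returns one prediction per row
predictions <- predict(fit.lda, newdata=validation)
# pair each item with its result as a simple table
results <- data.frame(item=rownames(validation), result=predictions)
print(head(results))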
Dear Sir,
I am doing a dissertation on load balancing in distributed computing systems, in which I want to predict the future incoming jobs on the basis of past load information given in a real-time dataset. So how can I apply this technique in my work?
This might be a good place for you to start:
https://machinelearningmastery.com/start-here/#timeseries
Hello Jason , I have a question, maybe you can help me.
I am using caret, and I need to print the function of the best random forest model; I am talking about all the trees, all 100 of them.
I know with model$finalModel I can print the function of a neural network or an rpart tree, but this command doesn’t print all the trees of a random forest or the function of boosted trees.
Do you know how I can do it?
Thanks
Sorry, I have not printed all of the trees from a random forest before. Perhaps there is a third party tool to help?
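One option that may be worth trying (an untested sketch) is the getTree() helper in the randomForest package, which returns the structure of a single tree from the forest and can be looped over; note it works on the underlying randomForest object (model$finalModel for a caret fit), not on boosted trees:

# extract the structure of individual trees from a randomForest fit
library(randomForest)
rf <- model$finalModel  # the underlying randomForest object from a caret fit
# print the first tree as a readable table
print(getTree(rf, k=1, labelVar=TRUE))
# loop over all trees if needed (can be very verbose)
for (k in 1:rf$ntree) {
    print(getTree(rf, k=k, labelVar=TRUE))
}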
Hey Jason! Thanks for the post. I have a question to you.
I have built a highly accurate random forest model for a student enrollment prediction project, let’s say, for this year. Now I have saved the model and want to use this model for data belonging to, let’s assume, last year. But the number of variables in my last year’s data has changed. How do I approach this situation? Is there any way to scale this model to fit the new case?
Thank you!
This process will help you work through your problem systematically:
https://machinelearningmastery.com/start-here/#process
Thanks Jason!
Hello guys! What ML tool (regression model) would be appropriate for predicting call center call volume?
This process will help you work through your predictive modeling problem:
https://machinelearningmastery.com/start-here/#process
Hi Jason,
many thanks for this superb post!
Would like to know if the following process (essentially what you did in this post) is considered best practice or common practice in ML and DS:
1. Perform cross-validation to tune hyperparameters
2. Using the optimised hyperparameters, train on the ENTIRE training dataset to arrive at optimal accuracy?
And also, in part 3 on saving and loading the model, I am encountering problems with the doMC package in R, and I understand that doMC is not usable on Windows. Could you suggest an alternative workaround for this?
Thank you so much
Yes, you can learn more here:
https://machinelearningmastery.com/train-final-machine-learning-model/
The doMC might not be supported on Windows. Perhaps comment it out for now?
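Alternatively, the doParallel package works on Windows and registers a parallel backend for caret in much the same way as doMC; a sketch (the worker count is illustrative):

# doParallel is a cross-platform alternative to doMC for caret's parallel backend
library(doParallel)
cl <- makeCluster(4)  # use 4 worker processes; adjust for your machine
registerDoParallel(cl)
# ... call train() as usual; caret will use the registered backend ...
stopCluster(cl)  # release the workers when finished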
Hi Jason,
In the event that you have a large saved model, is there a faster way to read it? readRDS() takes 5+ seconds to read a .rds file of size 51,000 KB.
Thank you so much.
I don’t know, sorry.
Hello. After choosing the best model, how do I load it and where can I find my results? For example, I want to predict the price of an equity tomorrow. How can I load my optimal neural network, how can I input the new inputs, and where can I find my actual results?
Thank you
I show how to load the model and make a prediction in the above tutorial.
A very good article, very helpful. I want to ask one thing regarding deployment of my R-language telecom churn prediction model into my production environment. I am clear up to saving the standalone model and how to predict on new test data. I joined my job recently, so I want to know about this. Could you help me with this, or please share relevant links about how to consume a prediction model?
I have some suggestions here that might help:
https://machinelearningmastery.com/deploy-machine-learning-model-to-production/
Hi Jason,
Can I use that code for a time series dataset?
Thank you.
Perhaps.
Hi Jason,
Can you give me some of your tutorials on machine learning used for time series data in R?
Thank you.
This is a common question that I answer here:
https://machinelearningmastery.com/faq/single-faq/do-you-have-material-on-time-series-in-r
Should we use all training data or all the data (both training and validation) to train the standalone model?
Since both the training and validation data have the ground-truth labels, should we use all the data to train the final standalone model, or just the training data? My thought is that we should use all the data with true labels for our final model to predict unknown data. Am I right? Can you please confirm?
Once you choose a model/config, fit it on all available data with labels.
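For example, with the Sonar recipe above, that final fit is a small change: combine the two splits before training the standalone model (a sketch using the tuned parameters):

# combine every labelled row before fitting the final model
library(randomForest)
all_data <- rbind(training, validation)
set.seed(7)
final_model <- randomForest(Class~., all_data, mtry=2, ntree=2000)
# save the model; from here on it is only used on genuinely new, unlabelled data
saveRDS(final_model, "./final_model.rds")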
Hi,
I’ve created a neural net which I’m using to classify text as valid or invalid. It performs really well in the environment I trained it in, but when I save the model and take it into a new environment it performs differently. I have given it the same data to predict but the outcome is completely different.
Do you have any idea why this might be?
It sounds like the model has overfit the training data.
This may help:
https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
Hi Jason, Super helpful tutorial once again!
This worked well for a series of 12 models I developed tested and saved.
However, for the xgboost model, when I reloaded it for prediction it shows in the RStudio global environment as length 0, size 0 B, and I got the following error when I tried to run the predict: Error in { : task 1 failed – “object ‘m2predict’ not found”. However, the RDS file for this model shows as 386,450 KB on my hard drive.
I found that there seems to be other functions specific for xgboost (xgb.save?) which is good to know for the future.
Other than retraining the model is there a way to get the existing RDS file with the xgboost model to work or convert it to xgb.save file type?
Thanks in advance for any suggestions!
Hi Jason, Found the underlying bug!
Had nothing to do with the RDS. That seems to be working fine.
Issue was related to variable names from the base models being stacked into the XGBoost.
All good here.
Thanks again for the great tutorials!
Well done, I’m happy to hear that.
Perhaps try running your script from the command line instead of in R studio?
Hi Jason, I have used the clickstream package in R (on parsed log files); it is a statistical model to predict which link the user is going to click on next.
I’m quite confused about the deployment of the model and how to update it when new data comes in. I guess as new data comes in the model should be trained with all the data (the new and the old), am I right? So in this case, how do I do this?
P.S: I’m new to R and trying to learn data science by myself so any help would be appreciated.
I’m not familiar with that package, sorry.
Normally, a model is fit on available data then deployed for use on making predictions on new data.
Streaming data might be very different and I am not an expert in that area.
Hi Jason,
In your example above, the Sonar data set contains only numeric variables (no factors).
My data set contains both factors and numeric variables. And because of this, the forest saved by “saveRDS” will give an error when the forest is reloaded using “readRDS” to generate predictions.
The error message is “type of predictors in new data do not match that of the training data”. But I am sure, and have checked many times, that the types are the same.
The error will go away if I convert all variables to numeric, just like your example.
Any suggestions? Thanks.
Some models may take quite some time to be trained. How is the time factor handled when models are implemented in production?
Thanks in advance.
Models are trained once and then used for prediction.
Retraining a model should be a rare and scheduled event – although depends on the domain and how often it changes enough to require a model update.
Once you train a model and have the confusion matrix, how can you tell which variables have the largest contribution to the model (or the biggest predictive effect)? For example, a regression analysis will tell you which variables are better predictors of the outcome. Is this possible with machine learning models? I think in Python there is an “eli5.show_prediction” function that makes this possible, but I do not know of similar functions in R.
Secondly, can we control for dependency in the data when using machine learning? If we have a dataset with three levels (e.g. test scores, within individuals, within schools) can you control for dependency like you can when using multi-level models?
Yes, some models like decision trees will provide variable importance scores. Most models do not.
Perhaps. You can choose what you’re modeling, e.g. across test/people/schools.
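For caret fits, the varImp() function is one way to get such scores; a sketch reusing the random forest fit from the second recipe above:

# variable importance scores from a caret-trained model
importance <- varImp(fit.rf)
print(importance)
# plot the most important predictors
plot(importance, top=20)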
Hi Jason
Very good article.
What if I want to see the prediction results in the form of a CSV file?
So, I can compare actual data with prediction data row by row.
Please help. Thank you.
Thanks.
You can make a prediction then save the prediction to CSV file.
I did; however, the CSV file only saves the ID and prediction value, which I think is all that “predict” returns. How can I save them together with the other variables in the test file?
Yes, create a new matrix or data frame and save that to file. If this is new to you, you might have to check the documentation.
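A minimal sketch of that pattern (final_model and test_data here are placeholders for your own fit and test set):

# attach the predictions to the original test rows and write everything to CSV
predictions <- predict(final_model, test_data)
results <- cbind(test_data, predicted=predictions)
write.csv(results, "./predictions_with_inputs.csv", row.names=FALSE)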
Excellent work Jason.
I have several Caret trained models, and when saved (saveRDS, compressed) they are 100-200+ MB. I am now deploying these to Shiny and they take forever to load and often fail.
Is there a way to make the models smaller, or better yet, is there a predict function that allows me to pass in the method and parameters only, without all the training/pre-processing/sample data?
Wow, they are big models!
Perhaps the models are saving the entire training dataset as well; you could check the object or the API to see if this is the case and then clear it.
In some cases, using/writing the predict function directly with learned coefficients (such as for a regression model) can be very effective.
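As a sketch of things to check for the large-model problem above (the trainingData element and the returnData option are common culprits, but verify against your caret version; fit stands for your saved train object):

# see which parts of the caret train object take up the most space
print(sapply(fit, function(x) format(object.size(x), units="Mb")))
# option 1: drop the stored copy of the training data before saving
fit$trainingData <- NULL
saveRDS(fit, "./smaller_model.rds")
# option 2: tell caret not to keep the training data in the first place
control <- trainControl(method="cv", number=10, returnData=FALSE)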