Finding an accurate machine learning model is not the end of the project.

In this post you will discover how to finalize your machine learning model in R including: making predictions on unseen data, re-building the model from scratch and saving your model for later use.

Let’s get started.

## Finalize Your Machine Learning Model

Once you have an accurate model on your test harness, you are nearly done. But not yet.

There are still a number of tasks to do to finalize your model. The whole idea of creating an accurate model for your dataset was to make predictions on unseen data.

There are three tasks you may be concerned with:

- Making new predictions on unseen data.
- Creating a standalone model using all training data.
- Saving your model to file for later loading and making predictions on new data.

Once you have finalized your model you are ready to make use of it. You could use the R model directly. You could also discover the key internal representation found by the learning algorithm (like the coefficients in a linear model) and use them in a new implementation of the prediction algorithm on another platform.

In the next section, you will look at how you can finalize your machine learning model in R.

## Finalize Predictive Model in R

Caret is an excellent tool for finding good, or even the best, machine learning algorithms and algorithm parameters for your problem.

But what do you do after you have discovered a model that is accurate enough to use?

Once you have found a good model in R, you have three main concerns:

- Making new predictions using your tuned caret model.
- Creating a standalone model using the entire training dataset.
- Saving/Loading a standalone model to file.

This section will step you through how to achieve each of these tasks in R.

### 1. Make Predictions On New Data

Once you have tuned a model with caret, you can make new predictions with the *predict.train()* function.

In the recipe below, the dataset is split into a validation dataset and a training dataset. The validation dataset could just as easily be a new dataset stored in a separate file and loaded as a data frame.

A good model of the data is found using LDA. We can see that caret provides access to the best model from a training run in the finalModel variable.

We can use that model to make predictions by calling predict() on the fit returned by train(), which will automatically use the final model. We must specify the data on which to make predictions via the *newdata* argument.

```r
# load libraries
library(caret)
library(mlbench)
# load dataset
data(PimaIndiansDiabetes)
# create 80%/20% for training and validation datasets
set.seed(9)
validation_index <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
validation <- PimaIndiansDiabetes[-validation_index,]
training <- PimaIndiansDiabetes[validation_index,]
# train a model and summarize model
set.seed(9)
control <- trainControl(method="cv", number=10)
fit.lda <- train(diabetes~., data=training, method="lda", metric="Accuracy", trControl=control)
print(fit.lda)
print(fit.lda$finalModel)
# estimate skill on validation dataset
set.seed(9)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$diabetes)
```

Running the example, we can see that the estimated accuracy on the training dataset was 76.91%. Using the finalModel in the fit, we can see that the accuracy on the hold out validation dataset was 77.78%, very similar to our estimate.

```
Resampling results

  Accuracy   Kappa    Accuracy SD  Kappa SD
  0.7691169  0.45993  0.06210884   0.1537133

...

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg  85  19
       pos  15  34

               Accuracy : 0.7778
                 95% CI : (0.7036, 0.8409)
    No Information Rate : 0.6536
    P-Value [Acc > NIR] : 0.000586

                  Kappa : 0.5004
 Mcnemar's Test P-Value : 0.606905

            Sensitivity : 0.8500
            Specificity : 0.6415
         Pos Pred Value : 0.8173
         Neg Pred Value : 0.6939
             Prevalence : 0.6536
         Detection Rate : 0.5556
   Detection Prevalence : 0.6797
      Balanced Accuracy : 0.7458

       'Positive' Class : neg
```
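When the new data has no known outcome, the same call to predict() still applies. The sketch below is illustrative only: it assumes the new observations arrive as a data frame with the same predictor columns as the training data (here simulated by dropping the *diabetes* column from the first five rows), and `new_data` is a hypothetical name. It also shows *type="prob"*, which asks caret for class probabilities instead of class labels.

```r
# load libraries
library(caret)
library(mlbench)
# load dataset and fit the tuned model
data(PimaIndiansDiabetes)
set.seed(9)
control <- trainControl(method="cv", number=10)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=control)
# hypothetical "new data": predictors only, no outcome column
# (in practice this might come from read.csv() on a separate file)
new_data <- PimaIndiansDiabetes[1:5, names(PimaIndiansDiabetes) != "diabetes"]
# one predicted class per row
new_classes <- predict(fit.lda, newdata=new_data)
# one row of class probabilities per row
new_probs <- predict(fit.lda, newdata=new_data, type="prob")
print(data.frame(item=rownames(new_data), class=new_classes, new_probs))
```

Because predict() is vectorized over rows, the same code handles one new observation or thousands.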

### 2. Create A Standalone Model

In this example, we have tuned a random forest with 3 different values for *mtry* and *ntree* set to 2000. By printing the fit and the finalModel, we can see that the most accurate value for *mtry* was 2.

Now that we know a good algorithm (random forest) and a good configuration (*mtry=2*, *ntree=2000*), we can create the final model directly using all of the training data. We can look up the “*rf*” random forest implementation used by caret in the Caret List of Models and note that it uses the *randomForest* package and, in turn, the *randomForest()* function.

The example creates a new model directly and uses it to make predictions on new data, in this case simulated by the validation dataset.

```r
# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# train a model and summarize model
set.seed(7)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
fit.rf <- train(Class~., data=training, method="rf", metric="Accuracy", trControl=control, ntree=2000)
print(fit.rf)
print(fit.rf$finalModel)
# create standalone model using all training data
set.seed(7)
finalModel <- randomForest(Class~., training, mtry=2, ntree=2000)
# make a predictions on "new data" using the final model
final_predictions <- predict(finalModel, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
```

We can see that the estimated accuracy of the optimal configuration was 85.07%. The final standalone model, trained on all of the training dataset and used to predict the validation dataset, achieved an accuracy of 82.93%.

```
Random Forest

167 samples
 60 predictor
  2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 151, 150, 150, 150, 151, 150, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
  2     0.8507353  0.6968343  0.07745360   0.1579125
  31    0.8064951  0.6085348  0.09373438   0.1904946
  60    0.7927696  0.5813335  0.08768147   0.1780100

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

...

Call:
 randomForest(x = x, y = y, ntree = 2000, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of error rate: 14.37%
Confusion matrix:
   M  R class.error
M 83  6  0.06741573
R 18 60  0.23076923

...

Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
```

Some simpler models, like linear models, can output their coefficients. This is useful because you can implement the simple prediction procedure in your language of choice and use the coefficients to get the same accuracy. This gets more difficult as the complexity of the representation increases.
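As a rough illustration of that idea, the sketch below fits a logistic regression with base R's glm() and then reproduces its predictions by hand from the coefficients alone. The mtcars dataset, the chosen predictors and the `predict_manual()` helper are assumptions made for this example and are not part of the recipes in this post.

```r
# fit a simple linear (logistic regression) model on a built-in dataset
fit <- glm(am ~ wt + hp, data=mtcars, family=binomial)
# the learned representation is just three numbers
coefs <- coef(fit)
print(coefs)
# hypothetical re-implementation of the prediction procedure:
# linear combination of inputs passed through the logistic function
predict_manual <- function(wt, hp, b) {
  z <- b[["(Intercept)"]] + b[["wt"]] * wt + b[["hp"]] * hp
  1 / (1 + exp(-z))
}
manual <- predict_manual(mtcars$wt, mtcars$hp, coefs)
# compare against the probabilities R itself produces
builtin <- predict(fit, newdata=mtcars, type="response")
print(all.equal(unname(builtin), manual))
```

The body of `predict_manual()` is simple enough to port line by line to another platform, which is the point of extracting the coefficients.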

### 3. Save and Load Your Model

You can save your best models to a file so that you can load them up later and make predictions.

In this example we split the Sonar dataset into a training dataset and a validation dataset. We take our validation dataset as new data to test our final model. We train the final model using the training dataset and our optimal parameters, then save it to a file called final_model.rds in the local working directory.

The model is serialized. It can be loaded at a later time by calling readRDS() and assigning the object that is loaded (in this case a random forest fit) to a variable name. The loaded random forest is then used to make predictions on new data, in this case the validation dataset.

```r
# load libraries
library(caret)
library(mlbench)
library(randomForest)
# optional: register a parallel backend (doMC may not be available on Windows)
library(doMC)
registerDoMC(cores=8)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# create final standalone model using all training data
set.seed(7)
final_model <- randomForest(Class~., training, mtry=2, ntree=2000)
# save the model to disk
saveRDS(final_model, "./final_model.rds")

# later...

# load the model
super_model <- readRDS("./final_model.rds")
print(super_model)
# make a predictions on "new data" using the final model
final_predictions <- predict(super_model, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
```

We can see that the accuracy on the validation dataset was 82.93%.

```
Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
```

## Summary

In this post you discovered three recipes for working with final predictive models:

- How to make predictions using the best model from caret tuning.
- How to create a standalone model using the parameters found during caret tuning.
- How to save and later load a standalone model and use it to make predictions.

You can work through these recipes to understand them better. You can also use them as a template and copy-and-paste them into your current or next machine learning project.

## Next Step

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.

Hi. I’ve been working through the examples, and some models of my own, and I have a question on preprocessing my data.

When training my models, I used the preProcess argument to center and scale the data.

When I have a new data set to run through my model after training, does the model know to preprocess that data as well? Or do I have to manually scale and center the new data set before applying the model to it? What about after saving and reloading the final model?

This is one of the less-clear points in the docs.

Hi John, great question.

The safest way is to manage data scaling separately and save scaled versions of your data as well as the coefficients needed to scale new data in the future.

According to the caret doco, any preprocessing applied during training will be applied to later calls to predict() on new data:

http://topepo.github.io/caret/model-training-and-tuning.html#preproc

I would expect this to be preserved if you saved the trained model, but I would suggest testing this out (do pre-save and post-save results match for a model with preprocessing?).
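One way to run that check is sketched below. It assumes the Pima Indians diabetes data and LDA from the first recipe, trains with *preProcess=c("center", "scale")*, and compares predictions made before saving with predictions made after reloading. The temporary file path and variable names are illustrative only.

```r
# load libraries
library(caret)
library(mlbench)
# load dataset and split it
data(PimaIndiansDiabetes)
set.seed(9)
validation_index <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
validation <- PimaIndiansDiabetes[-validation_index,]
training <- PimaIndiansDiabetes[validation_index,]
# train with centering and scaling handled by caret
control <- trainControl(method="cv", number=10)
fit <- train(diabetes~., data=training, method="lda",
             preProcess=c("center", "scale"), trControl=control)
# predictions from the in-memory model on raw (unscaled) new data
before <- predict(fit, newdata=validation)
# save the whole caret fit (not just finalModel) and reload it
path <- tempfile(fileext=".rds")
saveRDS(fit, path)
reloaded <- readRDS(path)
# predictions from the reloaded model on the same raw data
after <- predict(reloaded, newdata=validation)
print(identical(before, after))
```

Note that the object saved here is the full caret fit. If only *fit$finalModel* were saved, the preprocessing step would have to be stored and applied separately.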

Are there concerns about applying this to non-classification-problems?

What would be the differences?

No difference of note.
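For what it is worth, a regression version of the first recipe might look like the sketch below. It assumes the BostonHousing dataset from mlbench and *method="lm"*, neither of which is used in the post. The two changes are the metric (RMSE instead of Accuracy) and the evaluation step (postResample() instead of confusionMatrix()), since accuracy and confusion matrices only apply to classification.

```r
# load libraries
library(caret)
library(mlbench)
# load a regression dataset (numeric outcome: medv)
data(BostonHousing)
set.seed(7)
validation_index <- createDataPartition(BostonHousing$medv, p=0.80, list=FALSE)
validation <- BostonHousing[-validation_index,]
training <- BostonHousing[validation_index,]
# train using a regression metric
set.seed(7)
control <- trainControl(method="cv", number=10)
fit.lm <- train(medv~., data=training, method="lm", metric="RMSE", trControl=control)
print(fit.lm)
# estimate skill on the validation dataset with regression measures
predictions <- predict(fit.lm, newdata=validation)
skill <- postResample(pred=predictions, obs=validation$medv)
print(skill)
```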

Hm…

Error: Metric Accuracy not applicable for regression models

Error in `[.data.frame`(validation, , 1:60) : undefined columns selected

1. & 2. fixed, still getting errors:

Error in confusionMatrix.default(final_predictions, validation$n5) : the data cannot have more levels than the reference

If I do a print(final_predictions), I get:

3 6 9 11 18 19 21 26 29 39 47 50 61 63 74 92 94 95 97 99 101 104 111 114

M R M R R R M R M R M R R R R R R R R M M R M M

117 120 126 129 137 138 145 146 148 155 160 164 170 171 173 191 206

M M M M M M M M M M M M M R M M M

The last data point has a row index of 206. The original data from Sonar has an end index of 208.

So where is the unseen data here? Are the row indexes independent of each other?

How to predict one step of unseen data in the future with example 2?

Dear Jason, thanks for the precious post.

I’m going a bit too far maybe, but I’ve been wondering recently if it is possible to save a trained model in some kind of standard format so that it can be (1) sent over a network and (2) parsed even by a different language (I dunno, Java for example).

Are you aware of anything similar, for R or (possibly) other machine learning / statistics suites?

Apologies if the question is considered off-topic.

I believe there are standard model formats, but I am not an expert in them.

See the Predictive Model Markup Language

https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

Hi Jason – Thanks for the example! This came in very handy and just in time for me.

I’m glad to hear that!

Hi Jason

Great website! Thank you for the very valuable information.

I have a short question concerning the “standalone model”. Why is it necessary to use it after using the caret one? I thought both would be equivalent (i.e. caret using the randomForest one in this example of “rf”-method) and am now a bit confused.

Cheers,

Chris

Caret is a wrapper for the models that can help us find which model to use.

Once we know which model to use, we can use it directly without the caret wrapper – if we want.

Thanks!

Hi Jason,

Excellent work and quite helpful for those of us who are in learning mode. Just one question here. Suppose I have 100 data points and I divide them into train and test (80 and 20). I fit the model and see an accuracy of 75%. My question is how do I implement this model so that when the 101st and 102nd data points arrive, this model runs and provides me with a classification?

I believe this post shows you how to save the model and load it to make predictions on new data.

Perhaps this post will make things clearer:

https://machinelearningmastery.com/train-final-machine-learning-model/

Hi Jason.

The “finalize machine learning models” article above is helpful, and I can’t wait to try it out. However, I notice that it’s making one prediction. Is there a way to get multiple predictions at once, based on criteria, and give them as a list? Like this:

Item 1…result

Item 2…result

Yes, the predict function will support this by default.

Dear Sir,

I am doing a dissertation on load balancing in distributed computing systems, in which I want to predict future incoming jobs on the basis of past load information given in a real-time dataset. So how can I apply this technique in my work?

This might be a good place for you to start:

https://machinelearningmastery.com/start-here/#timeseries

Hello Jason, I have a question, maybe you can help me.

I am using caret, and I need to print the function of a random forest best model. I am talking about all the trees, of which there are 100.

I know with model$finalModel I can print the function of a neural network or a tree with rpart, but this command doesn’t print all the trees of a random forest or the function of boosted trees.

Do you know how I can do it?

Thanks

Sorry, I have not printed all of the trees from a random forest before. Perhaps there is a third party tool to help?

Hey Jason! Thanks for the post. I have a question to you.

I have built a highly accurate random forest model for a student enrollment prediction project for, let’s say, this year. Now I have saved the model and want to use it for data belonging to, let’s assume, last year. But the number of variables in my last year’s data has changed. How should I approach this situation? Is there any way to scale this model to fit the new case?

Thank you!

This process will help you work through your problem systematically:

https://machinelearningmastery.com/start-here/#process

Thanks Jason!

Hello guys! What ML tool (regression model) would be appropriate for predicting call center call volume?

This process will help you work through your predictive modeling problem:

https://machinelearningmastery.com/start-here/#process

Hi Jason,

Many thanks for this superb post!

I would like to know if the following process (essentially what you did in this post) is considered best practice or common practice in ML and DS:

1. Perform cross-validation to tune hyperparameters.

2. Use the optimised hyperparameters to train on the ENTIRE training dataset to arrive at an optimal accuracy.

Also, in part 3 on saving and loading the model, I am encountering problems with the doMC package in R, and I understand that doMC is not usable on Windows. Could you suggest an alternative workaround for this?

Thank you so much

Yes, you can learn more here:

https://machinelearningmastery.com/train-final-machine-learning-model/

The doMC package might not be supported on Windows. Perhaps comment it out for now?
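If you do want a parallel backend on Windows, one possible alternative is the doParallel package, which is not used in the post. The sketch below assumes doParallel is installed and uses two worker processes, an arbitrary number chosen for illustration.

```r
# load the alternative parallel backend
library(doParallel)
# start a cluster of worker processes (works on Windows, macOS and Linux)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
# confirm how many workers are registered
workers <- getDoParWorkers()
print(workers)

# ... calls to caret's train() here can make use of the registered backend ...

# shut the workers down and return to sequential execution
stopCluster(cl)
registerDoSEQ()
```

These lines would stand in for the *library(doMC)* and *registerDoMC(cores=8)* lines in recipe 3. The randomForest() call in that recipe runs the same way with or without a registered backend.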