 # How to Develop a Random Forest Ensemble in Python

Last Updated on April 27, 2021

Random forest is an ensemble machine learning algorithm.

It is perhaps the most popular and widely used machine learning algorithm given its good or excellent performance across a wide range of classification and regression predictive modeling problems.

It is also easy to use given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

In this tutorial, you will discover how to develop a random forest ensemble for classification and regression.

After completing this tutorial, you will know:

• Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
• How to use the random forest ensemble for classification and regression with scikit-learn.
• How to explore the effect of random forest model hyperparameters on model performance.


Let’s get started.

• Update Aug/2020: Added a common questions section.

How to Develop a Random Forest Ensemble in Python
Photo by Sheila Sund, some rights reserved.

## Tutorial Overview

This tutorial is divided into four parts; they are:

1. Random Forest Algorithm
2. Random Forest Scikit-Learn API
    1. Random Forest for Classification
    2. Random Forest for Regression
3. Random Forest Hyperparameters
    1. Explore Number of Samples
    2. Explore Number of Features
    3. Explore Number of Trees
    4. Explore Tree Depth
4. Common Questions

## Random Forest Algorithm

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for classification and regression problems.

In bagging, a number of decision trees are created, each from a different bootstrap sample of the training dataset. A bootstrap sample is drawn from the training dataset with replacement, meaning that a given example (row) may appear more than once in the sample.
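As a rough illustration of sampling with replacement, the sketch below draws one bootstrap sample of row indices with NumPy. The dataset size of 10 rows and the seed are arbitrary values assumed for this example; they are not part of the tutorial.

```python
# draw a bootstrap sample of row indices (sampling with replacement)
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, assumed for repeatability
n_rows = 10                     # hypothetical training set size

# each draw can pick any row index, so the same row may be chosen more than once
indices = rng.integers(0, n_rows, size=n_rows)
# rows that were never drawn are absent from this bootstrap sample
left_out = sorted(set(range(n_rows)) - set(indices.tolist()))

print('bootstrap indices:', indices.tolist())
print('rows not selected:', left_out)
```

Each tree in a bagged ensemble would be fit on the rows selected by a different draw like this one.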

Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

Predictions from the trees are averaged across all decision trees resulting in better performance than any single tree in the model.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest’s prediction

— Page 199, Applied Predictive Modeling, 2013.

A prediction on a regression problem is the average of the prediction across the trees in the ensemble. A prediction on a classification problem is the majority vote for the class label across the trees in the ensemble.

• Regression: Prediction is the average prediction across the decision trees.
• Classification: Prediction is the majority vote class label predicted across the decision trees.

As with bagging, each tree in the forest casts a vote for the classification of a new sample, and the proportion of votes in each class across the ensemble is the predicted probability vector.

— Page 387, Applied Predictive Modeling, 2013.
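The two aggregation rules can be sketched with a few lines of NumPy. The per-tree predictions below are made-up values for a single new row, used only to show the arithmetic of averaging and voting as described above.

```python
# combine per-tree predictions: average for regression, majority vote for classification
import numpy as np

# hypothetical predictions from five trees for one new row
tree_regression_preds = np.array([2.0, 3.0, 4.0, 3.0, 3.0])
tree_class_preds = np.array([1, 0, 1, 1, 0])

# regression: the forest prediction is the mean of the tree predictions
forest_regression = tree_regression_preds.mean()

# classification: count the votes for each class label
labels, counts = np.unique(tree_class_preds, return_counts=True)
forest_class = labels[np.argmax(counts)]   # label with the most votes
vote_proportions = counts / counts.sum()   # share of votes per class label

print('regression prediction:', forest_regression)
print('classification prediction:', forest_class)
print('vote proportions:', dict(zip(labels.tolist(), vote_proportions.tolist())))
```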

Random forest involves constructing a large number of decision trees from bootstrap samples from the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or variables) at each split point in the construction of trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. […] But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

— Page 320, An Introduction to Statistical Learning with Applications in R, 2014.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.

Perhaps the most important hyperparameter to tune for the random forest is the number of random features to consider at each split point.

Random forests’ tuning parameter is the number of randomly selected predictors, k, to choose from at each split, and is commonly referred to as mtry. In the regression context, Breiman (2001) recommends setting mtry to be one-third of the number of predictors.

— Page 199, Applied Predictive Modeling, 2013.

A good heuristic for regression is to set this hyperparameter to 1/3 the number of input features.

• num_features_for_split = total_input_features / 3

For classification problems, Breiman (2001) recommends setting mtry to the square root of the number of predictors.

— Page 387, Applied Predictive Modeling, 2013.

A good heuristic for classification is to set this hyperparameter to the square root of the number of input features.

• num_features_for_split = sqrt(total_input_features)
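Both heuristics are simple arithmetic. The sketch below works them through for a dataset with 20 input features, the size used later in this tutorial. Rounding down to a whole number of features is an assumption made here, since neither heuristic says how to round.

```python
# worked example of the two feature-subset heuristics
from math import sqrt

total_input_features = 20  # matches the synthetic datasets used later

# regression: one third of the input features (rounded down, at least 1)
regression_features = max(1, total_input_features // 3)
# classification: square root of the input features (rounded down, at least 1)
classification_features = max(1, int(sqrt(total_input_features)))

print('regression: %d features per split' % regression_features)
print('classification: %d features per split' % classification_features)
```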

Another important hyperparameter to tune is the depth of the decision trees. Deeper trees are often more overfit to the training data, but also less correlated, which in turn may improve the performance of the ensemble. Depths from 1 to 10 levels may be effective.

Finally, the number of decision trees in the ensemble can be set. Often, this is increased until no further improvement is seen.


## Random Forest Scikit-Learn API

Random Forest ensembles can be implemented from scratch, although this can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Random Forest for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:
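The original listing is not reproduced here. A minimal version check might look like this:

```python
# check scikit-learn version
import sklearn

print(sklearn.__version__)
```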

Running the script will print your version of scikit-learn.

Your version should be 0.22 or higher, as the max_samples argument used later in this tutorial was added in that release. If not, you must upgrade your version of the scikit-learn library.

Random Forest is provided via the RandomForestRegressor and RandomForestClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop a Random Forest ensemble for both classification and regression tasks.

### Random Forest for Classification

In this section, we will look at using Random Forest for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.
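The listing itself is missing from this copy of the tutorial. The sketch below recreates it using the make_classification() arguments quoted by a reader in the comments (15 informative features, 5 redundant features, and a random seed of 3).

```python
# create and summarize the synthetic binary classification dataset
from sklearn.datasets import make_classification

# 1,000 examples and 20 input features; remaining arguments as quoted in the comments
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```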

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a random forest algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
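A sketch of this evaluation is shown below, assuming the same dataset arguments quoted in the comments. The random_state passed to the cross-validation object is an arbitrary choice made here so that the folds are repeatable.

```python
# evaluate a random forest classifier with repeated stratified k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# define the dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)
# define the model with default hyperparameters
model = RandomForestClassifier()
# three repeats of 10-fold cross-validation gives 30 accuracy scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# report mean and standard deviation of accuracy
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```

Passing n_jobs=-1 to cross_val_score() should spread the folds across all available CPU cores if the run is slow.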

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves a classification accuracy of about 90.5 percent.

We can also use the random forest model as a final model and make predictions for classification.

First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.
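The original listing is not available in this copy, so the following is a minimal sketch. As a stand-in for genuinely new data, it reuses the first row of the dataset; in an application the row would come from elsewhere.

```python
# fit a final random forest classifier and make a prediction
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# define the dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)
# fit the model on all available data
model = RandomForestClassifier()
model.fit(X, y)

# stand-in for a new row of data: predict() expects a 2D array of shape (rows, features)
row = X[0].reshape(1, -1)
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])
```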

Running the example fits the random forest ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using random forest for classification, let’s look at the API for regression.

### Random Forest for Regression

In this section, we will look at using random forests for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.
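The listing is missing from this copy of the tutorial. In the sketch below, only the number of examples and features comes from the text. The values for n_informative, noise, and random_state are assumptions chosen for illustration.

```python
# create and summarize the synthetic regression dataset
from sklearn.datasets import make_regression

# 1,000 examples and 20 input features; other arguments are assumed values
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```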

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a random forest algorithm on this dataset.

As in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE values (those closer to zero, such as -10 rather than -100) are better and a perfect model has a MAE of 0.

The complete example is listed below.
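A sketch of this evaluation follows. The dataset arguments beyond the number of examples and features, and the random_state of the cross-validation object, are assumptions made for this example.

```python
# evaluate a random forest regressor with repeated k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import RandomForestRegressor

# define the dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)
# define the model with default hyperparameters
model = RandomForestRegressor()
# three repeats of 10-fold cross-validation gives 30 scores
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn reports MAE as a negative number so that larger is better
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)

# report mean and standard deviation of the (negative) MAE
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```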

Running the example reports the mean and standard deviation MAE of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves a MAE of about 90.

We can also use the random forest model as a final model and make predictions for regression.

First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.
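The original listing is not available here, so the following is a minimal sketch. It reuses the first row of the dataset as a stand-in for new data, and the dataset arguments other than the number of examples and features are assumed values.

```python
# fit a final random forest regressor and make a prediction
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# define the dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)
# fit the model on all available data
model = RandomForestRegressor()
model.fit(X, y)

# stand-in for a new row of data: predict() expects a 2D array of shape (rows, features)
row = X[0].reshape(1, -1)
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])
```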

Running the example fits the random forest ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using the scikit-learn API to evaluate and use random forest ensembles, let’s look at configuring the model.

## Random Forest Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the random forest ensemble and their effect on model performance.

### Explore Number of Samples

Each decision tree in the ensemble is fit on a bootstrap sample drawn from the training dataset.

This can be turned off by setting the “bootstrap” argument to False, if you desire. In that case, the whole training dataset will be used to train each decision tree. This is not recommended.

The “max_samples” argument can be set to a float between 0 and 1 to control the size of the bootstrap sample used to train each decision tree, expressed as a fraction of the size of the training dataset.

For example, if the training dataset has 100 rows, the max_samples argument could be set to 0.5 and each decision tree will be fit on a bootstrap sample with (100 * 0.5) or 50 rows of data.

A smaller sample size will make trees more different, and a larger sample size will make the trees more similar. Setting max_samples to None will make the sample size the same size as the training dataset and this is the default.

The example below demonstrates the effect of different bootstrap sample sizes from 10 percent to 100 percent on the random forest algorithm.
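The original listing is missing from this copy, so the following is a sketch of one way to run the comparison. To keep the run time short it uses a single repeat of 5-fold cross-validation, which is an assumption of this sketch. The helper names (get_dataset, get_models, evaluate_model) are also illustrative, and plotting is left out.

```python
# explore random forest bootstrap sample size effect on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# cross-validation settings assumed for this sketch; raise them for more stable estimates
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # synthetic dataset using the arguments quoted in the comments
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per bootstrap sample size from 10 percent to 100 percent
    models = dict()
    for pct in range(10, 101, 10):
        frac = pct / 100.0
        # None means the sample is the same size as the training dataset (100 percent)
        max_samples = None if pct == 100 else frac
        models['%.1f' % frac] = RandomForestClassifier(max_samples=max_samples)
    return models

def evaluate_model(model, X, y):
    # return one accuracy score per cross-validation fold
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = dict()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results[name] = scores
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
```

The results dictionary holds the per-fold scores for each sample size, so it can be handed to a plotting library to draw a box and whisker plot.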

Running the example first reports the mean accuracy for each bootstrap sample size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that using a bootstrap sample size that is equal to the size of the training dataset achieves the best results on this dataset.

This is the default and it should probably be used in most cases.

A box and whisker plot is created for the distribution of accuracy scores for each bootstrap sample size.

In this case, we can see a general trend that the larger the sample, the better the performance of the model.

You might like to extend this example and set max_samples to an integer number of rows instead of a float percentage of the training dataset size. Note that scikit-learn raises an error if an integer max_samples is larger than the number of rows in the training dataset, so bootstrap samples larger than the training dataset cannot be explored through this argument.

Box Plot of Random Forest Bootstrap Sample Size vs. Classification Accuracy

### Explore Number of Features

The number of features that is randomly sampled for each split point is perhaps the most important hyperparameter to configure for random forest.

It is set via the max_features argument and defaults to the square root of the number of input features. In this case, for our test dataset, this would be sqrt(20) or about four features.

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 7 and would expect a small value, around four, to perform well based on the heuristic.
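The original listing is missing from this copy, so the following is a sketch of the comparison for max_features values from 1 to 7. It uses a single repeat of 5-fold cross-validation to keep the run time short, which is an assumption of this sketch, and the helper names are illustrative.

```python
# explore random forest number of features effect on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# cross-validation settings assumed for this sketch; raise them for more stable estimates
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # synthetic dataset using the arguments quoted in the comments
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per number of features considered at each split point (1 to 7)
    models = dict()
    for n in range(1, 8):
        models[str(n)] = RandomForestClassifier(max_features=n)
    return models

def evaluate_model(model, X, y):
    # return one accuracy score per cross-validation fold
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = dict()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results[name] = scores
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
```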

Running the example first reports the mean accuracy for each feature set size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that a value between three and five would be appropriate, confirming the sensible default of four on this dataset. A value of five might even be better given the smaller standard deviation in classification accuracy as compared to a value of three or four.

A box and whisker plot is created for the distribution of accuracy scores for each feature set size.

We can see a trend in performance rising and peaking with values between three and five and falling again as larger feature set sizes are considered.

Box Plot of Random Forest Feature Set Size vs. Classification Accuracy

### Explore Number of Trees

The number of trees is another key hyperparameter to configure for the random forest.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Both bagging and random forest algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 and 1,000.
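The original listing is missing from this copy, so the following is a sketch of the comparison. The specific tree counts (10, 50, 100, 500, and 1,000) are assumed values within the stated range. A single repeat of 5-fold cross-validation is used to keep the run time manageable, which is also an assumption of this sketch.

```python
# explore random forest number of trees effect on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# cross-validation settings assumed for this sketch; raise them for more stable estimates
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # synthetic dataset using the arguments quoted in the comments
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per ensemble size (assumed values between 10 and 1,000)
    models = dict()
    for n in [10, 50, 100, 500, 1000]:
        models[str(n)] = RandomForestClassifier(n_estimators=n)
    return models

def evaluate_model(model, X, y):
    # return one accuracy score per cross-validation fold
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = dict()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results[name] = scores
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
```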

Running the example first reports the mean accuracy for each configured number of trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance rises and stays flat after about 100 trees. Mean accuracy scores fluctuate across 100, 500, and 1,000 trees and this may be statistical noise.

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

Box Plot of Random Forest Ensemble Size vs. Classification Accuracy

### Explore Tree Depth

A final interesting hyperparameter is the maximum depth of decision trees used in the ensemble.

By default, trees are constructed to an arbitrary depth and are not pruned. This is a sensible default, although we can also explore fitting trees with different fixed depths.

The maximum tree depth can be specified via the max_depth argument and is set to None (no maximum depth) by default.

The example below explores the effect of random forest maximum tree depth on model performance.
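The original listing is missing from this copy, so the following is a sketch of the comparison. The depths tried here (1 to 7, plus None for no limit) are assumed values, and a single repeat of 5-fold cross-validation is used to keep the run time short.

```python
# explore random forest maximum tree depth effect on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# cross-validation settings assumed for this sketch; raise them for more stable estimates
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # synthetic dataset using the arguments quoted in the comments
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per maximum depth; None (the default) places no limit on depth
    models = dict()
    for depth in [1, 2, 3, 4, 5, 6, 7, None]:
        models[str(depth)] = RandomForestClassifier(max_depth=depth)
    return models

def evaluate_model(model, X, y):
    # return one accuracy score per cross-validation fold
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = dict()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results[name] = scores
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
```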

Running the example first reports the mean accuracy for each configured maximum tree depth.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that larger depth results in better model performance, with the default of no maximum depth achieving the best performance on this dataset.

A box and whisker plot is created for the distribution of accuracy scores for each configured maximum tree depth.

In this case, we can see a trend of improved performance with increase in tree depth, supporting the default of no maximum depth.

Box Plot of Random Forest Maximum Tree Depth vs. Classification Accuracy

## Common Questions

In this section, we will take a closer look at some common sticking points you may have with the random forest ensemble procedure.

Q. What algorithm should be used in the ensemble?

Random forest is designed to be an ensemble of decision tree algorithms.

Q. How many ensemble members should be used?

The number of trees should be increased until no further improvement in performance is seen on your dataset.

As a starting point, we suggest using at least 1,000 trees. If the cross-validation performance profiles are still improving at 1,000 trees, then incorporate more trees until performance levels off.

— Page 200, Applied Predictive Modeling, 2013.

Q. Won’t the ensemble overfit with too many trees?

No. Random forest ensembles are very unlikely to overfit in general.

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing [the number of trees] does not cause the random forest sequence to overfit …

— Page 596, The Elements of Statistical Learning, 2016.

Q. How large should the bootstrap sample be?

It is good practice to make the bootstrap sample as large as the original dataset size.

That is 100% the size or an equal number of rows as the original dataset.

Q. How many features should be chosen at each split point?

The best practice is to test a suite of different values and discover what works best for your dataset.

As a heuristic, you can use:

• Classification: Square root of the number of features.
• Regression: One third of the number of features.

Q. What problems are well suited to random forest?

Random forest is known to work well or even best on a wide range of classification and regression problems. Try it and see.

The authors make grand claims about the success of random forests: “most accurate”, “most interpretable”, and the like. In our experience random forests do remarkably well, with very little tuning required.

— Page 590, The Elements of Statistical Learning, 2016.


## Summary

In this tutorial, you discovered how to develop random forest ensembles for classification and regression.

Specifically, you learned:

• Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
• How to use the random forest ensemble for classification and regression with scikit-learn.
• How to explore the effect of random forest model hyperparameters on model performance.

Do you have any questions?


### 35 Responses to How to Develop a Random Forest Ensemble in Python

1. dcart April 20, 2020 at 6:20 am #

Hi Jason

Hope you are doing well in this time of lock down.

Great articles as usual. Although I like to apply RF in my regression problem. However, I have issue with memory as the data is huge.

Any advise using RF for huge data set?

Thanks
Dennis

• Jason Brownlee April 20, 2020 at 7:36 am #

Perhaps prepare a prototype on a small sample of data first to see if it is effective.

2. Elie April 23, 2020 at 4:05 am #

Hi Jason, Are you planning a new book on EnsembleS?

• Jason Brownlee April 23, 2020 at 6:12 am #

Not sure at this stage.

Are there ensemble topics you’d like me to write about?

3. Elie April 24, 2020 at 5:17 am #

probably advanced stacking and how to win kaggle/data science competitions

• Jason Brownlee April 24, 2020 at 5:54 am #

4. Stelios April 26, 2020 at 4:34 pm #

Thank you so much for your great support!

• Jason Brownlee April 27, 2020 at 5:30 am #

You’re welcome.

5. David Sanchez May 10, 2020 at 11:25 pm #

Hi Jason,

I’m implementing a Random Forest and I’m getting a shifted time-series in the predictions. If I build the model for predicting e.g. 4 steps ahead, my time-series of predictions seems 4 steps shifted to the right comparing to my time-series of observations. If I try to predict 16 steps ahead, it seems 16 steps shifted.

Any idea why this could be happening?

• Jason Brownlee May 11, 2020 at 6:00 am #

Yes, it sounds like the model has learned a persistence (no skill) forecast. E.g. it predicts the input as the output.

• David Sanchez May 14, 2020 at 6:28 pm #

Hi Jason,

I’m also wondering, if I try to build a model where my train set has more variables than my test set, how should I proceed?

As far as I’ve seen about it, I should recreate those missing variables in my test dataframe and set them as 0.

• Jason Brownlee May 15, 2020 at 5:57 am #

The number of variables (columns) must be the same in train and test sets.

6. Grzegorz Kępisty May 26, 2020 at 10:46 pm #

Very nice tutorial of RF usage!
It is really practical to know good practices on those models – from my experience Random Forests are very competitive in real industrial applications! (often outperforms such competitors as Artificial Neural Networks).
Regards!

• Jason Brownlee May 27, 2020 at 7:54 am #

Thanks!

Agreed. XGBoost more so.

7. DIIO May 28, 2020 at 10:03 am #

Hi Jason,
“This means that larger negative MAE are better and a perfect model has a MAE of 0.”
Thank you! enjoying a lot this stuff.

• Jason Brownlee May 28, 2020 at 1:24 pm #

It is correct.

-10 is greater than -100.

0 is greater than -10.

8. manuela October 16, 2020 at 4:26 pm #

Hello Jason, Please I have a question
I have the following situation that is already programmed with Logistic regression, I have tried the same program with Random Forest in order to check how it could improve the accuracy.
Actually, the accuracy was improved, but I don’t know if it is logical to use the Random Forest in my problem case.

My case study is as follow :
Based on a market dataset, I need to predict if a customer will buy a product or not depending on his prior history. I.e to know how much a customer bought the same product previously, and how much he just check it without buying it

The used data has the following structure:

Id clients CurrectProd P1+ P1- P2+ P2- P3+ P3- … PN+ PN- Output
10 CL1 P1, P3 6 1 0 0 8 2 0 0 1
11 CL1 P1, P2 7 1 5 2 0 0 0 0 1

with:
CurrentProd: means a list of products that I need to know if a customer will purchase,
P1+: mean how many time à client buy product 1,
P1-: refers to the number that a client checked a product 1 without buying it.

columns present all products existing in the market so that I have data with too many features (min 200 PRODUCT) and at each row the most of those row take value 0 (becose there are not belong to CurrentPRod

So I want to know if the random forest could be used in this situation

PS: I must use the data as it is without any change in features or structure

• manuela October 16, 2020 at 4:43 pm #

| Id | clients | CurrectProd | P1+ | P1- | P2+ | P2- | P3+ | P3- | … | PN+ | PN- | Output |
|----|---------|-------------|-----|-----|-----|-----|-----|-----|---|-----|-----|--------|
| 10 | CL1 | P1, P3 | 6 | 1 | 0 | 0 | 8 | 2 | … | 0 | 0 | 1 |
| 11 | CL1 | P1, P2 | 7 | 1 | 5 | 2 | 0 | 0 | … | 0 | 0 | 1 |

• Jason Brownlee October 17, 2020 at 5:58 am #

Perhaps try it and compare results.

The key will be to find an appropriate representation for the problem. This may give you ideas (replace site with product):
https://machinelearningmastery.com/faq/single-faq/how-to-develop-forecast-models-for-multiple-sites

• manuela October 17, 2020 at 6:45 am #

I have already tried it, and it gives me a good result,
but I want to know if it is logical to use it with 200 features (Product1, Product2….)

• Jason Brownlee October 17, 2020 at 1:41 pm #

Use the features that result in the best performance, regardless of how many.

9. dinesh December 21, 2020 at 7:29 pm #

how to decide these paramters
n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3

• Jason Brownlee December 22, 2020 at 6:42 am #

This defines the test problem, it is completely arbitrary.

The idea is that you replace this with your own dataset.

• dinesh December 25, 2020 at 1:51 pm #

what are the best practice , also if i want to learn about the meaning of these parameter.

• Jason Brownlee December 26, 2020 at 5:08 am #

Best practice for test problems? I don’t understand.

Best practice for hyperparameter tuning is in the above example, e.g. grid search.

10. Jim February 17, 2021 at 6:03 pm #

Perfect!

• Jason Brownlee February 18, 2021 at 5:12 am #

Thanks!

• Alex September 8, 2021 at 12:29 am #

Thanks for such comprehensive article,

I wonder to know is there any way to find out that under which condition my model has wrong prediction, i mean is there any way to find (range) values for features that tell me that the prediction of the learned model is not reliable. Or let machine learn that when the prediction could not be reliable.

I think it could be, some how , the other way around of machine learning ,isnt it?

Any suggestions for it?

• Adrian Tam September 8, 2021 at 2:02 am #

Quite impossible to know because a machine learning model is learned from the data provided. You’re simply asking what the provided data did not cover.

11. Francisco Pérez Liébana March 12, 2021 at 11:36 pm #

Hello Jason,

Thanks for your articles, they are very useful!

Do you know how I can get a graphic representation of the trees in the trained model? I was trying to use export_graphviz in sklearn, but since the “cross_val_score” function fits the estimator on its own, I don’t know how to use the export_graphviz function.

Francisco

• Jason Brownlee March 13, 2021 at 5:33 am #

I believe it’s possible but I have not done it before, sorry Francisco.

12. Ilan Sharfer May 12, 2021 at 11:59 pm #

Hi Jason,

Thanks for the clear and useful introduction.

I have a question on how the Random Forest algorithm handles missing features.

For example suppose the data set is a 24H time series, for which I want to build a classifier.

Some of the features are available only in daytime, some only in night-time, and some others are partly unavailable.

Ilan

• Jason Brownlee May 13, 2021 at 6:03 am #

Perhaps try a suite of approaches for handling the missing data and discover what works well or best for your dataset.

13. Ahmad Afif Aulia Hariz July 13, 2021 at 10:49 pm #

Thanks a lot for the article.

I’ve just build my own RF Regressor, i have (2437, 45) shape. I have run my model and got r-square about 0.7

I want to improve it into 0.95. Any suggestion?
Thanks for help!