How to Develop a Random Forest Ensemble in Python

Last Updated on September 7, 2020

Random forest is an ensemble machine learning algorithm.

It is one of the most popular and widely used machine learning algorithms, given its good or excellent performance across a wide range of classification and regression predictive modeling problems.

It is also easy to use given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

In this tutorial, you will discover how to develop a random forest ensemble for classification and regression.

After completing this tutorial, you will know:

  • Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
  • How to use the random forest ensemble for classification and regression with scikit-learn.
  • How to explore the effect of random forest model hyperparameters on model performance.

Let’s get started.

  • Update Aug/2020: Added a common questions section.
How to Develop a Random Forest Ensemble in Python
Photo by Sheila Sund, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Random Forest Algorithm
  2. Random Forest Scikit-Learn API
    1. Random Forest for Classification
    2. Random Forest for Regression
  3. Random Forest Hyperparameters
    1. Explore Number of Samples
    2. Explore Number of Features
    3. Explore Number of Trees
    4. Explore Tree Depth
  4. Common Questions

Random Forest Algorithm

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for classification and regression problems.

In bagging, a number of decision trees are created where each tree is created from a different bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset where a sample may appear more than once in the sample, referred to as sampling with replacement.
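To make sampling with replacement concrete, here is a minimal sketch that uses only the Python standard library. The ten row indices and the seed are made up for illustration and are not part of the tutorial's dataset.

```python
# draw one bootstrap sample from a tiny, made-up set of row indices
import random

random.seed(1)  # fixed seed so the sketch is repeatable (an assumption)
rows = list(range(10))  # stand-in for the row indices of a training dataset

# sampling with replacement: the same row may be drawn more than once
bootstrap = random.choices(rows, k=len(rows))

print('Bootstrap sample:', bootstrap)
print('Distinct rows drawn:', len(set(bootstrap)))
```

The bootstrap sample has the same number of rows as the original, but it usually contains repeats, so some rows are left out of any given sample.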

Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

Predictions from the trees are averaged across all decision trees resulting in better performance than any single tree in the model.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest’s prediction

— Page 199, Applied Predictive Modeling, 2013.

A prediction on a regression problem is the average of the prediction across the trees in the ensemble. A prediction on a classification problem is the majority vote for the class label across the trees in the ensemble.

  • Regression: Prediction is the average prediction across the decision trees.
  • Classification: Prediction is the majority vote class label predicted across the decision trees.
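The two rules above can be sketched in a few lines of plain Python. The per-tree outputs below are hypothetical values chosen for illustration; they do not come from a fitted model.

```python
# combine hypothetical per-tree predictions into one ensemble prediction
from collections import Counter
from statistics import mean

# regression: average the numeric predictions from each tree
tree_outputs = [2.5, 3.0, 2.0, 3.5]  # made-up predictions from four trees
forest_regression = mean(tree_outputs)
print('Regression prediction:', forest_regression)

# classification: take the class label with the most votes
tree_votes = [1, 0, 1, 1, 0]  # made-up class labels from five trees
vote_counts = Counter(tree_votes)
forest_class = vote_counts.most_common(1)[0][0]
print('Classification prediction:', forest_class)

# the share of votes per class can be read as a probability vector
vote_share = {label: count / len(tree_votes) for label, count in vote_counts.items()}
print('Vote share:', vote_share)
```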

As with bagging, each tree in the forest casts a vote for the classification of a new sample, and the proportion of votes in each class across the ensemble is the predicted probability vector.

— Page 387, Applied Predictive Modeling, 2013.

Random forest involves constructing a large number of decision trees from bootstrap samples from the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or variables) at each split point in the construction of trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. […] But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

— Page 320, An Introduction to Statistical Learning with Applications in R, 2014.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.

Perhaps the most important hyperparameter to tune for the random forest is the number of random features to consider at each split point.

Random forests’ tuning parameter is the number of randomly selected predictors, k, to choose from at each split, and is commonly referred to as mtry. In the regression context, Breiman (2001) recommends setting mtry to be one-third of the number of predictors.

— Page 199, Applied Predictive Modeling, 2013.

A good heuristic for regression is to set this hyperparameter to 1/3 the number of input features.

  • num_features_for_split = total_input_features / 3

For classification problems, Breiman (2001) recommends setting mtry to the square root of the number of predictors.

— Page 387, Applied Predictive Modeling, 2013.

A good heuristic for classification is to set this hyperparameter to the square root of the number of input features.

  • num_features_for_split = sqrt(total_input_features)
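As a quick worked example, both heuristics can be computed for a dataset with 20 input features, the size used later in this tutorial. The variable names are hypothetical, and rounding down to a whole number of features is an assumption of this sketch.

```python
# worked example of the two heuristics for 20 input features
from math import sqrt

total_input_features = 20

# regression heuristic: one third of the input features (rounded down, at least 1)
mtry_regression = max(1, total_input_features // 3)

# classification heuristic: square root of the input features (rounded down, at least 1)
mtry_classification = max(1, int(sqrt(total_input_features)))

print('Regression:', mtry_regression)
print('Classification:', mtry_classification)
```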

Another important hyperparameter to tune is the depth of the decision trees. Deeper trees are often more overfit to the training data, but also less correlated, which in turn may improve the performance of the ensemble. Depths from 1 to 10 levels may be effective.

Finally, the number of decision trees in the ensemble can be set. Often, this is increased until no further improvement is seen.

Random Forest Scikit-Learn API

Random Forest ensembles can be implemented from scratch, although this can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Random Forest for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:
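The script itself is not reproduced in this copy of the tutorial. A minimal version, assuming scikit-learn is installed under its usual import name, might look like this:

```python
# check scikit-learn version
import sklearn

print(sklearn.__version__)
```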

Running the script will print your version of scikit-learn.

Your version should be 0.22 or higher, as the max_samples argument used later in this tutorial was added in that release. If not, you must upgrade your version of the scikit-learn library.

Random Forest is provided via the RandomForestRegressor and RandomForestClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop a Random Forest ensemble for both classification and regression tasks.

Random Forest for Classification

In this section, we will look at using Random Forest for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.
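The listing is not reproduced in this copy of the tutorial, so the following is a minimal sketch rather than the original code. The text only fixes the number of examples and features; the n_informative, n_redundant, and random_state values are assumptions chosen for illustration.

```python
# synthetic binary classification dataset
from sklearn.datasets import make_classification

# n_informative, n_redundant and random_state are assumed values
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```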

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a random forest algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
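A minimal sketch of this evaluation is shown below. It is not the original listing; the dataset arguments other than the number of examples and features, and the random_state values, are assumptions.

```python
# evaluate a random forest classifier with repeated stratified k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define dataset (n_informative, n_redundant and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

# n_jobs only controls parallelism; all modeling hyperparameters keep their defaults
model = RandomForestClassifier(n_jobs=-1)

# three repeats of 10-fold cross-validation give 30 accuracy scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# report mean and standard deviation of accuracy
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```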

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves a classification accuracy of about 90.5 percent.

We can also use the random forest model as a final model and make predictions for classification.

First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.
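The original listing is not shown here, so this is a sketch under stated assumptions. In particular, the first row of the training data stands in for a new row of data, and the dataset arguments beyond the number of examples and features are assumed.

```python
# fit a final random forest classifier and make a prediction
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# define dataset (n_informative, n_redundant and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

# fit the model on all available data
model = RandomForestClassifier()
model.fit(X, y)

# stand-in for a new row of data; predict() expects a 2D array of rows
row = X[:1]
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])
```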

Running the example fits the random forest ensemble model on the entire dataset; the model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using random forest for classification, let’s look at the API for regression.

Random Forest for Regression

In this section, we will look at using random forests for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.
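As the listing is missing from this copy of the tutorial, here is a minimal sketch of what it might look like. The n_informative, noise, and random_state values are assumptions; only the number of examples and features comes from the text.

```python
# synthetic regression dataset
from sklearn.datasets import make_regression

# n_informative, noise and random_state are assumed values
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```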

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a random forest algorithm on this dataset.

As we did in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better and a perfect model has an MAE of 0.

The complete example is listed below.
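The sketch below stands in for the missing listing. It assumes the same illustrative dataset arguments as before and uses the 'neg_mean_absolute_error' scoring string, which is how scikit-learn exposes the negated MAE.

```python
# evaluate a random forest regressor with repeated k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# define dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)

# n_jobs only controls parallelism; all modeling hyperparameters keep their defaults
model = RandomForestRegressor(n_jobs=-1)

# three repeats of 10-fold cross-validation give 30 scores
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)

# scores are negated MAE values, so they are zero or below
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```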

Running the example reports the mean and standard deviation MAE of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves an MAE of about 90.

We can also use the random forest model as a final model and make predictions for regression.

First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.
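Again, this is a sketch and not the original listing. The first row of the training data is used as a stand-in for a new row, and the dataset arguments beyond the number of examples and features are assumed.

```python
# fit a final random forest regressor and make a prediction
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# define dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)

# fit the model on all available data
model = RandomForestRegressor(n_jobs=-1)
model.fit(X, y)

# stand-in for a new row of data; predict() expects a 2D array of rows
row = X[:1]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])
```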

Running the example fits the random forest ensemble model on the entire dataset; the model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using the scikit-learn API to evaluate and use random forest ensembles, let’s look at configuring the model.

Random Forest Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the random forest ensemble and their effect on model performance.

Explore Number of Samples

Each decision tree in the ensemble is fit on a bootstrap sample drawn from the training dataset.

This can be turned off by setting the “bootstrap” argument to False, if you desire. In that case, the whole training dataset will be used to train each decision tree. This is not recommended.

The “max_samples” argument can be set to a float between 0 and 1 to control the percentage of the size of the training dataset to make the bootstrap sample used to train each decision tree.

For example, if the training dataset has 100 rows, the max_samples argument could be set to 0.5 and each decision tree will be fit on a bootstrap sample with (100 * 0.5) or 50 rows of data.

A smaller sample size will make trees more different, and a larger sample size will make the trees more similar. Setting max_samples to “None” will make the sample size the same size as the training dataset and this is the default.

The example below demonstrates the effect of different bootstrap sample sizes from 10 percent to 100 percent on the random forest algorithm.
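The original listing and its plot are not reproduced here, so the following is a sketch under stated assumptions. The helper names are hypothetical, the dataset arguments are illustrative, and the cross-validation is reduced to a single repeat of five folds to keep the runtime short; the tutorial itself uses three repeats of 10 folds.

```python
# explore random forest bootstrap sample size on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# assumed values for speed; set to 10 and 3 to match the tutorial's procedure
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # n_informative, n_redundant and random_state are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per sample size from 10 percent to 100 percent
    models = {}
    for i in range(1, 11):
        fraction = i / 10
        # None means a bootstrap sample as large as the training dataset
        max_samples = None if i == 10 else fraction
        models['%.1f' % fraction] = RandomForestClassifier(max_samples=max_samples)
    return models

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = {}
for name, model in get_models().items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```

The scores collected in results can be passed to a plotting library to draw one box per sample size.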

Running the example first reports the mean accuracy for each bootstrap sample size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that using a bootstrap sample size that is equal to the size of the training dataset achieves the best results on this dataset.

This is the default and it should probably be used in most cases.

A box and whisker plot is created for the distribution of accuracy scores for each bootstrap sample size.

In this case, we can see a general trend that the larger the sample, the better the performance of the model.

You might like to extend this example by setting max_samples to an integer number of rows instead of a float percentage of the training dataset size. Note that scikit-learn does not accept an integer value larger than the number of rows in the training dataset, so the bootstrap sample cannot be made larger than the training dataset this way.

Box Plot of Random Forest Bootstrap Sample Size vs. Classification Accuracy

Explore Number of Features

The number of features that is randomly sampled for each split point is perhaps the most important hyperparameter to configure for random forest.

It is set via the max_features argument and, for RandomForestClassifier, defaults to the square root of the number of input features. In this case, for our test dataset, this would be sqrt(20) or about four features.

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 7 and would expect a small value, around four, to perform well based on the heuristic.
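A minimal sketch of this experiment follows; it is not the original listing. The helper names are hypothetical, the dataset arguments are illustrative, and the cross-validation is cut down to one repeat of five folds for speed, where the tutorial uses three repeats of 10 folds.

```python
# explore random forest number of features on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# assumed values for speed; set to 10 and 3 to match the tutorial's procedure
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # n_informative, n_redundant and random_state are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # one model per feature set size from 1 to 7
    return {str(i): RandomForestClassifier(max_features=i) for i in range(1, 8)}

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = {}
for name, model in get_models().items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```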

Running the example first reports the mean accuracy for each feature set size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that a value between three and five would be appropriate, confirming the sensible default of four on this dataset. A value of five might even be better given the smaller standard deviation in classification accuracy as compared to a value of three or four.

A box and whisker plot is created for the distribution of accuracy scores for each feature set size.

We can see a trend in performance rising and peaking with values between three and five and falling again as larger feature set sizes are considered.

Box Plot of Random Forest Feature Set Size vs. Classification Accuracy

Explore Number of Trees

The number of trees is another key hyperparameter to configure for the random forest.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Both bagging and random forest algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees with values from 10 to 1,000.
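The sketch below stands in for the missing listing. The specific tree counts inside the stated range, the helper names, and the dataset arguments are assumptions, and the cross-validation is reduced to one repeat of five folds because the larger ensembles are slow to fit.

```python
# explore random forest number of trees on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# assumed values for speed; set to 10 and 3 to match the tutorial's procedure
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # n_informative, n_redundant and random_state are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # assumed grid of ensemble sizes between 10 and 1,000 trees
    n_trees = [10, 50, 100, 500, 1000]
    # n_jobs=-1 fits the trees of each forest in parallel
    return {str(n): RandomForestClassifier(n_estimators=n, n_jobs=-1) for n in n_trees}

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = {}
for name, model in get_models().items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```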

Running the example first reports the mean accuracy for each configured number of trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance rises and stays flat after about 100 trees. Mean accuracy scores fluctuate across 100, 500, and 1,000 trees and this may be statistical noise.

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

Box Plot of Random Forest Ensemble Size vs. Classification Accuracy

Explore Tree Depth

A final interesting hyperparameter is the maximum depth of decision trees used in the ensemble.

By default, trees are constructed to an arbitrary depth and are not pruned. This is a sensible default, although we can also explore fitting trees with different fixed depths.

The maximum tree depth can be specified via the max_depth argument and is set to None (no maximum depth) by default.

The example below explores the effect of random forest maximum tree depth on model performance.
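Here is a minimal sketch of such an experiment, offered in place of the missing listing. The depths tried (1 to 7, plus None for no limit), the helper names, and the dataset arguments are assumptions, and the cross-validation uses one repeat of five folds for speed.

```python
# explore random forest maximum tree depth on performance
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# assumed values for speed; set to 10 and 3 to match the tutorial's procedure
N_SPLITS, N_REPEATS = 5, 1

def get_dataset():
    # n_informative, n_redundant and random_state are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=3)

def get_models():
    # assumed grid of depths; None means no maximum depth (the default)
    depths = [1, 2, 3, 4, 5, 6, 7, None]
    return {str(d): RandomForestClassifier(max_depth=d) for d in depths}

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
results = {}
for name, model in get_models().items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```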

Running the example first reports the mean accuracy for each configured maximum tree depth.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that larger depth results in better model performance, with the default of no maximum depth achieving the best performance on this dataset.

A box and whisker plot is created for the distribution of accuracy scores for each configured maximum tree depth.

In this case, we can see a trend of improved performance with increase in tree depth, supporting the default of no maximum depth.

Box Plot of Random Forest Maximum Tree Depth vs. Classification Accuracy

Common Questions

In this section, we will take a closer look at some common sticking points you may have with the random forest ensemble procedure.

Q. What algorithm should be used in the ensemble?

Random forest is designed to be an ensemble of decision tree algorithms.

Q. How many ensemble members should be used?

The number of trees should be increased until no further improvement in performance is seen on your dataset.

As a starting point, we suggest using at least 1,000 trees. If the cross-validation performance profiles are still improving at 1,000 trees, then incorporate more trees until performance levels off.

— Page 200, Applied Predictive Modeling, 2013.

Q. Won’t the ensemble overfit with too many trees?

No. Adding more trees is very unlikely to cause a random forest ensemble to overfit.

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing [the number of trees] does not cause the random forest sequence to overfit …

— Page 596, The Elements of Statistical Learning, 2016.

Q. How large should the bootstrap sample be?

It is good practice to make the bootstrap sample as large as the original dataset size.

That is, 100% of the size, or the same number of rows as the original dataset.

Q. How many features should be chosen at each split point?

The best practice is to test a suite of different values and discover what works best for your dataset.

As a heuristic, you can use:

  • Classification: Square root of the number of features.
  • Regression: One third of the number of features.

Q. What problems are well suited to random forest?

Random forest is known to work well or even best on a wide range of classification and regression problems. Try it and see.

The authors make grand claims about the success of random forests: “most accurate”, “most interpretable”, and the like. In our experience random forests do remarkably well, with very little tuning required.

— Page 590, The Elements of Statistical Learning, 2016.

Summary

In this tutorial, you discovered how to develop random forest ensembles for classification and regression.

Specifically, you learned:

  • Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
  • How to use the random forest ensemble for classification and regression with scikit-learn.
  • How to explore the effect of random forest model hyperparameters on model performance.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

21 Responses to How to Develop a Random Forest Ensemble in Python

  1. dcart April 20, 2020 at 6:20 am #

    Hi Jason

    Hope you are doing well in this time of lock down.

    Great articles as usual. Although I like to apply RF in my regression problem. However, I have issue with memory as the data is huge.

    Any advise using RF for huge data set?

    Thanks
    Dennis

  2. Elie April 23, 2020 at 4:05 am #

    Hi Jason, Are you planning a new book on EnsembleS?

    • Jason Brownlee April 23, 2020 at 6:12 am #

      Not sure at this stage.

      Are there ensemble topics you’d like me to write about?

  3. Elie April 24, 2020 at 5:17 am #

    probably advanced stacking and how to win kaggle/data science competitions

  4. Stelios April 26, 2020 at 4:34 pm #

    Thank you so much for your great support!

  5. David Sanchez May 10, 2020 at 11:25 pm #

    Hi Jason,

    I’m implementing a Random Forest and I’m getting a shifted time-series in the predictions. If I build the model for predicting e.g. 4 steps ahead, my time-series of predictions seems 4 steps shifted to the right comparing to my time-series of observations. If I try to predict 16 steps ahead, it seems 16 steps shifted.

    Any idea why this could be happening?

    Thanks for all your tutorials!

    • Jason Brownlee May 11, 2020 at 6:00 am #

      Yes, it sounds like the model has learned a persistence (no skill) forecast. E.g. it predicts the input as the output.

      • David Sanchez May 14, 2020 at 6:28 pm #

        Hi Jason,

        Thanks a lot for your reply.

        I’m also wondering, if I try to build a model where my train set has more variables than my test set, how should I proceed?

        As far as I’ve seen about it, I should recreate those missing variables in my test dataframe and set them as 0.

        • Jason Brownlee May 15, 2020 at 5:57 am #

          The number of variables (columns) must be the same in train and test sets.

  6. Grzegorz Kępisty May 26, 2020 at 10:46 pm #

    Very nice tutorial of RF usage!
    It is really practical to know good practices on those models – from my experience Random Forests are very competitive in real industrial applications! (often outperforms such competitors as Artificial Neural Networks).
    Regards!

  7. DIIO May 28, 2020 at 10:03 am #

    Hi Jason,
    Please, check it:
    “This means that larger negative MAE are better and a perfect model has a MAE of 0.”
    Thank you! enjoying a lot this stuff.

    • Jason Brownlee May 28, 2020 at 1:24 pm #

      It is correct.

      -10 is greater than -100.

      0 is greater than -10.

  8. manuela October 16, 2020 at 4:26 pm #

    Hello Jason, Please I have a question
    I have the following situation that is already programmed with Logistic regression, I have tried the same program with Random Forest in order to check how it could improve the accuracy.
    Actually, the accuracy was improved, but I don’t know if it is logical to use the Random Forest in my problem case.

    My case study is as follow :
    Based on a market dataset, I need to predict if a customer will buy a product or not depending on his prior history. I.e to know how much a customer bought the same product previously, and how much he just check it without buying it

    The used data has the following structure:

    Id clients CurrectProd P1+ P1- P2+ P2- P3+ P3- … PN+ PN- Output
    10 CL1 P1, P3 6 1 0 0 8 2 0 0 1
    11 CL1 P1, P2 7 1 5 2 0 0 0 0 1

    with:
    CurrentProd: means a list of products that I need to know if a customer will purchase,
    P1+: mean how many time à client buy product 1,
    P1-: refers to the number that a client checked a product 1 without buying it.

    columns present all products existing in the market so that I have data with too many features (min 200 PRODUCT) and at each row the most of those row take value 0 (becose there are not belong to CurrentPRod

    So I want to know if the random forest could be used in this situation

    PS: I must use the data as it is without any change in features or structure

    • manuela October 16, 2020 at 4:43 pm #

      Id | clients | CurrentProd | P1+ | P1- | P2+ | P2- | P3+ | P3- | … | PN+ | PN- | Output
      10 | CL1 | P1, P3 | 6 | 1 | 0 | 0 | 8 | 2 | … | 0 | 0 | 1
      11 | CL1 | P1, P2 | 7 | 1 | 5 | 2 | 0 | 0 | … | 0 | 0 | 1

    • Jason Brownlee October 17, 2020 at 5:58 am #

      Perhaps try it and compare results.

      The key will be to find an appropriate representation for the problem. This may give you ideas (replace site with product):
      https://machinelearningmastery.com/faq/single-faq/how-to-develop-forecast-models-for-multiple-sites

      • manuela October 17, 2020 at 6:45 am #

        Thanks for your quick replay
        I have already tried it, and it gives me a good result,
        but I want to know if it is logical to use it with 200 features (Product1, Product2….)

        • Jason Brownlee October 17, 2020 at 1:41 pm #

          Use the features that result in the best performance, regardless of how many.
