Ensemble Machine Learning Algorithms in Python with scikit-learn

Ensembles can give you a boost in accuracy on your dataset.

In this post you will discover how you can create some of the most powerful types of ensembles in Python using scikit-learn.

This case study will step you through Boosting, Bagging and Majority Voting and show you how you can continue to ratchet up the accuracy of the models on your own datasets.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.

Ensemble Machine Learning Algorithms in Python with scikit-learn
Photo by The United States Army Band, some rights reserved.

Combine Model Predictions Into Ensemble Predictions

The three most popular methods for combining the predictions from different models are:

  • Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
  • Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
  • Voting. Building multiple models (typically of differing types) and simple statistics (like calculating the mean) are used to combine predictions.

This post will not explain each of these methods.

It assumes you are generally familiar with machine learning algorithms and ensemble methods and that you are looking for information on how to create ensembles in Python.

About the Recipes

Each recipe in this post was designed to be standalone. This is so that you can copy-and-paste it into your project and start using it immediately.

A standard classification problem from the UCI Machine Learning Repository is used to demonstrate each ensemble algorithm. This is the Pima Indians onset of Diabetes dataset. It is a binary classification problem where all of the input variables are numeric and have differing scales.

Each ensemble algorithm is demonstrated using 10-fold cross-validation, a standard technique used to estimate the performance of any machine learning algorithm on unseen data.
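
The recipes all share the same loading and evaluation scaffold, sketched below. The local filename pima-indians-diabetes.data.csv and the column names are assumptions (point them at your copy of the dataset), and shuffle=True is included so the listing also runs on recent scikit-learn versions, which require it whenever a random_state is set on KFold.

# Load the Pima Indians diabetes dataset and prepare 10-fold cross-validation (sketch)
import pandas
from sklearn import model_selection

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]  # the 8 numeric input variables
Y = array[:, 8]    # the binary class label
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)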

Bagging Algorithms

Bootstrap Aggregation or bagging involves taking multiple samples from your training dataset (with replacement) and training a model for each sample.

The final output prediction is averaged across the predictions of all of the sub-models.

The three bagging models covered in this section are as follows:

  1. Bagged Decision Trees
  2. Random Forest
  3. Extra Trees

1. Bagged Decision Trees

Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning.

The example below shows how to use the BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier). A total of 100 trees are created.
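
A minimal sketch of the recipe, under the same assumptions about the local dataset file:

# Bagged Decision Trees for Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# bag 100 unpruned decision trees; the tree is passed positionally so the call
# works with both older (base_estimator) and newer (estimator) keyword names
cart = DecisionTreeClassifier()
model = BaggingClassifier(cart, n_estimators=100, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())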

Running the example, we get a robust estimate of model accuracy.

2. Random Forest

Random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point in the construction of each tree, only a random subset of features is considered for each split.

You can construct a Random Forest model for classification using the RandomForestClassifier class.

The example below demonstrates Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.
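
A sketch of the recipe, under the same dataset assumptions:

# Random Forest Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# 100 trees, with each split chosen from a random subset of 3 features
model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())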

Running the example provides a mean estimate of classification accuracy.

3. Extra Trees

Extra Trees are another modification of bagging where random trees are constructed from samples of the training dataset.

You can construct an Extra Trees model for classification using the ExtraTreesClassifier class.

The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen from 7 random features.
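
A sketch of the recipe, under the same dataset assumptions:

# Extra Trees Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# 100 trees, with splits chosen from 7 random features
model = ExtraTreesClassifier(n_estimators=100, max_features=7, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())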

Running the example provides a mean estimate of classification accuracy.

Boosting Algorithms

Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence.

Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction.

The two most common boosting ensemble machine learning algorithms are:

  1. AdaBoost
  2. Stochastic Gradient Boosting

1. AdaBoost

AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.

You can construct an AdaBoost model for classification using the AdaBoostClassifier class.

The example below demonstrates the construction of 30 decision trees in sequence using the AdaBoost algorithm.
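
A sketch of the recipe, under the same dataset assumptions:

# AdaBoost Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# 30 boosted decision trees (shallow trees / stumps by default)
model = AdaBoostClassifier(n_estimators=30, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())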

Running the example provides a mean estimate of classification accuracy.

2. Stochastic Gradient Boosting

Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles.

You can construct a Gradient Boosting model for classification using the GradientBoostingClassifier class.

The example below demonstrates Stochastic Gradient Boosting for classification with 100 trees.
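
A sketch of the recipe, under the same dataset assumptions:

# Stochastic Gradient Boosting Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# 100 boosted trees with default depth and learning rate
model = GradientBoostingClassifier(n_estimators=100, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())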

Running the example provides a mean estimate of classification accuracy.

Voting Ensemble

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.

It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.

The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called stacking (stacked generalization) and is currently not provided in scikit-learn.

You can create a voting ensemble model for classification using the VotingClassifier class.

The code below provides an example of combining the predictions of logistic regression, classification and regression trees, and support vector machines for a classification problem.
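
A sketch of the recipe, under the same dataset assumptions (the three sub-models are left at their default configurations):

# Voting Ensemble for Classification (sketch)
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# create the sub-models
estimators = []
estimators.append(('logistic', LogisticRegression()))
estimators.append(('cart', DecisionTreeClassifier()))
estimators.append(('svm', SVC()))
# combine the sub-models with a hard (majority) vote
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())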

Running the example provides a mean estimate of classification accuracy.

Summary

In this post you discovered ensemble machine learning algorithms for improving the performance of models on your problems.

You learned about:

  • Bagging Ensembles including Bagged Decision Trees, Random Forest and Extra Trees.
  • Boosting Ensembles including AdaBoost and Stochastic Gradient Boosting.
  • Voting Ensembles for averaging the predictions of any arbitrary models.

Do you have any questions about ensemble machine learning algorithms or ensembles in scikit-learn? Ask your questions in the comments and I will do my best to answer them.


49 Responses to Ensemble Machine Learning Algorithms in Python with scikit-learn

  1. Vishal June 9, 2016 at 7:37 am #

    Informative post.

    Once you identify and finalize the best ensemble model, how would you score a future sample with such a model? I am referring to the productionization of the model in a database.

    Would you use something like the pickle package? Or is there a way to spell out the scoring algorithm (IF-ELSE rules for decision tree, or the actual formula for logistic regression) and use the formula for future scoring purposes?

    • Jason Brownlee June 14, 2016 at 8:21 am #

      After you finalize the model you can incorporate it into an application or service.

      The model would be used directly.

      It would be provided input patterns and make predictions that you could use in some operational way.

  2. Christos Chatzoglou July 26, 2016 at 10:38 pm #

    Very well written post! Is there an email address where we could send you some questions about the ensemble methods?

  3. Kamagne September 19, 2016 at 11:24 pm #

    HI Jason,
    I’ve copied and pasted your Random Forest code and the result is:
    0.766814764183.
    We don’t have the same result, could you tell me why?
    Thanks

    • Jason Brownlee September 20, 2016 at 8:33 am #

      I try to fix the random number seed, Kamagne, but sometimes things get through.

      The performance of any machine learning algorithm is stochastic; we estimate performance within a range. It is best practice to run a given configuration many times and take the mean and standard deviation, reporting the range of expected performance on unseen data.

      Does that help?

  4. Natheer Alabsi November 18, 2016 at 1:26 pm #

    The ensembled model gave lower accuracy compared to the individual models. Isn’t that strange?

    • Jason Brownlee November 19, 2016 at 8:41 am #

      This can happen. Ensembles are not a sure thing for better performance.

  5. Marido December 20, 2016 at 11:06 pm #

    In your Bagging Classifier you used Decision Tree Classifier as your base estimator. I was wondering what other algorithms can be used as base estimators?

    • Jason Brownlee December 21, 2016 at 8:38 am #

      Good question Marido,

      With bagging, the goal is to use a method that has high variance when trained on different data.

      Un-pruned decision trees can do this (and can be made to do it even better – see random forest).
      Another idea would be knn with a small k.

      In fact, take your favorite algorithm and configure it to have a high variance, then bag it.

      I hope that helps.

  6. Natheer Alabsi January 26, 2017 at 9:00 pm #

    Hi Jason,

    Thanks so much for your insightful replies.

    What I understand is that ensembles improve the result if they make different mistakes.
    In my results for the two models below, the first model performs well on one class while the second performs well on the other. When I ensemble them, I get lower accuracy. Is that possible or am I doing something wrong?

    GBC
    [[922035    266]
     [     2      5]]

    cart
    [[895914  26387]
     [     0      7]]

    This is how I am doing it.
    estimators = []
    model1 = GradientBoostingClassifier()
    estimators.append(('GBC', model1))
    model2 = DecisionTreeClassifier()
    estimators.append(('cart', model2))

    ensemble = VotingClassifier(estimators)
    ensemble.fit(X_train, Y_train)
    predictions = ensemble.predict(X_test)
    accuracy1 = accuracy_score(Y_test, predictions)

    • Jason Brownlee January 27, 2017 at 12:04 pm #

      Hi Natheer,

      There is no guarantee for ensembles to lift performance.

      Perhaps you can try a more sophisticated method to combine the predictions or perhaps try more or different submodels.

  7. Djib January 27, 2017 at 8:41 am #

    Hello Jason,

    Many thanks for your post. It is possible to have two different base estimators (i.e. decision tree, knn) in AdaBoost model?

    Thank you.

    • Jason Brownlee January 27, 2017 at 12:27 pm #

      Not that I am aware in Python Djib.

      I don’t see why not in theory. You could develop your own implementation and see how it fares.

      • Djib January 28, 2017 at 4:49 am #

        Thank you Jason

  8. Adnan Ardhian March 13, 2017 at 2:06 pm #

    Hello Jason,

    Why is max_features 3? What is it exactly? Thank you 😀

    • Jason Brownlee March 14, 2017 at 8:13 am #

      Just for demonstration purposes. It limits the number of selected features to 3.

  9. Paige April 4, 2017 at 8:56 am #

    I found one slight mishap in the voting ensemble code. I noticed that on lines 22 and 23 it has

    model3 = SVC()
    estimators.append(('svm', model2))

    I believe it should say model3 instead of model2, as model3 is the SVM model.

  10. Sam May 15, 2017 at 11:46 pm #

    Hey! Thanks for the great post.

    I would like to use voting with SVM as you did, however scaling the data for SVM gives me better results and it’s simply much faster. And from here comes the question: how can I scale just part of the data for algorithms such as SVM, and leave non-scaled data for XGB/Random Forest, and on top of it use ensembles? I have tried using Pipeline to first scale the data for SVM and then use Voting but it seems not to work. Any comment would be helpful.

    • Jason Brownlee May 16, 2017 at 8:47 am #

      Great question.

      You could leave the pipeline behind and put the pieces together yourself with a scaled version of the data for SVM and non-scaled for the other methods.
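
      For example, a rough sketch of that manual approach (the dataset, the StandardScaler and the soft-vote averaging of predicted probabilities here are illustrative assumptions, not code from the post):

      # Scale inputs only for the SVM, leave them raw for the random forest,
      # then average the predicted class probabilities as a manual soft vote
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC
      from sklearn.ensemble import RandomForestClassifier

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

      scaler = StandardScaler().fit(X_train)
      svm = SVC(probability=True).fit(scaler.transform(X_train), y_train)
      forest = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_train, y_train)

      proba = (svm.predict_proba(scaler.transform(X_test)) + forest.predict_proba(X_test)) / 2.0
      predictions = proba.argmax(axis=1)
      print((predictions == y_test).mean())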

  11. Sayak Paul June 15, 2017 at 10:02 pm #

    Could we take it further and build a Neural Network model with Keras and use it in the Voting based Ensemble learning? (After wrapping the Neural Network model into a Scikit Learn classifier)

  12. Sanga June 16, 2017 at 5:20 am #

    Hi Jason…Thanks for the wonderful post. Is there a way to perform bagging on neural net regression? A sample code or example would be much appreciated. Thanks.

    • Jason Brownlee June 16, 2017 at 8:07 am #

      Yes. It may perform quite well. Sorry I do not have an example.

  13. Peter August 21, 2017 at 4:15 pm #

    Hi Jason!

    Thank you for the great post!

    I would like to make soft voting for a convolutional neural network and a GRU recurrent neural network, but I have 2 problems.

    1: I have 2 different training datasets to train my networks on: vectors of prosodic data, and word embeddings of textual data. The 2 training sets are stored in two different np.arrays with different dimensionality. Is there any way to make VotingClassifier accept X1 and X2 instead of a single X? (y is the same for both X1 and X2, and naturally they are of the same length)

    2: Where do I compile my networks?

    ensemble = VotingClassifier(estimators)
    ensemble.compile()
    ?

    I would be very grateful for any help.
    Thank you!

    • Jason Brownlee August 21, 2017 at 4:28 pm #

      You have a few options.

      You could get each model to generate the predictions, save them to a file, then have another model learn how to combine the predictions, perhaps with or without the original inputs.

      If the original inputs are high-dimensional (images and sequences), you could try training a neural net to combine the predictions as part of training each sub-model. You can merge each network using a Merge layer in Keras (deep learning library), if your sub-models were also developed in Keras.

      I hope that helps as a start.

      • Peter August 22, 2017 at 1:05 pm #

        Thanks for the quick reply! I’ve already tried the layer merging. It works, but it is not giving good results because one of my feature sets yields significantly better recognition accuracy than the other.

        But the first solution looks good! I will try and implement it!

        Thanks for the help!

  14. Mehmood August 25, 2017 at 10:14 pm #

    Can I use more than one base estimator in Bagging and AdaBoost, e.g. Bagging(kNN, Logistic Regression, etc.)?

    • Jason Brownlee August 26, 2017 at 6:46 am #

      You can do anything, but really this is only practical with bagging.

      You want to pick base estimators that have low bias/high variance, like k=1 kNN, decision trees without pruning or decision stumps, etc.
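
      For example, a minimal sketch of bagging a k=1 kNN (the dataset and parameters here are only illustrative):

      # Bag a high-variance base estimator: 1-nearest-neighbor
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import KFold, cross_val_score
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.ensemble import BaggingClassifier

      X, y = load_breast_cancer(return_X_y=True)
      # the base estimator is passed positionally for compatibility across scikit-learn versions
      model = BaggingClassifier(KNeighborsClassifier(n_neighbors=1), n_estimators=100, random_state=7)
      kfold = KFold(n_splits=10, shuffle=True, random_state=7)
      print(cross_val_score(model, X, y, cv=kfold).mean())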

  15. amar September 2, 2017 at 6:07 pm #

    What is the meaning of seed here? Can you explain the importance of the seed and how changes to the seed will affect the model? Thanks

  16. sajana September 26, 2017 at 7:57 pm #

    Hi sir,

    I am unable to run the gradient boosting code on my dataset.

    Please help me.

    See my parameters.

    AGE Haemoglobin RBC Hct Mcv Mch Mchc Platelets WBC Granuls Lymphocytes Monocytes disese
    3 9.6 4.2 28.2 67 22.7 33.9 3.75 5800 44 50 6 Positive
    11 12.1 4.3 33.7 78 28.2 36 2.22 6100 73 23 4 Positive
    2 9.5 4.1 27.9 67 22.8 34 3.64 5100 64 32 4 Positive
    4 9.9 3.9 27.8 71 25.3 35.6 2.06 4900 65 32 3 Positive
    14 10.7 4.4 31.2 70 24.2 34.4 3 7600 50 44 6 Negative
    7 9.8 4.2 28 66 23.2 35.1 1.95 3800 28 63 9 Negative
    8 14.6 5 39.2 77 28.7 37.2 3.06 4400 58 36 6 Negative
    4 12 4.5 33.3 74 26.5 35.9 5.28 9500 40 54 6 Negative
    2 11.2 4.6 32.7 70 24.1 34.3 2.98 8800 38 58 4 Negative
    1 9.1 4 27.2 67 22.4 33.3 3.6 5300 40 55 5 Negative
    11 14.8 5.8 42.5 72 25.1 34.8 4.51 17200 75 20 5 Negative

  17. sajana September 29, 2017 at 10:46 am #

    Hi sir,

    The problem is, first I want to balance the dataset with the SMOTE algorithm but it is not happening.

    See my code and help me.

    import pandas
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    from imblearn.over_sampling import SMOTE

    print(__doc__)

    sns.set()

    # Define some color for the plotting
    almost_black = '#262626'
    palette = sns.color_palette()
    data = ('mdata.csv')
    dataframe = pandas.read_csv(data)
    array = dataframe.values
    X = array[:,0:12]
    y = array[:,12]

    #print (X_train, Y_train)

    # Generate the dataset
    #X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
    # n_informative=3, n_redundant=1, flip_y=0,
    # n_features=10, n_clusters_per_class=1,
    # n_samples=500, random_state=10)
    print (X, y)
    plt.show()

    # Instantiate a PCA object for the sake of easy visualisation
    pca = PCA(n_components=2)
    # Fit and transform x to visualise inside a 2D feature space
    X_vis = pca.fit_transform(X)

    # Apply regular SMOTE
    sm = SMOTE(kind='regular')
    X_resampled, y_resampled = sm.fit_sample(X, y)
    X_res_vis = pca.transform(X_resampled)

    # Two subplots, unpack the axes array immediately
    f, (ax1, ax2) = plt.subplots(1, 2)

    ax1.scatter(X_vis[y == 0, 0], X_vis[y == 0, 1], label="Class #0", alpha=0.5,
    edgecolor=almost_black, facecolor=palette[0], linewidth=0.15)
    ax1.scatter(X_vis[y == 1, 0], X_vis[y == 1, 1], label="Class #1", alpha=0.5,
    edgecolor=almost_black, facecolor=palette[2], linewidth=0.15)
    ax1.set_title('Original set')

    ax2.scatter(X_res_vis[y_resampled == 0, 0], X_res_vis[y_resampled == 0, 1],
    label="Class #0", alpha=.5, edgecolor=almost_black,
    facecolor=palette[0], linewidth=0.15)
    ax2.scatter(X_res_vis[y_resampled == 1, 0], X_res_vis[y_resampled == 1, 1],
    label="Class #1", alpha=.5, edgecolor=almost_black,
    facecolor=palette[2], linewidth=0.15)
    ax2.set_title('SMOTE ALGORITHM - Malaria regular')
    """
    data after resample
    """
    print (X_resampled, y_resampled)
    plt.show()

    • Jason Brownlee September 30, 2017 at 7:32 am #

      Sorry, I cannot debug your code for you. Perhaps you can post your code to stackoverflow?

  18. sajana September 29, 2017 at 10:55 am #

    I am getting the error:

    File “/usr/local/lib/python2.7/dist-packages/imblearn/over_sampling/smote.py”, line 360, in _sample_regular
    nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
    File “/home/sajana/.local/lib/python2.7/site-packages/sklearn/neighbors/base.py”, line 347, in kneighbors
    (train_size, n_neighbors)
    ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6

    • Jason Brownlee September 30, 2017 at 7:33 am #

      It looks like your k is larger than the number of instances in one class. You need to reduce k or increase the number of instances for the least represented class.

      I hope that helps.

  19. sajana October 5, 2017 at 7:43 pm #

    I cleared the error, sir.

    But when I work with GradientBoosting it doesn’t work, even though my dataset contains 2 classes as shown in the above discussion.

    error : binomial deviance require 2 classes

    and the code:
    import pandas
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    data = ('dataset160.csv')
    dataframe = pandas.read_csv(data)
    array = dataframe.values
    X = array[:,0:12]
    Y = array[:,12]
    print (X, Y)
    plt.show()
    seed = 7
    num_trees = 100
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
    results = model_selection.cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())
    print(results)

  20. sajana October 6, 2017 at 11:30 am #

    Sir, I already have a labelled dataset.

    I am unable to run the gradient boosting code on my dataset.

    Please help me.

    See my parameters.

    AGE Haemoglobin RBC Hct Mcv Mch Mchc Platelets WBC Granuls Lymphocytes Monocytes disese
    3 9.6 4.2 28.2 67 22.7 33.9 3.75 5800 44 50 6 Positive
    11 12.1 4.3 33.7 78 28.2 36 2.22 6100 73 23 4 Positive
    2 9.5 4.1 27.9 67 22.8 34 3.64 5100 64 32 4 Positive
    4 9.9 3.9 27.8 71 25.3 35.6 2.06 4900 65 32 3 Positive
    14 10.7 4.4 31.2 70 24.2 34.4 3 7600 50 44 6 Negative
    7 9.8 4.2 28 66 23.2 35.1 1.95 3800 28 63 9 Negative
    8 14.6 5 39.2 77 28.7 37.2 3.06 4400 58 36 6 Negative
    4 12 4.5 33.3 74 26.5 35.9 5.28 9500 40 54 6 Negative
    2 11.2 4.6 32.7 70 24.1 34.3 2.98 8800 38 58 4 Negative
    1 9.1 4 27.2 67 22.4 33.3 3.6 5300 40 55 5 Negative
    11 14.8 5.8 42.5 72 25.1 34.8 4.51 17200 75 20 5 Negative

  21. sajana October 12, 2017 at 7:04 pm #

    Thank you sir.
    Sir, I have a small doubt.

    Is the XGBoost algorithm or the SMOTEBoost algorithm better for handling skewed data?

    • Jason Brownlee October 13, 2017 at 5:45 am #

      Try both on your specific dataset and see what works best.

  22. sajana October 19, 2017 at 7:27 pm #

    Thank you so much sir,

    Finally, I have a doubt, sir.
    When I run the prediction code using AdaBoost I am getting 0.0% prediction accuracy.

    For the code:

    model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
    results = model_selection.cross_val_score(model, X, Y, cv=kfold)
    model.fit(X, Y)
    print('learning accuracy')
    print(results.mean())
    predictions = model.predict(A)
    print(predictions)
    accuracy1 = accuracy_score(B, predictions)
    print("Accuracy % is ")
    print(accuracy1*100)

    Is anything wrong in the code? I always get 0.0% accuracy, and precision and recall are also 0% for any ensemble like boosting or bagging.
    Kindly rectify, sir.

  23. Amos Bunde November 6, 2017 at 8:35 pm #

    Dr. Jason you ARE doing a great job in machine learning.
