How to Implement Bagging From Scratch With Python

By Jason Brownlee on August 13, 2019 in Code Algorithms From Scratch

Decision trees are a simple and powerful predictive modeling technique, but they suffer from high variance.

This means that trees can get very different results given different training data.

A technique to make decision trees more robust and to achieve better performance is called bootstrap aggregation or bagging for short.

In this tutorial, you will discover how to implement the bagging procedure with decision trees from scratch with Python.

After completing this tutorial, you will know:

How to create a bootstrap sample of your dataset.
How to make predictions with bootstrapped models.
How to apply bagging to your own predictive modeling problems.

Kick-start your project with my new book Machine Learning Algorithms From Scratch, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Jan/2017: Changed the calculation of fold_size in cross_validation_split() to always be an integer. Fixes issues with Python 3.
Update Feb/2017: Fixed a bug in build_tree.
Update Aug/2017: Fixed a bug in Gini calculation, added the missing weighting of group Gini scores by group size (thanks Michael!).
Update Aug/2018: Tested and updated to work with Python 3.6.

Descriptions

This section provides a brief description of Bootstrap Aggregation and the Sonar dataset that will be used in this tutorial.

Bootstrap Aggregation Algorithm

A bootstrap is a sample of a dataset with replacement.

This means that a new dataset is created from a random sample of an existing dataset where a given row may be selected and added more than once to the sample.

It is a useful approach to use when estimating values such as the mean for a broader dataset, when you only have a limited dataset available. By creating samples of your dataset and estimating the mean from those samples, you can take the average of those estimates and get a better idea of the true mean of the underlying problem.

This same approach can be used with machine learning algorithms that have a high variance, such as decision trees. A separate model is trained on each bootstrap sample of data and the average output of those models is used to make predictions. This technique is called bootstrap aggregation or bagging for short.

Variance means that an algorithm’s performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary.

The performance of high variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. Results are often better than a single decision tree.

Another benefit of bagging, in addition to improved performance, is that adding more trees does not by itself cause the ensemble to overfit. Trees can continue to be added until performance stops improving.

Sonar Dataset

The dataset we will use in this tutorial is the Sonar dataset.

This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders. There are 208 observations.

It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1. The output variable is a string “M” for mine and “R” for rock, which will need to be converted to integers 1 and 0.

By predicting the class with the most observations in the dataset (M or mines) the Zero Rule Algorithm can achieve an accuracy of 53%.
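
As a rough illustration of that baseline, the sketch below computes the Zero Rule accuracy for a list of class labels. The function name zero_rule_accuracy() is hypothetical, and the class counts of 111 mines and 97 rocks are an assumption (they are the commonly reported counts for this dataset and are consistent with the 53% figure above).

```python
# Zero Rule baseline: always predict the most common class.
def zero_rule_accuracy(labels):
	# the majority class is the label that occurs most often
	majority = max(set(labels), key=labels.count)
	correct = sum(1 for label in labels if label == majority)
	return majority, correct / float(len(labels)) * 100.0

# assumed class counts for the Sonar dataset (208 rows in total)
labels = ['M'] * 111 + ['R'] * 97
majority, accuracy = zero_rule_accuracy(labels)
print('Majority class: %s, Accuracy: %.1f%%' % (majority, accuracy))
```

Any model evaluated later in the tutorial should beat this figure to be considered skillful.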

You can learn more about this dataset at the UCI Machine Learning repository.

Download the dataset for free and place it in your working directory with the filename sonar.all-data.csv.

Tutorial

This tutorial is broken down into 2 parts:

Bootstrap Resample.
Sonar Dataset Case Study.

These steps provide the foundation that you need to implement and apply bootstrap aggregation with decision trees to your own predictive modeling problems.

1. Bootstrap Resample

Let’s start off by getting a strong idea of how the bootstrap method works.

We can create a new sample of a dataset by randomly selecting rows from the dataset and adding them to a new list. We can repeat this for a fixed number of rows or until the size of the new dataset matches a ratio of the size of the original dataset.

We can allow sampling with replacement by not removing the row that was selected so that it is available for future selections.

Below is a function named subsample() that implements this procedure. The randrange() function from the random module is used to select a random row index to add to the sample each iteration of the loop. The default size of the sample is the size of the original dataset.

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample
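
To get a feel for what sampling with replacement does, the sketch below draws one full-size sample from a contrived dataset of 1,000 distinct rows and counts how many distinct rows it contains. The dataset and the variable names are made up for illustration. In expectation, roughly 63% of the original rows appear in a full-size bootstrap sample, because the chance that a given row is never drawn approaches 1/e.

```python
from random import seed
from random import randrange

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample

seed(1)
# contrived dataset: 1,000 rows, each with a unique value
dataset = [[i] for i in range(1000)]
sample = subsample(dataset)
# rows that were selected at least once
unique_rows = len(set(row[0] for row in sample))
print('Sample size: %d' % len(sample))
print('Unique rows: %d (%.1f%%)' % (unique_rows, 100.0 * unique_rows / len(dataset)))
```

The rows that are never selected are often called out-of-bag rows.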

We can use this function to estimate the mean of a contrived dataset.

First, we can create a dataset with 20 rows and a single column of random numbers between 0 and 9 and calculate the mean value.

We can then make bootstrap samples of the original dataset, calculate the mean, and repeat this process until we have a list of means. Taking the average of these sample means should give us a robust estimate of the mean of the entire dataset.

The complete example is listed below.

Each bootstrap sample is created as a 10% sample of the original 20-observation dataset (or 2 observations). We then experiment by creating 1, 10, and 100 bootstrap samples of the original dataset, calculating the mean value of each, and then averaging all of those estimated mean values.

from random import seed
from random import randrange


# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample


# Calculate the mean of a list of numbers
def mean(numbers):
	return sum(numbers) / float(len(numbers))


seed(1)
# True mean
dataset = [[randrange(10)] for i in range(20)]
print('True Mean: %.3f' % mean([row[0] for row in dataset]))
# Estimated means
ratio = 0.10
for size in [1, 10, 100]:
	sample_means = list()
	for i in range(size):
		sample = subsample(dataset, ratio)
		sample_mean = mean([row[0] for row in sample])
		sample_means.append(sample_mean)
	print('Samples=%d, Estimated Mean: %.3f' % (size, mean(sample_means)))

Running the example prints the original mean value we aim to estimate.

We can then see the estimated mean from different numbers of bootstrap samples. With only 10 samples the estimate is still well off (3.300), while with 100 samples we achieve a good estimate of the mean.

True Mean: 4.450
Samples=1, Estimated Mean: 4.500
Samples=10, Estimated Mean: 3.300
Samples=100, Estimated Mean: 4.480

Instead of calculating the mean value, we can create a model from each subsample.

Next, let’s see how we can combine the predictions from multiple bootstrap models.

2. Sonar Dataset Case Study

In this section, we will apply the bagging algorithm to the Sonar dataset.

The example assumes that a CSV copy of the dataset is in the current working directory with the file name sonar.all-data.csv.

The dataset is first loaded, the string values are converted to numeric, and the output column is converted from strings to the integer values 0 and 1. This is achieved with the helper functions load_csv(), str_column_to_float() and str_column_to_int() to load and prepare the dataset.

We will use k-fold cross validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean of the k model scores. Classification accuracy will be used to evaluate each model. These behaviors are provided in the cross_validation_split(), accuracy_metric() and evaluate_algorithm() helper functions.

We will also use an implementation of the Classification and Regression Trees (CART) algorithm adapted for bagging. It includes the helper functions test_split() to split a dataset into groups, gini_index() to evaluate a split point, and get_split() to find an optimal split point. The functions to_terminal(), split() and build_tree() are used to create a single decision tree, predict() makes a prediction with a decision tree, and the subsample() function described in the previous step makes a subsample of the training dataset.

A new function named bagging_predict() is developed that is responsible for making a prediction with each decision tree and combining the predictions into a single return value. This is achieved by selecting the most common prediction from the list of predictions made by the bagged trees.
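
The voting step can be tried in isolation. The sketch below applies the same max-over-counts expression to a hand-made list of predictions. The function name majority_vote() and the example votes are made up for illustration.

```python
# Return the most common value in a list of predictions.
def majority_vote(predictions):
	# set() gives the candidate class values, count() scores each one
	return max(set(predictions), key=predictions.count)

# five hypothetical trees vote on a single row
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints 1
```

When two classes receive the same number of votes, the winner depends on the order in which the set is iterated, so using an odd number of trees on a binary problem avoids ties.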

Finally, a new function named bagging() is developed that is responsible for creating the samples of the training dataset, training a decision tree on each, then making predictions on the test dataset using the list of bagged trees.

The complete example is listed below.

# Bagging Algorithm on the Sonar dataset
from random import seed
from random import randrange
from csv import reader

# Load a CSV file
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		csv_reader = reader(file)
		for row in csv_reader:
			if not row:
				continue
			dataset.append(row)
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
	class_values = [row[column] for row in dataset]
	unique = set(class_values)
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for row in dataset:
		row[column] = lookup[row[column]]
	return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
	# count all samples at split point
	n_instances = float(sum([len(group) for group in groups]))
	# sum weighted Gini index for each group
	gini = 0.0
	for group in groups:
		size = float(len(group))
		# avoid divide by zero
		if size == 0:
			continue
		score = 0.0
		# score the group based on the score for each class
		for class_val in classes:
			p = [row[-1] for row in group].count(class_val) / size
			score += p * p
		# weight the group score by its relative size
		gini += (1.0 - score) * (size / n_instances)
	return gini

# Select the best split point for a dataset
def get_split(dataset):
	class_values = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = 999, 999, 999, None
	for index in range(len(dataset[0])-1):
		for row in dataset:
		# for i in range(len(dataset)):
		# 	row = dataset[randrange(len(dataset))]
			groups = test_split(index, row[index], dataset)
			gini = gini_index(groups, class_values)
			if gini < b_score:
				b_index, b_value, b_score, b_groups = index, row[index], gini, groups
	return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
	outcomes = [row[-1] for row in group]
	return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
	left, right = node['groups']
	del(node['groups'])
	# check for a no split
	if not left or not right:
		node['left'] = node['right'] = to_terminal(left + right)
		return
	# check for max depth
	if depth >= max_depth:
		node['left'], node['right'] = to_terminal(left), to_terminal(right)
		return
	# process left child
	if len(left) <= min_size:
		node['left'] = to_terminal(left)
	else:
		node['left'] = get_split(left)
		split(node['left'], max_depth, min_size, depth+1)
	# process right child
	if len(right) <= min_size:
		node['right'] = to_terminal(right)
	else:
		node['right'] = get_split(right)
		split(node['right'], max_depth, min_size, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root

# Make a prediction with a decision tree
def predict(node, row):
	if row[node['index']] < node['value']:
		if isinstance(node['left'], dict):
			return predict(node['left'], row)
		else:
			return node['left']
	else:
		if isinstance(node['right'], dict):
			return predict(node['right'], row)
		else:
			return node['right']

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
	predictions = [predict(tree, row) for tree in trees]
	return max(set(predictions), key=predictions.count)

# Bootstrap Aggregation Algorithm
def bagging(train, test, max_depth, min_size, sample_size, n_trees):
	trees = list()
	for i in range(n_trees):
		sample = subsample(train, sample_size)
		tree = build_tree(sample, max_depth, min_size)
		trees.append(tree)
	predictions = [bagging_predict(trees, row) for row in test]
	return(predictions)

# Test bagging on the sonar dataset
seed(1)
# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(len(dataset[0])-1):
	str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
max_depth = 6
min_size = 2
sample_size = 0.50
for n_trees in [1, 5, 10, 50]:
	scores = evaluate_algorithm(dataset, bagging, n_folds, max_depth, min_size, sample_size, n_trees)
	print('Trees: %d' % n_trees)
	print('Scores: %s' % scores)
	print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

A k value of 5 was used for cross-validation. Because fold_size is truncated to an integer, each fold holds int(208/5) = 41 records to be evaluated upon each iteration, and the remaining 3 records are not assigned to any fold.
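
The fold arithmetic can be checked directly. This small sketch repeats the fold_size calculation from cross_validation_split() for the 208 rows and 5 folds used here. The variable names n_rows, unused and step are made up for illustration.

```python
# Fold size as computed in cross_validation_split()
n_rows = 208
n_folds = 5
fold_size = int(n_rows / n_folds)
# rows left over after all folds are filled
unused = n_rows - fold_size * n_folds
# each test row is worth this many percentage points of accuracy
step = 100.0 / fold_size
print('Fold size: %d, Unused rows: %d' % (fold_size, unused))
print('Accuracy step per row: %.3f%%' % step)
```

This is why every fold score in the results is a multiple of about 2.439%.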

Deep trees were constructed with a max depth of 6 and a minimum number of training rows at each node of 2. Samples of the training dataset were created with 50% the size of the original dataset. This was to force some variety in the dataset subsamples used to train each tree. The default for bagging is to have the size of sample datasets match the size of the original training dataset.

A series of 4 different numbers of trees were evaluated to show the behavior of the algorithm.

The accuracy from each fold and the mean accuracy for each configuration are printed. We can see a trend of some minor lift in performance as the number of trees is increased.

Trees: 1
Scores: [87.8048780487805, 65.85365853658537, 65.85365853658537, 65.85365853658537, 73.17073170731707]
Mean Accuracy: 71.707%

Trees: 5
Scores: [60.97560975609756, 80.48780487804879, 78.04878048780488, 82.92682926829268, 63.41463414634146]
Mean Accuracy: 73.171%

Trees: 10
Scores: [60.97560975609756, 73.17073170731707, 82.92682926829268, 80.48780487804879, 68.29268292682927]
Mean Accuracy: 73.171%

Trees: 50
Scores: [63.41463414634146, 75.60975609756098, 80.48780487804879, 75.60975609756098, 85.36585365853658]
Mean Accuracy: 76.098%

A difficulty of this method is that even though deep trees are constructed, the bagged trees that are created are very similar. In turn, the predictions made by these trees are also similar, and the high variance we desire among the trees trained on different samples of the training dataset is diminished.

This is because of the greedy algorithm used in the construction of the trees selecting the same or similar split points.

The tutorial tried to re-inject this variance by constraining the sample size used to train each tree. A more robust technique is to constrain the features that may be evaluated when creating each split point. This is the method used in the Random Forest algorithm.
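
As a rough sketch of that idea, the helper below picks a random subset of input column indexes that a split search could be restricted to. The function name sample_features() is hypothetical and is not part of the code in this tutorial. Using the square root of the number of inputs is a common heuristic for classification, assumed here only for illustration.

```python
from math import sqrt
from random import seed
from random import sample

# Choose a random subset of input column indexes, without replacement.
def sample_features(n_inputs, n_features):
	return sorted(sample(range(n_inputs), n_features))

seed(1)
# the Sonar dataset has 60 input columns
n_inputs = 60
n_features = int(sqrt(n_inputs))
features = sample_features(n_inputs, n_features)
print('Evaluating %d of %d features: %s' % (n_features, n_inputs, features))
```

To use it, get_split() would need to loop over the chosen indexes only, drawing a fresh subset for each split point, rather than over every input column.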

Extensions

Tune the Example. Explore different configurations for the number of trees and even individual tree configurations to see if you can further improve results.
Bag Another Algorithm. Other algorithms can be used with bagging. For example, a k-nearest neighbor algorithm with a low value of k will have a high variance and is a good candidate for bagging.
Regression Problems. Bagging can be used with regression trees. Instead of predicting the most common class value from the set of predictions, you can return the average of the predictions from the bagged trees. Experiment on regression problems.
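
For the regression extension, a minimal sketch of the changed prediction step is shown below. The function name bagging_predict_regression() and the predict_fn argument are hypothetical. The stand-in "trees" are plain numbers so that the example runs on its own. With the tutorial code you would pass the real predict() function and a list of regression trees.

```python
# Average the numeric predictions from a list of bagged models.
def bagging_predict_regression(trees, row, predict_fn):
	predictions = [predict_fn(tree, row) for tree in trees]
	return sum(predictions) / float(len(predictions))

# stand-in models: each "tree" always predicts its own constant value
stub_trees = [2.0, 3.0, 4.0]

def stub_predict(tree, row):
	return tree

print(bagging_predict_regression(stub_trees, [0.5], stub_predict))  # prints 3.0
```

The trees themselves would also need changes that are not shown here, for example a terminal node that returns the mean of its rows and a split criterion suited to numeric targets in place of the Gini index.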

Did you try any of these extensions?
Share your experiences in the comments below.

Review

In this tutorial, you discovered how to implement bootstrap aggregation from scratch with Python.

Specifically, you learned:

How to create a subsample and estimate bootstrap quantities.
How to create an ensemble of decision trees and use them to make predictions.
How to apply bagging to a real world predictive modeling problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

31 Responses to How to Implement Bagging From Scratch With Python

skorzec January 18, 2017 at 4:30 am #

Thanks for this example in python from scratch. I think the reason your score stays the same is because you are using the entire dataset to select your split attributes. This leads to similar trees and thus a small variance of the ensemble set. If you change your code, in such a way you get more variance in the trees, you will see an increase in performance of the ensemble. One solution is to alter this code snippet:

# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root

Reply
- Jason Brownlee January 18, 2017 at 10:17 am #
  
  Thanks for the tip!
  
  Reply
  - Nastaran May 17, 2018 at 2:38 pm #
    
    Hi Jason,
    
    Just wondering have you modified your code according his tip? ( cannot see any difference between your code and his suggestion)
    
    Reply
    - Jason Brownlee May 17, 2018 at 3:13 pm #
      
      Yes, it has been fixed. It used to use the whole dataset.
      
      Reply
Jovan Sardinha June 25, 2017 at 9:41 am #

thanks for the great tutorial!

I have re-written this script using sklearn for easy implementation:
https://gist.github.com/JovanSardinha/2c58bd1e7e3aa4c02affedfe7abe8a29

Reply
- Jason Brownlee June 26, 2017 at 6:04 am #
  
  Nice work!
  
  Reply
  - Alex Godfrey August 31, 2017 at 6:43 am #
    
    Hi Jason,
    
    Minor tip – In the string to integer conversion – I found that the unique set gets created in somewhat random order in repeat runs of this and others scripts that use this function. To avoid this I have changed the line to read-
    
    unique = sorted(set(class_values))
    
    This results in creating the same lookup dictionary every time. I ran across this when I was using the lookup dictionary to create nicely mapped printouts of the intermediate data.
    
    Reply
    - Jason Brownlee September 1, 2017 at 6:38 am #
      
      Great tip, thanks Alex!
      
      Reply
- Bhagwat September 16, 2020 at 9:19 pm #
  
  Hey, this link is not working
  
  Reply
  - Jason Brownlee September 17, 2020 at 6:45 am #
    
    See this:
    https://machinelearningmastery.com/bagging-ensemble-with-python/
    
    Reply
Leroy Veld September 3, 2018 at 8:02 am #

I am trying to implement your work into R.
However I am lost in translation. Specifically, a line in your “Calculate the Gini index for a split dataset” function. The following line in this function is a tough one to translate into R.

p = [row[-1] for row in group].count(class_val) / size

Would you mind helping me out? Or considering posting a R implementation of this code?

Many thanks in advance!

Reply
- Jason Brownlee September 3, 2018 at 1:34 pm #
  
  Sorry, I don’t have the capacity to translate code for you or debug new R implementations.
  
  Reply
Aymen September 24, 2018 at 5:15 am #

The problem i faced is i got different results when tuning parameters using grid search how can i fix it thanks

Reply
- Jason Brownlee September 24, 2018 at 6:13 am #
  
  This is to be expected:
  https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
  
  Reply
Shreyas SK November 2, 2018 at 3:59 pm #

I have implemented bagging with random forest regressor in python and i’ve tuned parameters but no where the results are converging. Training accuracy @ 95% and Testing acc @ 55%.

X = np.array([Depth, Dist, Soft, Stiff, sand, Sand, Stiff_C, Invert, FP, Penetr, Pitching, GP, GF]).T

Y = np.array(Settlement)

Y = Y.reshape(len(Y),)

data = [X, Y]
data = np.random.shuffle(data)

rf = result = BaggingRegressor(RandomForestRegressor(), n_estimators = 270,
bootstrap = True, oob_score = True, random_state = 42, max_features = 6)

rf.fit(X,Y)

pred = rf.predict(X)

r2 = r2_score(Y, pred)

plt.scatter(x=Y, y=pred)
plt.show()
print("Train Accuracy : " + str(r2))

print("Test Accuracy : " + str(rf.oob_score_))

Reply
- Jason Brownlee November 3, 2018 at 6:59 am #
  
  Perhaps try the sklearn implementation of the algorithms?
  
  Reply
Vinu March 28, 2019 at 8:05 am #

Hi Jason, Nice tutorial. However, I am wondering how bagging is different from that of cross validation? For a large dataset these are doing the same ‘kind’ of process. Could you please explain that?

Reply
- Jason Brownlee March 28, 2019 at 8:26 am #
  
  Similar, in that they both perform resampling, but different ends.
  
  In bagging we seek high variance because we are creating an ensemble for prediction.
  
  In CV we seek a balance between bias and variance to achieve a reliable estimate of model performance on unseen data.
  
  Reply
  - Vinu March 28, 2019 at 8:31 am #
    
    Thanks for clearing that up.
    
    Reply
Abdoulaye Diallo April 15, 2019 at 8:49 am #

Hello, how can I plot accuracy when bootstrapping with an sklearn classifier?

Reply
- Jason Brownlee April 15, 2019 at 2:38 pm #
  
  Do you mean estimating the performance of a classifier using the bootstrap?
  
  This tutorial will help:
  https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
  
  Reply
safa May 15, 2019 at 3:17 am #

Hi, when we create an ensemble with the bagging technique, where each model (let’s say a CNN) is trained on a different dataset, the test set is the same for each model and for the ensemble. But what about the validation set used while training the models? Can I use the same dataset to validate all models, or do I need a different one for each of them, since each model is trained on a different dataset? Thanks in advance!

Reply
- Jason Brownlee May 15, 2019 at 8:19 am #
  
  You can use the out of bag samples as a validation dataset for each bagged model.
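
  A minimal sketch of that idea, assuming rows are stored as lists. The function name bootstrap_with_oob and the toy dataset are assumptions for this example.

```python
from random import seed, randrange

# Draw a bootstrap sample and also return the out-of-bag rows,
# i.e. the rows that were never selected for this sample.
def bootstrap_with_oob(dataset):
    n = len(dataset)
    chosen = [randrange(n) for _ in range(n)]  # indices drawn with replacement
    chosen_set = set(chosen)
    sample = [dataset[i] for i in chosen]
    oob = [dataset[i] for i in range(n) if i not in chosen_set]
    return sample, oob

seed(1)
dataset = [[i, i % 2] for i in range(20)]  # toy rows: [feature, class label]
sample, oob = bootstrap_with_oob(dataset)

print(len(sample), len(oob))
```

  Each bagged model would then be trained on its own sample and validated on its own oob rows, so no model is validated on data it was fit to.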
  
  Reply
dkwih June 30, 2020 at 5:07 pm #

Thank you so much!!

Reply
- Jason Brownlee July 1, 2020 at 5:52 am #
  
  You’re welcome!
  
  Reply
Mark May 11, 2021 at 12:20 pm #

Thank you for this tutorial. I have a question please.

I believe the line

max(set(predictions), key=predictions.count)

would be the same as

max(predictions, key=predictions.count)

Is that correct?

Reply
- Jason Brownlee May 12, 2021 at 6:06 am #
  
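  Nearly. Both return a most frequent value. The set() reduces the candidates to the unique values, so count() is called once per unique value rather than once per prediction, and ties may be resolved differently.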
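
  A small sketch comparing the two expressions on a made-up list of class labels. Both pick a most frequent value; set() only trims the candidates that max() has to score.

```python
# Toy predictions from five hypothetical bagged trees.
predictions = ['M', 'R', 'M', 'M', 'R']

# Candidates are the unique labels, so count() runs once per label.
with_set = max(set(predictions), key=predictions.count)

# Candidates are all predictions, so count() runs once per prediction.
without_set = max(predictions, key=predictions.count)

print(with_set, without_set)  # M M

# With a tie, max() returns the first maximal item it encounters.
# A list has a fixed order; a set of strings does not guarantee one,
# so the set() version may break ties differently between runs.
tied = ['R', 'M']
first_in_tie = max(tied, key=tied.count)
print(first_in_tie)  # R
```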
  
  Reply
DominicM July 11, 2021 at 8:13 pm #

I am fairly new to this, so this is likely user error. I am using PyCharm Community.
=== did I miss something?? ===

============================= test session starts =============================
collecting … collected 1 item

ex_bagging_from_scratch.py::test_split ERROR [100%]
test setup failed
file D:\DomiPy\venv\ex_bagging_from_scratch.py, line 81
def test_split(index, value, dataset):
E fixture 'index' not found
> available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
> use 'pytest --fixtures [testpath]' for help on them.

D:\DomiPy\venv\ex_bagging_from_scratch.py:81

Reply
- Jason Brownlee July 12, 2021 at 5:48 am #
  
  Sorry to hear that, perhaps some of these tips will help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
lima January 28, 2022 at 12:15 am #

Hi Jason,

Thank you very much for the informative tutorial!

I’m a beginner so please bear with me! 🙂

As far as I understand, this implementation looks similar to the algorithm in this paper https://doi.org/10.1016/S0031-3203(02)00121-8 which I’m trying to implement for a regression problem where the dataset is small (230 data points).
My question is: how can I change the code to use linear regression with L2 regularization?

Reply
- James Carmichael January 28, 2022 at 10:31 am #
  
  Hi Lima,
  
  I’m eager to help, but I don’t have the capacity to customize the code for your specific needs.
  
  I get a lot of requests like this. I’m sure you can understand my rationale.
  
  I do have some ideas that might help:
  
  Perhaps I already have a tutorial with the change you’re asking for? Search the blog.
  Perhaps you can try to make the change yourself?
  Perhaps you can add a comment below the post with the change you need and I or another reader can make a suggestion?
  Perhaps you can hire a contractor or programmer to make the change?
  Perhaps you can post a description of the code needed on stackoverflow.com?
  
  Reply

