How to Implement Stacked Generalization (Stacking) From Scratch With Python

By Jason Brownlee on August 13, 2019 in Code Algorithms From Scratch

Code a Stacking Ensemble From Scratch in Python, Step-by-Step.

Ensemble methods are an excellent way to improve predictive performance on your machine learning problems.

Stacked Generalization or stacking is an ensemble technique that uses a new model to learn how to best combine the predictions from two or more models trained on your dataset.

In this tutorial, you will discover how to implement stacking from scratch in Python.

After completing this tutorial, you will know:

How to learn to combine the predictions from multiple models on a dataset.
How to apply stacked generalization to a real-world predictive modeling problem.

Kick-start your project with my new book Machine Learning Algorithms From Scratch, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Jan/2017: Changed the calculation of fold_size in cross_validation_split() to always be an integer. Fixes issues with Python 3.
Update Aug/2018: Tested and updated to work with Python 3.6.

How to Implement Stacking From Scratch With Python
Photo by Kiran Foster, some rights reserved.

Description

This section provides a brief overview of the Stacked Generalization algorithm and the Sonar dataset used in this tutorial.

Stacked Generalization Algorithm

Stacked Generalization or stacking is an ensemble algorithm where a new model is trained to combine the predictions from two or more models already trained on your dataset.

The predictions from the existing models or submodels are combined using a new model, and as such stacking is often referred to as blending, as the predictions from sub-models are blended together.

It is typical to use a simple linear method to combine the predictions of the submodels, ranging from simple averaging or voting to a weighted sum learned using linear regression or logistic regression.
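The combination methods mentioned above can be sketched in a few lines. This is only an illustration: the function names, the example predictions and the weights are made up here, and in stacking the weights would be learned by the aggregator model rather than set by hand.

```python
# Three simple ways to combine binary (0/1) predictions from sub-models.
# All names, predictions and weights here are hypothetical examples.

def combine_by_voting(predictions):
	# majority vote: the most common class label wins
	return max(set(predictions), key=predictions.count)

def combine_by_averaging(predictions):
	# mean of the predictions, thresholded at 0.5
	mean = sum(predictions) / float(len(predictions))
	return 1 if mean >= 0.5 else 0

def combine_by_weighted_sum(predictions, weights, bias=0.0):
	# weighted sum of the predictions, thresholded at 0.5
	total = bias + sum(w * p for w, p in zip(weights, predictions))
	return 1 if total >= 0.5 else 0

row_predictions = [1, 0, 1]
print(combine_by_voting(row_predictions))
print(combine_by_averaging(row_predictions))
print(combine_by_weighted_sum(row_predictions, [0.2, 0.7, 0.1]))
```

With these made-up weights the weighted sum trusts the second sub-model most, so it can disagree with the majority vote on the same row.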

Models that have their predictions combined must have skill on the problem, but do not need to be the best possible models. This means that you do not need to tune the submodels intensively, as long as each submodel shows some advantage over a baseline prediction.

It is important that sub-models produce different predictions, so-called uncorrelated predictions. Stacking works best when the predictions that are combined are all skillful, but skillful in different ways. This may be achieved by using algorithms that use very different internal representations (trees compared to instances) and/or models trained on different representations or projections of the training data.

In this tutorial, we will look at taking two very different and untuned sub-models and combining their predictions with a simple logistic regression algorithm.

Sonar Dataset

The dataset we will use in this tutorial is the Sonar dataset.

This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders. There are 208 observations.

It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1. The output variable is a string “M” for mine and “R” for rock, which will need to be converted to integers 1 and 0.
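One way to make the class conversion explicit is to pass a fixed lookup table, so that "M" always maps to 1 and "R" always maps to 0. The sketch below assumes a hypothetical helper named encode_class_column() and three made-up rows shortened to two inputs each; it is not the helper used later in this tutorial, which builds its lookup from a set and therefore does not guarantee which class receives which integer.

```python
# Map class strings to integers using an explicit lookup table.
# encode_class_column() and the rows below are hypothetical examples.
def encode_class_column(dataset, column, lookup):
	for row in dataset:
		row[column] = lookup[row[column].strip()]
	return dataset

lookup = {'R': 0, 'M': 1}
rows = [[0.02, 0.37, 'R'], [0.45, 0.11, 'M'], [0.26, 0.58, 'M']]
encode_class_column(rows, -1, lookup)
print([row[-1] for row in rows])
```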

By predicting the class with the most observations in the dataset (M or mines) the Zero Rule Algorithm can achieve an accuracy of about 53%.
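The baseline figure can be checked with a short calculation. The sketch below assumes class counts of 111 mines and 97 rocks, which come from the dataset's published description rather than from this tutorial, and the function name zero_rule_accuracy() is made up for illustration.

```python
# Zero Rule baseline: always predict the most common class.
# The class counts are an assumption taken from the dataset description.
def zero_rule_accuracy(labels):
	majority = max(set(labels), key=labels.count)
	correct = sum(1 for label in labels if label == majority)
	return majority, correct / float(len(labels)) * 100.0

labels = ['M'] * 111 + ['R'] * 97
majority, accuracy = zero_rule_accuracy(labels)
print('Majority class: %s' % majority)
print('Baseline accuracy: %.1f%%' % accuracy)
```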

You can learn more about this dataset at the UCI Machine Learning repository.

Download the dataset for free and place it in your working directory with the filename sonar.all-data.csv.

Tutorial

This tutorial is broken down into 3 steps:

Sub-models and Aggregator.
Combining Predictions.
Sonar Dataset Case Study.

These steps provide the foundation that you need to understand and implement stacking on your own predictive modeling problems.

1. Sub-models and Aggregator

We are going to use two models as submodels for stacking and a linear model as the aggregator model.

This part is divided into 3 sections:

Sub-model #1: k-Nearest Neighbors.
Sub-model #2: Perceptron.
Aggregator Model: Logistic Regression.

Each model will be described in terms of the function used to train the model and the function used to make predictions.

1.1 Sub-model #1: k-Nearest Neighbors

The k-Nearest Neighbors algorithm or kNN uses the entire training dataset as the model.

Therefore training the model involves retaining the training dataset. Below is a function named knn_model() that does just this.

# Prepare the kNN model
def knn_model(train):
	return train

Making predictions involves finding the k most similar records in the training dataset and selecting the most common class values. The Euclidean distance function is used to calculate the similarity between new rows of data and rows in the training dataset.

Below are the helper functions involved in making predictions with a kNN model. The function euclidean_distance() calculates the distance between two rows of data, get_neighbors() locates the nearest neighbors in the training dataset for a new row of data and knn_predict() makes a prediction from the neighbors for a new row of data.

# Calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
	distance = 0.0
	for i in range(len(row1)-1):
		distance += (row1[i] - row2[i])**2
	return sqrt(distance)

# Locate neighbors for a new row
def get_neighbors(train, test_row, num_neighbors):
	distances = list()
	for train_row in train:
		dist = euclidean_distance(test_row, train_row)
		distances.append((train_row, dist))
	distances.sort(key=lambda tup: tup[1])
	neighbors = list()
	for i in range(num_neighbors):
		neighbors.append(distances[i][0])
	return neighbors

# Make a prediction with kNN
def knn_predict(model, test_row, num_neighbors=2):
	neighbors = get_neighbors(model, test_row, num_neighbors)
	output_values = [row[-1] for row in neighbors]
	prediction = max(set(output_values), key=output_values.count)
	return prediction

You can see that the number of neighbors (k) is set to 2 as a default parameter on the knn_predict() function. This number was chosen with a little trial and error and was not tuned.
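A quick way to sanity check the kNN logic is to run it on a tiny contrived dataset. The sketch below is a compact restatement of the functions above, not the tutorial's exact code, and the four training rows are made up.

```python
from math import sqrt

# Compact restatement of the kNN helpers for a quick check.
def euclidean_distance(row1, row2):
	# the last column holds the class label, so it is skipped
	return sqrt(sum((row1[i] - row2[i])**2 for i in range(len(row1)-1)))

def knn_predict(train, test_row, num_neighbors=2):
	# sort training rows by distance and keep the closest ones
	neighbors = sorted(train, key=lambda r: euclidean_distance(test_row, r))[:num_neighbors]
	output_values = [row[-1] for row in neighbors]
	return max(set(output_values), key=output_values.count)

# made-up rows: two inputs followed by the class label
train = [[1.0, 1.1, 0], [1.2, 0.9, 0], [3.0, 3.2, 1], [3.1, 2.9, 1]]
print(knn_predict(train, [1.1, 1.0, None]))
print(knn_predict(train, [3.0, 3.0, None]))
```

Note that with an even k such as 2, the two neighbors can disagree. In that case the vote is tied and the result depends on the order in which the set of class values is iterated, so an odd k may be worth trying on a two-class problem.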

Now that we have the building blocks for a kNN model, let’s look at the Perceptron algorithm.

1.2 Sub-model #2: Perceptron

The model for the Perceptron algorithm is a set of weights learned from the training data.

In order to train the weights, many predictions need to be made on the training data in order to calculate error values. Therefore, both model training and prediction require a function for prediction.

Below are the helper functions for implementing the Perceptron algorithm. The perceptron_model() function trains the Perceptron model on the training dataset and perceptron_predict() is used to make a prediction for a row of data.

# Make a prediction with weights
def perceptron_predict(model, row):
	activation = model[0]
	for i in range(len(row)-1):
		activation += model[i + 1] * row[i]
	return 1.0 if activation >= 0.0 else 0.0

# Estimate Perceptron weights using stochastic gradient descent
def perceptron_model(train, l_rate=0.01, n_epoch=5000):
	weights = [0.0 for i in range(len(train[0]))]
	for epoch in range(n_epoch):
		for row in train:
			prediction = perceptron_predict(weights, row)
			error = row[-1] - prediction
			weights[0] = weights[0] + l_rate * error
			for i in range(len(row)-1):
				weights[i + 1] = weights[i + 1] + l_rate * error * row[i]
	return weights

The perceptron_model() function specifies both a learning rate and a number of training epochs as default parameters. Again, these parameters were chosen with a little bit of trial and error, but were not tuned on the dataset.
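To see what one step of training does, it can help to apply the weight update to a single row by hand. The sketch below pulls the update out into a hypothetical helper named perceptron_update(), which does not exist in the tutorial code, and uses one made-up row with two inputs and a class of 0.

```python
# One Perceptron weight update, isolated for illustration.
def perceptron_predict(weights, row):
	activation = weights[0]
	for i in range(len(row)-1):
		activation += weights[i + 1] * row[i]
	return 1.0 if activation >= 0.0 else 0.0

# perceptron_update() is a hypothetical helper, not part of the tutorial
def perceptron_update(weights, row, l_rate):
	error = row[-1] - perceptron_predict(weights, row)
	updated = [weights[0] + l_rate * error]
	for i in range(len(row)-1):
		updated.append(weights[i + 1] + l_rate * error * row[i])
	return updated

weights = [0.0, 0.0, 0.0]
row = [1.0, 2.0, 0]
new_weights = perceptron_update(weights, row, l_rate=0.01)
print(new_weights)
```

With all weights at zero the activation is 0.0, so the row is predicted as class 1 and the error is -1. The bias and each weight therefore move down by the learning rate times the input, and repeating the update on the same row then leaves the weights unchanged because the prediction is now correct.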

We now have implementations for both sub-models. Next, let’s look at implementing the aggregator model.

1.3 Aggregator Model: Logistic Regression

Like the Perceptron algorithm, Logistic Regression uses a set of weights, called coefficients, as the representation of the model.

And like the Perceptron algorithm, the coefficients are learned by iteratively making predictions on the training data and updating them.

Below are the helper functions for implementing the logistic regression algorithm. The logistic_regression_model() function is used to train the coefficients on the training dataset and logistic_regression_predict() is used to make a prediction for a row of data.

# Make a prediction with coefficients
def logistic_regression_predict(model, row):
	yhat = model[0]
	for i in range(len(row)-1):
		yhat += model[i + 1] * row[i]
	return 1.0 / (1.0 + exp(-yhat))

# Estimate logistic regression coefficients using stochastic gradient descent
def logistic_regression_model(train, l_rate=0.01, n_epoch=5000):
	coef = [0.0 for i in range(len(train[0]))]
	for epoch in range(n_epoch):
		for row in train:
			yhat = logistic_regression_predict(coef, row)
			error = row[-1] - yhat
			coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
			for i in range(len(row)-1):
				coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
	return coef

The logistic_regression_model() function defines a learning rate and number of epochs as default parameters, and as with the other algorithms, these parameters were found with a little trial and error and were not optimized.
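A single coefficient update can be worked through by hand in the same way. The sketch below uses a hypothetical helper named logistic_regression_update() and one made-up row with a single input and a class of 1. The yhat * (1.0 - yhat) factor is the derivative of the sigmoid function, so the update shown corresponds to gradient descent on squared error passed through the sigmoid.

```python
from math import exp

# One logistic regression coefficient update, isolated for illustration.
def logistic_regression_predict(coef, row):
	yhat = coef[0]
	for i in range(len(row)-1):
		yhat += coef[i + 1] * row[i]
	return 1.0 / (1.0 + exp(-yhat))

# logistic_regression_update() is a hypothetical helper, not part of the tutorial
def logistic_regression_update(coef, row, l_rate):
	yhat = logistic_regression_predict(coef, row)
	error = row[-1] - yhat
	step = l_rate * error * yhat * (1.0 - yhat)
	updated = [coef[0] + step]
	for i in range(len(row)-1):
		updated.append(coef[i + 1] + step * row[i])
	return updated

coef = [0.0, 0.0]
row = [2.0, 1]
print(logistic_regression_predict(coef, row))
new_coef = logistic_regression_update(coef, row, l_rate=0.01)
print(new_coef)
```

With zero coefficients the predicted probability is 0.5, the error is 0.5 and the sigmoid derivative is 0.25, so the step is 0.01 * 0.5 * 0.25 = 0.00125 for the intercept and twice that for the coefficient of the input value 2.0.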

Now that we have implementations of sub-models and the aggregator model, let’s see how we can combine the predictions from multiple models.

2. Combining Predictions

For a machine learning algorithm, learning how to combine predictions is much the same as learning from a training dataset.

A new training dataset can be constructed from the predictions of the sub-models, as follows:

Each row represents one row in the training dataset.
The first column contains predictions for each row in the training dataset made by the first sub-model, such as k-Nearest Neighbors.
The second column contains predictions for each row in the training dataset made by the second sub-model, such as the Perceptron algorithm.
The third column contains the expected output value for the row in the training dataset.

Below is a contrived example of what a constructed stacking dataset may look like:

kNN,	Per,	Y
0,	0,	0
1,	0,	1
0,	1,	0
1,	1,	1
0,	1,	0

A machine learning algorithm, such as logistic regression, can then be trained on this new dataset. In essence, this new meta-algorithm learns how to best combine the predictions from multiple submodels.
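The contrived stacking dataset above can be assembled from three parallel lists, one per column. This is a minimal sketch with hard-coded values taken from that example; in practice the first two lists would come from the sub-models' predictions.

```python
# Build the contrived stacked dataset from per-model prediction lists.
knn_predictions = [0, 1, 0, 1, 0]
perceptron_predictions = [0, 0, 1, 1, 1]
expected = [0, 1, 0, 1, 0]

# one stacked row per training row: [kNN, Per, Y]
stacked_dataset = [list(columns) for columns in
	zip(knn_predictions, perceptron_predictions, expected)]
for stacked_row in stacked_dataset:
	print(stacked_row)
```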

Below is a function named to_stacked_row() that implements this procedure for creating new rows for this stacked dataset.

The function takes a list of models as input, which are used to make predictions. The function also takes a list of functions as input, one function used to make a prediction for each model. Finally, a single row from the training dataset is included.

A new row is constructed one column at a time. Predictions are calculated using each model and the row of training data. The expected output value from the training dataset row is then added as the last column to the row.

# Make predictions with sub-models and construct a new stacked row
def to_stacked_row(models, predict_list, row):
	stacked_row = list()
	for i in range(len(models)):
		prediction = predict_list[i](models[i], row)
		stacked_row.append(prediction)
	stacked_row.append(row[-1])
	return stacked_row

On some predictive modeling problems, it is possible to get an even larger boost by training the aggregated model on both the training row and the predictions made by sub-models.

This improvement gives the aggregator model both the predictions of the sub-models and the context of all the data in the training row, which helps it determine how and when to best combine those predictions.

We can update our to_stacked_row() function to include this by aggregating the training row (minus the final column) and the stacked row as created above.

Below is an updated version of the to_stacked_row() function that implements this improvement.

# Make predictions with sub-models and construct a new stacked row
def to_stacked_row(models, predict_list, row):
	stacked_row = list()
	for i in range(len(models)):
		prediction = predict_list[i](models[i], row)
		stacked_row.append(prediction)
	stacked_row.append(row[-1])
	return row[0:len(row)-1] + stacked_row

It is a good idea to try both approaches on your problem to see which works best.
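To compare the two approaches side by side, the two versions can be folded into one function with a switch. The include_inputs parameter and the constant_predict() stub below are made up for this sketch and do not appear in the tutorial code; the stub simply returns a fixed class so that the shape of the stacked row is easy to see.

```python
# Both stacked-row variants behind one hypothetical switch.
def to_stacked_row(models, predict_list, row, include_inputs=False):
	stacked_row = list()
	for i in range(len(models)):
		prediction = predict_list[i](models[i], row)
		stacked_row.append(prediction)
	stacked_row.append(row[-1])
	if include_inputs:
		# prepend the original inputs (all columns except the class)
		return row[0:len(row)-1] + stacked_row
	return stacked_row

# stub "model": always predicts the class it was given
def constant_predict(model, row):
	return model

models = [0, 1]
predict_list = [constant_predict, constant_predict]
row = [0.2, 0.7, 1]
print(to_stacked_row(models, predict_list, row))
print(to_stacked_row(models, predict_list, row, include_inputs=True))
```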

Now that we have all of the pieces for stacked generalization, we can apply it to a real-world problem.

3. Sonar Dataset Case Study

In this section, we will apply the Stacking algorithm to the Sonar dataset.

The example assumes that a CSV copy of the dataset is in the current working directory with the filename sonar.all-data.csv.

The dataset is first loaded, the string values are converted to numeric, and the output column is converted from strings to the integer values 0 and 1. This is achieved with the helper functions load_csv(), str_column_to_float() and str_column_to_int().

We will use k-fold cross validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean classification accuracy across the k models. These behaviors are provided in the cross_validation_split(), accuracy_metric() and evaluate_algorithm() helper functions.
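The fold sizes produced by this kind of split can be checked without the real data. The sketch below runs the same cross_validation_split() logic on 208 placeholder rows, which stand in for the Sonar observations, and reports how many rows each fold receives.

```python
from random import seed
from random import randrange

# Split a dataset into k folds of equal integer size
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split

seed(1)
# placeholder rows standing in for the 208 Sonar observations
dataset = [[i] for i in range(208)]
folds = cross_validation_split(dataset, 3)
fold_sizes = [len(fold) for fold in folds]
unused = len(dataset) - sum(fold_sizes)
print(fold_sizes)  # [69, 69, 69]
print(unused)  # 1
```

Because the fold size is truncated to an integer, 3 folds of 69 rows use 207 of the 208 rows and one row is not assigned to any fold.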

We will use the k-Nearest Neighbors, Perceptron and Logistic Regression algorithms implemented above. We will also use our technique for creating the new stacked dataset defined in the previous step.

A new function named stacking() is developed. This function does 4 things:

It first trains a list of models (kNN and Perceptron).
It then uses the models to make predictions and create a new stacked dataset.
It then trains an aggregator model (logistic regression) on the stacked dataset.
It then uses the sub-models and aggregator model to make predictions on the test dataset.

The complete example is listed below.

# Test stacking on the sonar dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt
from math import exp

# Load a CSV file
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		csv_reader = reader(file)
		for row in csv_reader:
			if not row:
				continue
			dataset.append(row)
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
	class_values = [row[column] for row in dataset]
	unique = set(class_values)
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for row in dataset:
		row[column] = lookup[row[column]]
	return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores

# Calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
	distance = 0.0
	for i in range(len(row1)-1):
		distance += (row1[i] - row2[i])**2
	return sqrt(distance)

# Locate neighbors for a new row
def get_neighbors(train, test_row, num_neighbors):
	distances = list()
	for train_row in train:
		dist = euclidean_distance(test_row, train_row)
		distances.append((train_row, dist))
	distances.sort(key=lambda tup: tup[1])
	neighbors = list()
	for i in range(num_neighbors):
		neighbors.append(distances[i][0])
	return neighbors

# Make a prediction with kNN
def knn_predict(model, test_row, num_neighbors=2):
	neighbors = get_neighbors(model, test_row, num_neighbors)
	output_values = [row[-1] for row in neighbors]
	prediction = max(set(output_values), key=output_values.count)
	return prediction

# Prepare the kNN model
def knn_model(train):
	return train

# Make a prediction with weights
def perceptron_predict(model, row):
	activation = model[0]
	for i in range(len(row)-1):
		activation += model[i + 1] * row[i]
	return 1.0 if activation >= 0.0 else 0.0

# Estimate Perceptron weights using stochastic gradient descent
def perceptron_model(train, l_rate=0.01, n_epoch=5000):
	weights = [0.0 for i in range(len(train[0]))]
	for epoch in range(n_epoch):
		for row in train:
			prediction = perceptron_predict(weights, row)
			error = row[-1] - prediction
			weights[0] = weights[0] + l_rate * error
			for i in range(len(row)-1):
				weights[i + 1] = weights[i + 1] + l_rate * error * row[i]
	return weights

# Make a prediction with coefficients
def logistic_regression_predict(model, row):
	yhat = model[0]
	for i in range(len(row)-1):
		yhat += model[i + 1] * row[i]
	return 1.0 / (1.0 + exp(-yhat))

# Estimate logistic regression coefficients using stochastic gradient descent
def logistic_regression_model(train, l_rate=0.01, n_epoch=5000):
	coef = [0.0 for i in range(len(train[0]))]
	for epoch in range(n_epoch):
		for row in train:
			yhat = logistic_regression_predict(coef, row)
			error = row[-1] - yhat
			coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
			for i in range(len(row)-1):
				coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
	return coef

# Make predictions with sub-models and construct a new stacked row
def to_stacked_row(models, predict_list, row):
	stacked_row = list()
	for i in range(len(models)):
		prediction = predict_list[i](models[i], row)
		stacked_row.append(prediction)
	stacked_row.append(row[-1])
	return row[0:len(row)-1] + stacked_row

# Stacked Generalization Algorithm
def stacking(train, test):
	model_list = [knn_model, perceptron_model]
	predict_list = [knn_predict, perceptron_predict]
	models = list()
	for i in range(len(model_list)):
		model = model_list[i](train)
		models.append(model)
	stacked_dataset = list()
	for row in train:
		stacked_row = to_stacked_row(models, predict_list, row)
		stacked_dataset.append(stacked_row)
	stacked_model = logistic_regression_model(stacked_dataset)
	predictions = list()
	for row in test:
		stacked_row = to_stacked_row(models, predict_list, row)
		stacked_dataset.append(stacked_row)
		prediction = logistic_regression_predict(stacked_model, stacked_row)
		prediction = round(prediction)
		predictions.append(prediction)
	return predictions

# Test stacking on the sonar dataset
seed(1)
# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(len(dataset[0])-1):
	str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
n_folds = 3
scores = evaluate_algorithm(dataset, stacking, n_folds)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

A k value of 3 was used for cross-validation. Because fold_size is calculated as int(208/3), each fold holds 69 records to be evaluated upon each iteration, and one record is left out of the folds.

Running the example prints the scores and mean of the scores for the final configuration.

Scores: [78.26086956521739, 76.81159420289855, 69.56521739130434]
Mean Accuracy: 74.879%
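The reported scores can be read back as counts of correct predictions. The sketch below assumes 69 rows per fold, which is what int(208/3) gives in cross_validation_split(), and recovers how many rows were classified correctly in each fold along with the mean.

```python
# Convert the reported fold accuracies back into correct-prediction counts.
scores = [78.26086956521739, 76.81159420289855, 69.56521739130434]
fold_size = 69  # int(208 / 3), as computed in cross_validation_split()

correct = [int(round(score / 100.0 * fold_size)) for score in scores]
mean_accuracy = sum(scores) / float(len(scores))
print(correct)  # [54, 53, 48]
print('%.3f%%' % mean_accuracy)  # 74.879%
```

In other words, the three folds had 54, 53 and 48 correct predictions out of 69, or 155 out of 207 overall.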

Extensions

This section lists extensions to this tutorial that you may be interested in exploring.

Tune Algorithms. The algorithms used for the submodels and the aggregate model in this tutorial were not tuned. Explore alternate configurations and see if you can further lift performance.
Prediction Correlations. Stacking works better if the predictions of submodels are weakly correlated. Implement calculations to estimate the correlation between the predictions of submodels.
Different Sub-models. Implement more and different sub-models to be combined using the stacking procedure.
Different Aggregating Model. Experiment with simpler models (like averaging and voting) and more complex aggregation models to see if you can boost performance.
More Datasets. Apply stacking to more datasets on the UCI Machine Learning Repository.
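As a possible starting point for the Prediction Correlations extension, the Pearson correlation between two lists of predictions can be calculated from scratch. The function name and the two prediction lists below are made up for illustration, and returning 0.0 when one list is constant is a choice made in this sketch, since the correlation is undefined in that case.

```python
from math import sqrt

# Pearson correlation between two equal-length lists of predictions.
def pearson_correlation(a, b):
	n = float(len(a))
	mean_a = sum(a) / n
	mean_b = sum(b) / n
	covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
	var_a = sum((x - mean_a)**2 for x in a)
	var_b = sum((y - mean_b)**2 for y in b)
	if var_a == 0.0 or var_b == 0.0:
		# undefined for constant predictions; 0.0 is a choice made here
		return 0.0
	return covariance / sqrt(var_a * var_b)

# made-up predictions from two sub-models on the same eight rows
knn_predictions = [0, 1, 0, 1, 0, 1, 1, 0]
perceptron_predictions = [0, 1, 1, 1, 0, 0, 1, 0]
correlation = pearson_correlation(knn_predictions, perceptron_predictions)
print('Correlation: %.3f' % correlation)
```

A value close to 1.0 would suggest the sub-models make nearly the same predictions, leaving little for the aggregator model to learn from combining them.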

Did you explore any of these extensions?
Share your experiences in the comments below.

Review

In this tutorial, you discovered how to implement the stacking algorithm from scratch in Python.

Specifically, you learned:

How to combine the predictions from multiple models.
How to apply stacking to a real-world predictive modeling problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

33 Responses to How to Implement Stacked Generalization (Stacking) From Scratch With Python

George Liu November 18, 2016 at 1:59 pm #

Always great contents! Thanks Jason!

- Jason Brownlee November 19, 2016 at 8:42 am #
  
  thanks George.
  
Kevin Tian December 20, 2016 at 3:30 am #

Thank you for this really helpful tutorial!

I just have a question regarding how to create the stacked_row; in this tutorial it seems we train the sub-models on the train set and use their predictions on the same train set to build stacked_rows, on which the aggregator is trained and then directly used for the test set.

However I have seen some people doing this in a different way. They split the train set in parts, say A and B; train sub-models on A/B respectively and then use their predictions on B/A as input for the second layer aggregator. I just wonder which one is the proper approach and why?

Thank you !

- Jason Brownlee December 20, 2016 at 7:29 am #
  
  Hi Kevin,
  
  There are many ways to build an ensemble. What you describe sounds like a form of the random subspace method.
  
  Nevertheless, I would suggest you to try a suite of methods for a problem and see what works best. There is no “one best way”, and it is our job as practitioners to narrow down the options fast and zoom in on what works.
  
  - Michael July 14, 2018 at 9:12 am #
    
    First of all, thank you for the helpful blog post…really informative.
    
    To comment on Kevin’s note though, I think the way the code here is written makes the overall classifier potentially very prone to overfitting. For example, imagine if one of the sub-models is a 1NN classifier…in this case, if it’s trained on a given train set and then used to predict on that same train set, it will have an accuracy of 1.0. If the blender is then trained on this same train set, it will put all the weight on the 1NN classifier. That is a bad idea though, as it will not generalize very well.
    
    I’ve read a bit about this and it seems like a good approach is to indeed do k-fold CV but when building the blender to use the output of the sub-models on the held out fold, not what they were trained on.
    
    Or perhaps I’m missing something here?
    
    - Jason Brownlee July 15, 2018 at 6:03 am #
      
      Sounds like a good approach.
      
Reza April 5, 2017 at 6:09 pm #

Hi Jason,
Thanks for your good post, as always it was great.
Could you suggest a solution for stacking on an imbalanced dataset?

- Jason Brownlee April 9, 2017 at 2:33 pm #
  
  Sorry, I don’t understand. Perhaps you can rephrase your question?
  
Tim June 29, 2017 at 4:37 pm #

Hey Jason, great work. Anyway, I see you have not used out-of-sample data when training the first-level models to create the meta features for the second-level aggregator model. Wouldn’t this cause an information leakage problem? Could you please clarify this? Thanks.

- Jason Brownlee June 30, 2017 at 8:09 am #
  
  It is fine as long as the test data is left out. I expect your suggestion would lead to more robust results, though.
  
nehasharma August 31, 2017 at 5:43 am #

Hi Jason,

I have a query about line number 168, stacked_dataset.append(stacked_row), in the test set. Where have you used this?

- Jason Brownlee August 31, 2017 at 6:29 am #
  
  The stacking() function is the “algorithm” that makes a prediction.
  
  The stacked_dataset list contains the predictions from the sub-models that are fed into logistic regression to make the final outcome prediction.
  
  Does that help?
  
Alexander September 20, 2017 at 4:11 pm #

Thank you, Jason. Beautiful work.

- Jason Brownlee September 21, 2017 at 5:35 am #
  
  Thanks Alexander.
  
Alexander October 21, 2017 at 6:16 pm #

Jason, help me please.
This is a question about the pipeline.
Imagine that we decide to use an unsupervised learning algorithm first to search for clusters. After that, we develop a neural network. Maybe you know how we can configure the network to emphasize that observations are from different clusters? (We don’t want to construct many networks, but want to configure one.)

- Jason Brownlee October 22, 2017 at 5:18 am #
  
  What is the goal of this approach?
  
  If you want to split the data into clusters to develop specialized models, then you will need to develop separate models.
  
  You could build one model and use the cluster id as another input feature.
  
Alexander October 22, 2017 at 10:56 pm #

Thank you, Jason.

I want to find a way to transfer my knowledge about the sample (in our example, knowledge about clusters) to the network.
Is it possible to widen the understanding of the learning task?
Imagine a network which has many inputs: we have the feature space on the right hand and additional information which characterizes these features on the left hand. We don’t want to use our knowledge about the features as an additional standard feature, but to find new opportunities.

If the question is not clear, please don’t worry. I have many thoughts in my head, thoughts which are very interesting to me. But sometimes I can’t clearly explain what I want to the people around me.

- Jason Brownlee October 23, 2017 at 5:44 am #
  
  Perhaps this post will help you define your problem Alexander:
  https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
  
Alexander October 23, 2017 at 5:23 pm #

Thank you.

- Jason Brownlee October 24, 2017 at 5:27 am #
  
  I’m glad it helped.
  
Alexander January 16, 2018 at 2:47 am #

Hey Jason, I have a question. When differentiating, for instance, AdaBoostClassifier from stacking, could it be that AdaBoostClassifier uses only one type of model while stacking combines several machine learning algorithms to make a prediction? Also, AdaBoostClassifier uses hard voting, which basically means that it takes the mode of the models’ predictions. The thing is, I previously thought an AdaBoostClassifier was able to hold different machine learning algorithms, but now I realize that it can only have one type of machine learning algorithm, though you could still have 5 DecisionTreeClassifiers in an AdaBoostClassifier, which is controlled by the hyperparameter “n_estimators”. I just want to know the main difference between these two ensembles; it’s a bit confusing. Thanks for the help.

- Jason Brownlee January 16, 2018 at 7:39 am #
  
  Internally, AdaBoost uses many trees, where subsequent trees correct the predictions of prior trees.
  
  Stacking just lets multiple different models vote on an outcome.
  
Nitin January 17, 2018 at 2:16 pm #

Hi Jason,

Great post as always!!

I have a question. I have a dataset where
train.shape, test.shape
((1458, 301), (1459, 301))

I train 7 models on the train dataset and create a new dataset from the predictions of these 7 models (new_dataset.shape = (1458, 7)).

I fit a new model with this new dataset and Y from train.

Now, if I use this new model to predict for test, I get the error:
ValueError: feature_names mismatch

It’s because the new_dataset has 7 columns and the test has 301 columns.

I’m not sure what I am doing wrong. Kindly suggest.

- Jason Brownlee January 18, 2018 at 10:04 am #
  
  Sorry, I don’t follow.
  
- Danna February 14, 2018 at 6:58 am #
  
  Nitin,
  
  You also need to create a new feature set for the test dataset. You can train each model on the entire original training set, make predictions on the original test data, and combine these into a matrix that will have dimension (1459, 7).
  
Anjali Bhavan March 26, 2018 at 6:29 am #

Great post as always! I had one query though: do you have any tips as to which component classifiers (aggregators and submodels) we can choose for our ensemble for best results?

- Jason Brownlee March 26, 2018 at 10:04 am #
  
  What do you mean exactly?
  
Nirupama January 3, 2021 at 3:22 am #

My output accuracy scores are very low, as shown below. What are the possible reasons?
Scores: [1.4492753623188406, 0.0, 0.0]
Mean Accuracy: 0.483%

- Jason Brownlee January 3, 2021 at 5:59 am #
  
  Perhaps these tips will help:
  https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
  
Swati Matwankar Shah July 8, 2021 at 12:57 pm #

Hello Jason,

Thank you for your post on implementing ensemble from scratch. I have a question about my implementation scenario below.

I have a complete dataset with columns (features) 1-25. I already have model-1, trained with features 3 to 10. I have model-2, trained with features 15 to 21. Can I create an ensemble of model-1 and model-2, to check if it improves the prediction accuracy? Which ensemble should I use in this case?

I went through the descriptions and examples of the “python” Bagging/Boosting/Stacking classifiers. However, none of them allows me to specify the range of columns (features) as an input. The only attribute, “max_features”, does not satisfy my need to feed in “chosen” columns to an ensemble.

Is there any other Python way of implementing this (if not an ensemble)? I do not want any randomness in choosing the features for my scenario.

Regards,
Swati.

- Swati Matwankar Shah July 8, 2021 at 3:53 pm #
  
  Hello Jason,
  
  I just realized that I can use “stacking ensemble”. Here I can train the two models independently, with different features. I can feed their “outcomes”/”predictions” to the “stacking ensemble” to get the final prediction.
  
  Earlier, I was under the wrong assumption that I have to feed in the raw training data to the ensemble. And then the ensemble will iteratively train the embedded models and give us the final prediction. (There are some ensemble models which do that. But that is not my requirement.)
  
  - Jason Brownlee July 9, 2021 at 5:04 am #
    
    Yes, try it and see how you go.
    
- Jason Brownlee July 9, 2021 at 5:02 am #
  
  Perhaps compare the performance of each model alone to an average, weighted average and stacking ensemble of the two models.
  
  There are many examples that will help, start here:
  https://machinelearningmastery.com/start-here/#ensemble
  
