Last Updated on May 19, 2020

It is important to establish baseline performance on a predictive modeling problem.

A baseline provides a point of comparison for the more advanced methods that you evaluate later.

In this tutorial, you will discover how to implement baseline machine learning algorithms from scratch in Python.

After completing this tutorial, you will know:

- How to implement the random prediction algorithm.
- How to implement the zero rule prediction algorithm.

**Kick-start your project** with my new book Machine Learning Algorithms From Scratch, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Aug/2018**: Tested and updated to work with Python 3.6.

## Description

There are many machine learning algorithms to choose from. Hundreds in fact.

You must know whether the predictions for a given algorithm are good or not. But how do you know?

The answer is to use a baseline prediction algorithm. A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any other predictions for your problem, using a measure such as classification accuracy or RMSE.

The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem.

Once established, you can comment on how much better a given algorithm is as compared to the naive baseline algorithm, providing context on just how good a given method actually is.
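As a rough sketch of how baseline predictions might be scored, the two measures mentioned above can be written as small functions. The names **accuracy_metric()** and **rmse_metric()** and the sample values are assumptions made for illustration; they are not defined elsewhere in this tutorial.

```python
from math import sqrt

# Percentage of predictions that match the actual class labels
def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

# Root mean squared error between actual and predicted real values
def rmse_metric(actual, predicted):
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / float(len(actual)))

# Illustrative values only
print(accuracy_metric([0, 0, 1, 1], [0, 0, 0, 0]))  # 50.0
print(rmse_metric([10, 20], [15, 15]))              # 5.0
```

Scoring a baseline and a more advanced method with the same function gives the point of comparison described above.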

The two most commonly used baseline algorithms are:

- Random Prediction Algorithm.
- Zero Rule Algorithm.

When starting on a new problem that is trickier than a conventional classification or regression problem, it is a good idea to first devise a random prediction algorithm that is specific to your prediction problem. Later you can improve upon this and devise a zero rule algorithm.

Let’s implement these algorithms and see how they work.

## Tutorial

This tutorial is divided into 2 parts:

- Random Prediction Algorithm.
- Zero Rule Algorithm.

These steps will provide the foundations you need to handle implementing and calculating baseline performance for your machine learning algorithms.

### 1. Random Prediction Algorithm

The random prediction algorithm predicts an outcome chosen at random from the outcomes observed in the training data.

It is perhaps the simplest algorithm to implement.

It requires that you store all of the distinct outcome values in the training data, which could be large on regression problems with lots of distinct values.

Because random numbers are used to make decisions, it is a good idea to fix the random number seed prior to using the algorithm. This is to ensure that we get the same set of random numbers, and in turn the same decisions each time the algorithm is run.
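A small sketch of what fixing the seed does in practice: resetting the seed to the same value replays the same sequence of random numbers. The helper name **draw()** is hypothetical and used only for this illustration.

```python
from random import seed, randrange

# Draw n random indices in the range [0, k)
def draw(n, k):
    return [randrange(k) for _ in range(n)]

seed(1)
first = draw(5, 2)

# Resetting the seed to the same value replays the same sequence
seed(1)
second = draw(5, 2)

print(first == second)  # True
```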

Below is an implementation of the Random Prediction Algorithm in a function named **random_algorithm()**.

The function takes both a training dataset that includes output values and a test dataset for which output values must be predicted.

The function will work for both classification and regression problems. It assumes that the output value in the training data is the final column for each row.

First, the set of unique output values is collected from the training data. Then, for each row in the test set, one output value is selected at random from that set.

```python
from random import randrange

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted
```

We can test this function with a small dataset that only contains the output column for simplicity.

The output values in the training dataset are either “0” or “1”, meaning that the set of predictions the algorithm will choose from is {0, 1}. The test set also contains a single column, with no data as the predictions are not known.

```python
from random import seed
from random import randrange

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)
```

Running the example calculates random predictions for the test dataset and prints those predictions.

```
[0, 0, 1, 0]
```

The random prediction algorithm is easy to implement and fast to run, but we could do better as a baseline.
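Note that **random_algorithm()** chooses uniformly among the distinct output values, so it ignores how often each value occurs. One possible variation, sketched below and not part of the tutorial code, samples from the full list of training outputs so that predictions follow the observed frequencies. The function name **weighted_random_algorithm()** and the 90/10 dataset are assumptions made for illustration.

```python
from random import seed, choice

# Random predictions that follow the frequencies seen in the training data
# (a hypothetical variant, not the random_algorithm() defined above)
def weighted_random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    # Sampling from the full list, duplicates included, weights each
    # value by how often it occurs
    return [choice(output_values) for _ in test]

seed(1)
train = [[0]] * 9 + [[1]]   # 90% class 0, 10% class 1
test = [[None]] * 1000
predictions = weighted_random_algorithm(train, test)

# Fraction of predictions that are class 0; expected to be close to 0.9
print(predictions.count(0) / float(len(predictions)))
```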

### 2. Zero Rule Algorithm

The Zero Rule Algorithm is a better baseline than the random algorithm.

It uses more information about a given problem to create one rule in order to make predictions. This rule is different depending on the problem type.

Let’s start with classification problems, predicting a class label.

#### Classification

For classification problems, the one rule is to predict the class value that is most common in the training dataset. This means that if a training dataset has 90 instances of class “0” and 10 instances of class “1”, the algorithm will predict “0” and achieve a baseline accuracy of 90/100 or 90%.

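This is much better than random prediction. A random algorithm that predicts each class in proportion to its frequency in the training data would achieve only 82% accuracy on average, and the **random_algorithm()** function above, which chooses uniformly between the two distinct class values, would achieve only 50% on average. The 82% estimate is calculated as follows: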

```
= ((0.9 * 0.9) + (0.1 * 0.1)) * 100
= 82%
```
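To make the arithmetic concrete, here is a small sketch of the expected accuracy of three naive strategies on the 90/10 example. It assumes each prediction is made independently of the true label; the variable names are illustrative. Each term multiplies the probability that a row belongs to a class by the probability that the strategy predicts that class.

```python
# Class proportions: class "0" is 90% of the data, class "1" is 10%
p = [0.9, 0.1]

# Frequency-weighted guessing: predict class i with probability p[i]
weighted = sum(pi * pi for pi in p)

# Uniform guessing: predict each class with probability 1 / number of classes
uniform = sum(pi * (1.0 / len(p)) for pi in p)

# Zero rule: always predict the most common class
zero_rule = max(p)

print('weighted random: %.0f%%' % (weighted * 100))   # 82%
print('uniform random:  %.0f%%' % (uniform * 100))    # 50%
print('zero rule:       %.0f%%' % (zero_rule * 100))  # 90%
```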

Below is a function named **zero_rule_algorithm_classification()** that implements this for the classification case.

```python
# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test))]
    return predicted
```

The function makes use of the **max()** function with the key argument, which is a little clever.

Given a list of class values observed in the training data, the **max()** function takes the set of unique class values and calls **count()** on the list of class values for each value in the set.

The result is that it returns the class value that has the highest count of observed values in the list of class values observed in the training dataset.

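If several class values share the highest count, **max()** returns whichever of them it encounters first while iterating over the set. Because a set has no guaranteed order, this is not necessarily the first class value observed in the dataset.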

Once we select a class value, it is used to make a prediction for each row in the test dataset.
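If predictable handling of ties matters, one alternative is to count the class values with **collections.Counter**. This is a sketch, not the tutorial's implementation, and the name **zero_rule_counter()** is an assumption. On Python 3.7 and later, **most_common()** orders values with equal counts by first occurrence, so a tie goes to the class value seen first in the training data.

```python
from collections import Counter

# Zero rule for classification using Counter (an alternative sketch)
def zero_rule_counter(train, test):
    output_values = [row[-1] for row in train]
    # most_common(1) returns [(value, count)] for the most frequent value;
    # on Python 3.7+ ties go to the value encountered first
    prediction = Counter(output_values).most_common(1)[0][0]
    return [prediction for _ in test]

# Both classes appear twice; '1' is observed first
train = [['1'], ['0'], ['1'], ['0']]
test = [[None], [None]]
print(zero_rule_counter(train, test))  # ['1', '1']
```

Counting once with Counter also avoids calling count() on the full list for every distinct class value.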

Below is a worked example with a contrived dataset that contains 4 examples of class “0” and 2 examples of class “1”. We would expect the algorithm to choose the class value “0” as the prediction for each row in the test dataset.

```python
# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test))]
    return predicted

train = [['0'], ['0'], ['0'], ['0'], ['1'], ['1']]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_classification(train, test)
print(predictions)
```

Running this example makes the predictions and prints them to screen. As expected, the class value of “0” was chosen and predicted.

```
['0', '0', '0', '0']
```

Now, let’s see the Zero Rule Algorithm for regression problems.

#### Regression

Regression problems require the prediction of a real value.

A good default prediction for real values is to predict the central tendency. This could be the mean or the median.

A good default is to use the mean (also called the average) of the output value observed in the training data.

This is likely to have a lower error than random prediction which will return any observed output value.

Below is a function to do that named **zero_rule_algorithm_regression()**. It works by calculating the mean value for the observed output values.

```
mean = sum(value) / total values
```

Once calculated, the mean is then predicted for each row in the test data.

```python
# zero rule algorithm for regression
def zero_rule_algorithm_regression(train, test):
    output_values = [row[-1] for row in train]
    prediction = sum(output_values) / float(len(output_values))
    predicted = [prediction for i in range(len(test))]
    return predicted
```

This function can be tested with a simple example.

We can contrive a small dataset where the mean value is known to be 15.

```
10
15
12
15
18
20

mean = (10 + 15 + 12 + 15 + 18 + 20) / 6
mean = 90 / 6
mean = 15
```

Below is the complete example. We would expect that the mean value of 15 will be predicted for each of the 4 rows in the test dataset.

```python
# zero rule algorithm for regression
def zero_rule_algorithm_regression(train, test):
    output_values = [row[-1] for row in train]
    prediction = sum(output_values) / float(len(output_values))
    predicted = [prediction for i in range(len(test))]
    return predicted

train = [[10], [15], [12], [15], [18], [20]]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_regression(train, test)
print(predictions)
```

Running the example calculates the predicted output values that are printed. As expected, the mean value of 15 is predicted for each row in the test dataset.

```
[15.0, 15.0, 15.0, 15.0]
```
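The earlier claim that the mean is likely to have lower error than random prediction can be checked on this contrived dataset. The sketch below is illustrative only. Because the test rows have no known outputs, it assumes the error is measured against the training values themselves, and it averages the squared error of random prediction over every possible (prediction, actual) pair rather than relying on one random draw.

```python
from math import sqrt

# Output values from the contrived regression dataset
values = [10, 15, 12, 15, 18, 20]
mean = sum(values) / float(len(values))

# RMSE when the mean is predicted for every value
rmse_mean = sqrt(sum((mean - v) ** 2 for v in values) / len(values))

# Root of the expected squared error when the prediction is drawn uniformly
# from the distinct observed values, as random_algorithm() does
unique = set(values)
pairs = [(p - v) ** 2 for p in unique for v in values]
rmse_random = sqrt(sum(pairs) / float(len(pairs)))

print('mean: %.3f, random: %.3f' % (rmse_mean, rmse_random))
print(rmse_mean < rmse_random)  # True
```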

## Extensions

Below are a few extensions to the baseline algorithms that you may wish to investigate and implement as an extension to this tutorial.

- Alternate Central Tendency where the median, mode or other central tendency calculations are predicted instead of the mean.
- Moving Average for time series problems where the mean of the last n records is predicted.
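As a starting point, the two extensions above might be sketched as follows. The function names **zero_rule_median()** and **moving_average()** and the default window of 3 are assumptions made for illustration, and the moving average assumes the training rows are in time order.

```python
# Predict the median of the training outputs for every test row
def zero_rule_median(train, test):
    output_values = sorted(row[-1] for row in train)
    n = len(output_values)
    middle = n // 2
    if n % 2 == 1:
        prediction = output_values[middle]
    else:
        # Even count: average the two middle values
        prediction = (output_values[middle - 1] + output_values[middle]) / 2.0
    return [prediction for _ in test]

# Predict the mean of the last n training outputs (rows assumed time-ordered)
def moving_average(train, test, n=3):
    recent = [row[-1] for row in train[-n:]]
    prediction = sum(recent) / float(len(recent))
    return [prediction for _ in test]

train = [[10], [15], [12], [15], [18], [20]]
test = [[None], [None]]
print(zero_rule_median(train, test))     # [15.0, 15.0]
print(moving_average(train, test, n=3))  # mean of 15, 18 and 20 for each row
```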

## Review

In this tutorial, you discovered the importance of calculating a baseline of performance on your machine learning problem.

You now know:

- How to implement a random prediction algorithm for classification and regression problems.
- How to implement a zero rule algorithm for classification and regression problems.

**Do you have any questions?**

Ask your questions in the comments and I will do my best to answer.

Hi Jason,

Thank you for the article.

Guess, in `zero_rule_algorithm_classification` and `zero_rule_algorithm_regression` predicted values should be generated from the len(test) instead of len(train).

Yes, great catch. Fixed.

Thanks Grigoriy.

hi, Jason!

I did not quite understand the code above!

```python
from random import seed
from random import randrange

def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)
```

What are def, row[-1] and all those things?

I started learning ML just a week ago.

What should I start from to understand this page completely?

Thank you!

Hi,

row[-1] refers to the last item in a Python list.

You might be best served by reading up on Python syntax.

Should not it always be 50% when using random prediction? The 82% is for when using 90% : 10% prediction for each sample.

Your data is often not balanced 50/50 between two classes.

You may have imbalanced data, you may have more than two classes.

is it always desirable to have balanced (50/50) data in training set? what is the influence of balanced training set on classifier performance on test set

It varies from problem to problem. You can do a sensitivity analysis and answer this specifically for your data, in fact, I encourage it.

Hi Jason,

Thanks a lot for this great post.

I think the question from Wei Zhang was based on the fact that the random prediction algorithm you’ve provided above does not account for the number of occurrences in the train set, but it only finds the unique elements and then randomly generates output values using a uniform distribution. Please correct me if I’m wrong, but following this method, the accuracy equation becomes (assuming 90-10 split in data): 0.9*0.5 + 0.1*0.5 = 0.5 or 50%

Yes, I believe you are correct.

I will schedule time to update the post.

```python
from random import seed
from random import randrange
```

Above is not really used in zero_rule_algorithm_regression()

Thanks, you could probably ignore those imports.

In this above regression example

Answer is [15.0, 15.0, 15.0, 15.0] instead [15.0, 15.0, 15.0, 15.0,15.0,15.0]

because length of test dataset is 4 not 6

Yea this is what I think too. The length is 4

can you explain how you calculate 82% like that please?

thank you in advance

I do show exactly, what is the problem you are having?

I have a number of values that need modelling e.g (blue, red, green, yellow).

How do I work out the total accuracy of both methods? Do I just work out the accuracy of each value compared to the complete set and average them?

e.g 10 blues out of 100 values means 10% for zero rule for blue but then I do the same for each value?

I’m not even sure what do to do for random as there are so many values in my training set. Do I just compare both the prediction array with my training set?

Yes, the accuracy for each class can be reported.

You can also create a confusion matrix:

https://machinelearningmastery.com/confusion-matrix-machine-learning/

Hello Jason,

I was wondering whether we can find the RMSE, MAE, etc. in the Random Prediction Algorithm ? (As we do for the ZeroR in weka)

Yes, you can calculate each metric.

Hello Jason,

Thank you for your post.

it is useful for the how you create a simple algorithm.

I think there is a small failure in the code when the second time the function zero_rule_algorithm create.

the prediction variable is based on the train variable and i think it is the test variable.

“predicted = [prediction for i in range(len(train))]”

You’re welcome.

Thanks! Fixed.

Hi Jason,

I’m new in ML and now I learning about face recognition using neural networks and deep learning. I trained my dataset on the FaceNet model and I want to check the model performance using Benchmark and baseline comparing but I don’t know how I should start, Do you have any tutorials that may help me or give me an idea about that.

Thanks.

Yes, see this tutorial:

https://machinelearningmastery.com/how-to-develop-a-face-recognition-system-using-facenet-in-keras-and-an-svm-classifier/

Thanks for excellent post Dr. Brownlee! I see how zero rule algorithm is better than random algorithm for metric like accuracy. But for metric like precision, recall, f1, that rely on positive class predictions, random algorithm I think makes more sense because zero rule will make these metrics equal to zero. Does it make sense to take maximum score from both algorithm to create baseline for multiple metrics? Basically, running zero rule algorithm for metrics like accuracy and random algorithm for precision, recall, etc. and then I have baseline for multiple metrics? Thanks for sharing your knowledge.

Good question, see this:

https://machinelearningmastery.com/naive-classifiers-imbalanced-classification-metrics/

Fantastic, thank you!

You’re welcome.

Hello Dr. Brownlee,

I’m very confused about how this was derived for the accuracy of the random prediction algorithm on the example:

((0.9 * 0.9) + (0.1 * 0.1)) * 100

Looks like other people here understood, but I’m lost on the context of the numbers here.

The only thing I understand is that 0.9 = 90/100 occurrences of class 0 and 0.1 = 10/100 occurrences of class 1

Why are we multiplying 0.9 * 0.9? or why is 0.1 * 0.1 happening? I’m really lost 🙁

Good question, this tutorial does a better job of explaining the probability behind the expected performance of the random guessing classifier:

https://machinelearningmastery.com/dont-use-random-guessing-as-your-baseline-classifier/

And this:

https://machinelearningmastery.com/how-to-develop-and-evaluate-naive-classifier-strategies-using-probability/