How To Implement The Perceptron Algorithm From Scratch In Python

The Perceptron algorithm is the simplest type of artificial neural network.

It is a model of a single neuron that can be used for two-class classification problems and provides the foundation for later developing much larger networks.

In this tutorial, you will discover how to implement the Perceptron algorithm from scratch with Python.

After completing this tutorial, you will know:

  • How to train the network weights for the Perceptron.
  • How to make predictions with the Perceptron.
  • How to implement the Perceptron algorithm for a real-world classification problem.

Let’s get started.

Update Jan/2017: Changed the calculation of fold_size in cross_validation_split() to always be an integer. Fixes issues with Python 3.

Photo by Les Haines, some rights reserved.


This section provides a brief introduction to the Perceptron algorithm and the Sonar dataset to which we will later apply it.

Perceptron Algorithm

The Perceptron is inspired by the information processing of a single neural cell called a neuron.

A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body.

In a similar way, the Perceptron receives input signals from examples of training data that we weight and combine in a linear equation called the activation.

The activation is then transformed into an output value or prediction using a transfer function, such as the step transfer function.

In this way, the Perceptron is a classification algorithm for problems with two classes (0 and 1) where a linear equation (like a line or hyperplane) can be used to separate the two classes.

It is closely related to linear regression and logistic regression, which make predictions in a similar way (e.g. a weighted sum of inputs).

The weights of the Perceptron algorithm must be estimated from your training data using stochastic gradient descent.

Stochastic Gradient Descent

Gradient descent is the process of minimizing a function by following the gradient of the cost function.

This involves knowing the form of the cost as well as the derivative so that from a given point you know the gradient and can move in that direction, e.g. downhill towards the minimum value.

In machine learning, we can use a technique called stochastic gradient descent, which evaluates and updates the weights every iteration, to minimize the error of a model on our training data.

The way this optimization algorithm works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction.

This procedure can be used to find the set of weights in a model that result in the smallest error for the model on the training data.

For the Perceptron algorithm, each iteration the weights (w) are updated using the equation:

w = w + learning_rate * (expected - predicted) * x

Where w is the weight being optimized, learning_rate is a learning rate that you must configure (e.g. 0.01), (expected - predicted) is the prediction error for the model on the training data attributed to the weight, and x is the input value.
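To make the update rule concrete, here is a single weight update with made-up numbers (all values below are illustrative, not taken from the tutorial's dataset):

```python
# One stochastic gradient descent update of a single weight.
# All values here are illustrative, not taken from the tutorial's dataset.
learning_rate = 0.01
w = 0.5          # current weight value
x = 2.0          # input value associated with this weight
expected = 1.0   # true class label for this training instance
predicted = 0.0  # the model's (wrong) prediction
error = expected - predicted
w = w + learning_rate * error * x
print(w)  # 0.52
```

Because the prediction was too low, the weight is nudged upward in proportion to both the error and the input value.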

Sonar Dataset

The dataset we will use in this tutorial is the Sonar dataset.

This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders.

It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1, so we will not need to normalize the input data, which is often a good practice with the Perceptron algorithm. The output variable is a string, “M” for mine and “R” for rock, which will need to be converted to the integers 1 and 0.

By predicting the class with the most observations in the dataset (M or mines) the Zero Rule Algorithm can achieve an accuracy of 53%.
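As a point of reference, a minimal Zero Rule classifier can be sketched as follows (this is an illustrative sketch, not the tutorial's listing): it simply predicts the most common class seen in the training data.

```python
def zero_rule_classify(train, test):
    # Predict the most frequent class value in the training data for every test row.
    outputs = [row[-1] for row in train]
    prediction = max(set(outputs), key=outputs.count)
    return [prediction for _ in test]

# Illustrative data: the last column of each training row is the class label.
train = [[0.1, 'M'], [0.4, 'M'], [0.2, 'R']]
test = [[0.3], [0.9]]
print(zero_rule_classify(train, test))  # ['M', 'M']
```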

You can learn more about this dataset at the UCI Machine Learning repository. You can download the dataset for free and place it in your working directory with the filename sonar.all-data.csv.


This tutorial is broken down into 3 parts:

  1. Making Predictions.
  2. Training Network Weights.
  3. Modeling the Sonar Dataset.

These steps will give you the foundation to implement and apply the Perceptron algorithm to your own classification predictive modeling problems.

1. Making Predictions

The first step is to develop a function that can make predictions.

This will be needed both in the evaluation of candidate weight values in stochastic gradient descent, and after the model is finalized and we wish to start making predictions on test data or new data.

Below is a function named predict() that predicts an output value for a row given a set of weights.

The first weight is always the bias as it is standalone and not responsible for a specific input value.

We can contrive a small dataset to test our prediction function.

We can also use previously prepared weights to make predictions for this dataset.

Putting this all together we can test our predict() function below.
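Since the original listing is not reproduced here, the following is a sketch of what such a predict() function and a small contrived test might look like; the dataset and weights are illustrative values chosen by hand, not the ones from the original post:

```python
def predict(row, weights):
    # weights[0] is the bias; weights[i + 1] pairs with input row[i].
    activation = weights[0]
    for i in range(len(row) - 1):
        activation += weights[i + 1] * row[i]
    # Step transfer function: output 1 when the activation crosses zero.
    return 1.0 if activation >= 0.0 else 0.0

# Contrived dataset: two inputs (X1, X2) and the expected class (y) per row.
dataset = [[1.0, 1.0, 0], [2.0, 1.5, 0], [4.0, 3.0, 1], [5.0, 4.0, 1]]
# Hand-picked weights for illustration: [bias, w1, w2].
weights = [-3.0, 1.0, 0.0]
for row in dataset:
    prediction = predict(row, weights)
    print("Expected=%d, Predicted=%d" % (row[-1], prediction))
```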

There are two input values (X1 and X2) and three weight values (bias, w1 and w2). The activation equation we have modeled for this problem is:

activation = (w1 * X1) + (w2 * X2) + bias

Running the predict() function with a set of weights chosen by hand gives predictions that match the expected output (y) values.

Now we are ready to implement stochastic gradient descent to optimize our weight values.

2. Training Network Weights

We can estimate the weight values for our training data using stochastic gradient descent.

Stochastic gradient descent requires two parameters:

  • Learning Rate: Used to limit the amount each weight is corrected each time it is updated.
  • Epochs: The number of times to run through the training data while updating the weights.

These, along with the training data, will be the arguments to the function.

There are 3 loops we need to perform in the function:

  1. Loop over each epoch.
  2. Loop over each row in the training data for an epoch.
  3. Loop over each weight and update it for a row in an epoch.

As you can see, we update each weight for each row in the training data, each epoch.

Weights are updated based on the error the model made. The error is calculated as the difference between the expected output value and the prediction made with the candidate weights.

There is one weight for each input attribute, and these are updated in a consistent way, for example:

w = w + learning_rate * (expected - predicted) * x

The bias is updated in a similar way, except without an input, as it is not associated with a specific input value:

bias = bias + learning_rate * (expected - predicted)

Now we can put all of this together. Below is a function named train_weights() that calculates weight values for a training dataset using stochastic gradient descent.

You can see that we also keep track of the sum of the squared error (a positive value) each epoch so that we can print out a nice message each outer loop.

We can test this function on the same small contrived dataset from above.

We use a learning rate of 0.1 and train the model for only 5 epochs, or 5 exposures of the weights to the entire training dataset.
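The full listing is not reproduced here, but a self-contained sketch of train_weights(), run on the same kind of contrived dataset as above (again with illustrative values, not the original post's), might look like this:

```python
def predict(row, weights):
    # weights[0] is the bias; weights[i + 1] pairs with input row[i].
    activation = weights[0]
    for i in range(len(row) - 1):
        activation += weights[i + 1] * row[i]
    return 1.0 if activation >= 0.0 else 0.0

def train_weights(train, l_rate, n_epoch):
    # Start with all weights at zero (the first weight is the bias).
    weights = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            prediction = predict(row, weights)
            error = row[-1] - prediction
            sum_error += error ** 2
            weights[0] = weights[0] + l_rate * error  # bias update, no input term
            for i in range(len(row) - 1):
                weights[i + 1] = weights[i + 1] + l_rate * error * row[i]
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return weights

# Illustrative contrived dataset: X1, X2, expected class.
dataset = [[1.0, 1.0, 0], [2.0, 1.5, 0], [4.0, 3.0, 1], [5.0, 4.0, 1]]
weights = train_weights(dataset, 0.1, 5)
```

On this small dataset the error drops to zero within a handful of epochs, and the learned weights classify every training row correctly.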

Running the example prints a message each epoch with the sum squared error for that epoch and the final set of weights.

You can see how the problem is learned very quickly by the algorithm.

Now, let’s apply this algorithm on a real dataset.

3. Modeling the Sonar Dataset

In this section, we will train a Perceptron model using stochastic gradient descent on the Sonar dataset.

The example assumes that a CSV copy of the dataset is in the current working directory with the file name sonar.all-data.csv.

The dataset is first loaded, the string values converted to numeric, and the output column converted from strings to the integer values 0 and 1. This is achieved with the helper functions load_csv(), str_column_to_float() and str_column_to_int() to load and prepare the dataset.

We will use k-fold cross validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean model error. Classification accuracy will be used to evaluate each model. These behaviors are provided in the cross_validation_split(), accuracy_metric() and evaluate_algorithm() helper functions.
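A sketch of two of these helpers is shown below, assuming the same row-of-lists data layout used throughout; note the int() cast on the fold size, which is the Python 3 fix mentioned in the Jan/2017 update:

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    # Split the dataset into n_folds folds, sampling rows without replacement.
    dataset_split = []
    dataset_copy = list(dataset)
    # Integer fold size (the int() cast fixes float division under Python 3).
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

def accuracy_metric(actual, predicted):
    # Percentage of predictions that match the known class values.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0
```

Note that with 208 rows and 3 folds, a few rows are left out of the folds entirely, which is a side effect of forcing an integer fold size.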

We will use the predict() and train_weights() functions created above to train the model and a new perceptron() function to tie them together.
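A sketch of such a perceptron() function is shown below, with compact versions of predict() and train_weights() repeated so the snippet stands alone:

```python
def predict(row, weights):
    activation = weights[0]  # bias term
    for i in range(len(row) - 1):
        activation += weights[i + 1] * row[i]
    return 1.0 if activation >= 0.0 else 0.0

def train_weights(train, l_rate, n_epoch):
    weights = [0.0 for _ in range(len(train[0]))]
    for _ in range(n_epoch):
        for row in train:
            error = row[-1] - predict(row, weights)
            weights[0] += l_rate * error
            for i in range(len(row) - 1):
                weights[i + 1] += l_rate * error * row[i]
    return weights

def perceptron(train, test, l_rate, n_epoch):
    # Fit weights on the training fold, then predict each row of the test fold.
    weights = train_weights(train, l_rate, n_epoch)
    return [predict(row, weights) for row in test]
```

This is the function signature the cross-validation harness expects: it takes a train and a test fold plus the algorithm's parameters, and returns one prediction per test row.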

Below is the complete example.

A k value of 3 was used for cross-validation, giving each fold 208/3 = 69.3 or just under 70 records to be evaluated upon each iteration. A learning rate of 0.1 and 500 training epochs were chosen with a little experimentation.

You can try your own configurations and see if you can beat my score.

Running this example prints the scores for each of the 3 cross-validation folds then prints the mean classification accuracy.

We can see that the accuracy is about 73%, higher than the baseline of about 53% achieved by predicting the majority class with the Zero Rule Algorithm.


This section lists extensions to this tutorial that you may wish to consider exploring.

  • Tune The Example. Tune the learning rate, number of epochs and even data preparation method to get an improved score on the dataset.
  • Batch Stochastic Gradient Descent. Change the stochastic gradient descent algorithm to accumulate updates across each epoch and only update the weights in a batch at the end of the epoch.
  • Additional Classification Problems. Apply the technique to other classification problems on the UCI Machine Learning Repository.

Did you explore any of these extensions?
Let me know about it in the comments below.


In this tutorial, you discovered how to implement the Perceptron algorithm using stochastic gradient descent from scratch with Python.

You learned:

  • How to make predictions for a binary classification problem.
  • How to optimize a set of weights using stochastic gradient descent.
  • How to apply the technique to a real classification predictive modeling problem.

Do you have any questions?
Ask your question in the comments below and I will do my best to answer.


54 Responses to How To Implement The Perceptron Algorithm From Scratch In Python

  1. Philip Brierley November 2, 2016 at 7:07 am #

    There is a derivation of the backprop learning rule, and also similar code in a bunch of other languages, from Fortran to C to PHP.

    With help we did get it working in Python, with some nice plots that show the learning proceeding.

    • Jason Brownlee November 2, 2016 at 9:10 am #

      Thanks for sharing Philip.

      • Misge November 4, 2016 at 3:44 pm #

        Sorry to bother you, but I want to understand: what's wrong with using your code? I think you also used someone else's code, right? At least you read and reimplemented it. I hope my question will not offend you.

        • Jason Brownlee November 5, 2016 at 7:28 am #

          I wrote the code from scratch myself.

          The code works, what problem are you having exactly?

  2. Andre Logunov November 3, 2016 at 12:06 pm #

    Hi, Jason!

    A very informative web-site you’ve got! I’m thinking of making a compilation of ML materials including yours. I wonder if I could use your wonderful tutorials in a book on ML in Russian provided of course your name will be mentioned? It’s just a thought so far.

    • Jason Brownlee November 4, 2016 at 9:03 am #

      No Andre, please do not use my materials in your book.

  3. Stefan November 4, 2016 at 3:12 am #

    Thanks for the interesting lesson. I’m reviewing the code now but I’m confused, where are the train and test values in the perceptron function coming from? I can’t find their origin.

    • Stefan Lop November 4, 2016 at 6:38 am #

      I’m also receiving a ValueError(“empty range for randrange()”) error, the script seems to loop through a couple of randranges in the cross_validation_split function before erroring, not sure why. Was the script you posted supposed to work out of the box? Because I cannot get it to work and have been using the exact same data set you are working with.

      • Jason Brownlee November 4, 2016 at 11:15 am #

        Hi Stefan, sorry to hear that you are having problems.

        Yes, the script works out of the box on Python 2.7.

        Perhaps there was a copy-paste error?
        Perhaps you are on a different platform like Python 3 and the script needs to be modified slightly?

        Are you able to share more details?

        • Stefan November 5, 2016 at 12:15 am #

          Was running Python 3, works fine in 2 haha thanks!

      • Jason Brownlee January 3, 2017 at 9:52 am #

        I have updated the cross_validation_split() function in the above example to address issues with Python 3.

    • Jason Brownlee November 4, 2016 at 9:13 am #

      In the full example, the code is not using a train/test split but instead k-fold cross validation, which is like multiple train/test evaluations.

      Learn more about the test harness here:

      • Stefan November 5, 2016 at 12:22 am #

        But the train and test arguments in the perceptron function must be populated by something, where is it? I can’t find anything that would pass a value to those train and test arguments.

        • Jason Brownlee November 5, 2016 at 7:31 am #

          Hi Stefan,

          The train and test arguments come from the call in evaluate_algorithm to algorithm() on line 67.

          Algorithm is a parameter which is passed in on line 114 as the perceptron() function.

          So, this means that each loop on line 58 that the train and test lists of observations come from the prepared cross-validation folds.

          To deeply understand this test harness code see the blog post dedicated to it here:

          • Stefan November 8, 2016 at 1:42 am #

            Oh boy, big time brain fart on my end I see it now. Thanks so much for your help, I’m really enjoying all of the tutorials you have provided so far.

          • Jason Brownlee November 8, 2016 at 9:54 am #

            I’m glad to hear you made some progress Stefan.

  4. Amita misra November 12, 2016 at 6:34 pm #

    Thanks for such a simple and basic introductory tutorial for deep learning. I had been trying to find something for months, but it was all Theano and TensorFlow and left me intimidated. This is really a good place for a beginner like me.

  5. vedhavyas November 20, 2016 at 10:42 pm #

    Hi Jason,

    Implemented in Golang. Here are my results

    Id 2, predicted 53, total 70, accuracy 75.71428571428571
    Id 1, predicted 53, total 69, accuracy 76.81159420289855
    Id 0, predicted 52, total 69, accuracy 75.36231884057972
    mean accuracy 75.96273291925466

    no. of folds: 3
    learningRate: 0.01
    epochs: 500

    • Jason Brownlee November 22, 2016 at 6:48 am #

      Very nice work vedhavyas!

      Do you have a link to your golang version you can post?

  6. Tim November 22, 2016 at 8:32 pm #

    Hi Jason!

    Thanks for the great tutorial! A ‘from-scratch’ implementation always helps to increase the understanding of a mechanism.

    I have a question though: I thought to have read somewhere that in ‘stochastic’ gradient descent, the weights have to be initialised to a small random value (hence the “stochastic”) instead of zero, to prevent some nodes in the net from becoming or remaining inactive due to zero multiplication. I see in your gradient descent algorithm, you initialise the weights to zero. Could you elaborate some on the choice of the zero init value? My understanding may be incomplete, but this question popped up as I was reading.


    • Jason Brownlee November 23, 2016 at 8:57 am #

      This can help with convergence Tim, but is not strictly required as the example above demonstrates.

      • Tim November 23, 2016 at 7:40 pm #

        Thanks Jason! That clears it up!

  7. kero hakem December 22, 2016 at 1:55 am #

    Thanks for the great tutorial! But how can I use this perceptron to predict multiple classes?

  8. PN February 22, 2017 at 5:52 am #

    Thanks for your great website. I use part of your tutorials in my machine learning class if it’s allowed.

    • Jason Brownlee February 22, 2017 at 10:06 am #

      Yes, use them any way you want, please credit the source.

  9. Aniket Saxena March 20, 2017 at 3:01 am #

    Hello Sir, please tell me: to visualize the progress and final result of my program, how can I use matplotlib to output an image for each iteration of the algorithm?

    • Jason Brownlee March 20, 2017 at 8:17 am #

      You could create and save the image within the epoch loop.

  10. Aniket Saxena March 22, 2017 at 1:20 am #

    Hello Sir, I have gone through the above code and found the epoch loop in two functions, train_weights() and perceptron(). Since I'm a beginner in machine learning, please guide me: how can I create and save the image within the epoch loop to visualize the output of the perceptron algorithm at each iteration?

    • Jason Brownlee March 22, 2017 at 8:08 am #

      Sorry, I do not have an example of graphing performance. Consider using matplotlib.

  11. Sahiba March 24, 2017 at 4:43 am #

    Hi Jason,

    Thank you for this explanation. I have a question – why isn’t the bias updating along with the weights?

  12. Aniket Saxena March 29, 2017 at 4:33 am #

    Hello Jason,
    Here in the above code i didn’t understand few lines in evaluate_algorithm function. Please guide me why we use these lines in train_set and row_copy.

    train_set = sum(train_set, [])


    row_copy[-1] = None

    • Jason Brownlee March 29, 2017 at 9:11 am #

      We clear the known outcome so the algorithm cannot cheat when being evaluated.

  13. Aniket Saxena March 29, 2017 at 1:13 pm #

    One more question that after assigning row_copy in test_set, why do we set the last element of row_copy to None, i.e.,
    row_copy[-1] = None

    • Jason Brownlee March 30, 2017 at 8:46 am #

      So that the outcome variable is not made available to the algorithm used to make a prediction.

  14. Aniket Saxena March 31, 2017 at 2:11 am #

    And there is a question: the lookup dictionary's value is updated at every iteration of the for loop in str_column_to_int(), and we return the lookup dictionary, so why do we use the second for loop to update the rows of the dataset in the following lines:

    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

    Does it affect the dataset values after having passed the lookup dictionary and if yes, does the dataset which have been passed to the function evaluate_algorithm() may also alter in the following function call statement :

    scores = evaluate_algorithm(dataset, perceptron, n_folds, l_rate, n_epoch)

  15. Michel May 17, 2017 at 7:17 am #

    Hello, I would like to understand 2 points of the code:
    1. Why, on line 10, do you use train[0]?
    2. According to the formula for the weights, w(t+1) = w(t) + learning_rate * (expected(t) – predicted(t)) * x(t), why does the code use “weights[i + 1] = weights[i + 1] + l_rate * error * row[i]”?
    Where does this plus 1 come from in the weights after the equals sign?

    • Jason Brownlee May 17, 2017 at 8:44 am #

      Because the weight at index zero contains the bias term.

      • Michel May 25, 2017 at 2:36 am #

        Sorry, I still do not get it. Can you explain it a little better?

  16. Sri June 4, 2017 at 3:36 pm #

    Hi, I just finished coding the perceptron algorithm using stochastic gradient descent, i have some questions :

    1) When i train the perceptron on the entire sonar data set with the goal of reaching the minimum “the sum of squared errors of prediction” with learning rate=0.1 and number of epochs=500 the error get stuck at 40.

    What do i do to minimize this error?

    2) This question is regarding the k-fold cross validation test. A model trained on k folds must be less generalized compared to a model trained on the entire dataset. If this is true then how valid is the k-fold cross validation test?

    3) To find the best combination of “learning rate” and “no. of epochs” looks like the real trick behind the learning process. How to find this best combination?

    • Jason Brownlee June 5, 2017 at 7:39 am #

      You could try different configurations of learning rate and epochs.

      k-fold cross validation gives a more robust estimate of the skill of the model when making predictions on new data compared to a train/test split, at least in general.

      There is no “Best” anything in machine learning, just lots of empirical trial and error to see what works well enough for your problem domain:

  17. Vaibhav Rai July 18, 2017 at 5:44 pm #

    Hello sir!
    Can you help me fixing out an error in the randrange function.
    ValueError: empty range for randrange()

    • Jason Brownlee July 19, 2017 at 8:21 am #

      This may be a Python 2 vs Python 3 thing. I used Python 2 in the development of the example.

      • Vaibhav Rai July 19, 2017 at 4:07 pm #

        Actually, I changed the mydata_copy with mydata in cross_validation_split to correct that error, but now a KeyError: 137 is occurring there.

        • Jason Brownlee July 19, 2017 at 4:11 pm #

          Are you able to post more information about your environment (Python version) and the error (the full trace)?

          • Vaibhav Rai July 19, 2017 at 5:13 pm #

            Sir my python version is 3.6 and the error is
            KeyError: 137

          • Jason Brownlee July 20, 2017 at 6:17 am #

            Sorry, the example was developed for Python 2.7.

            I believe the code requires modification to work in Python 3.

  18. Vaibhav Rai July 21, 2017 at 4:38 pm #

    Can you please tell me which other function we can use to do the job of generating indices in place of randrange.

  19. Alex Godfrey August 29, 2017 at 12:31 am #

    How is the baseline value of just over 50% arrived at?

Leave a Reply