The k-Nearest Neighbors algorithm (or kNN for short) is an easy algorithm to understand and to implement, and a powerful tool to have at your disposal.

In this tutorial you will implement the k-Nearest Neighbors algorithm from scratch in Python (2.7). The implementation will be specific for classification problems and will be demonstrated using the Iris flowers classification problem.

This tutorial is for you if you are a Python programmer, or a programmer who can pick up Python quickly, and you are interested in how to implement the k-Nearest Neighbors algorithm from scratch.

## What is k-Nearest Neighbors

The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm searches the training dataset for the k most similar instances. The prediction attribute of those k instances is summarized and returned as the prediction for the unseen instance.

The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used.
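As a minimal sketch of the Hamming distance mentioned above: it counts the positions at which two instances differ. The function name `hammingDistance`, its `length` parameter and the sample rows are assumptions made for illustration; they are not part of the tutorial's code.

```python
def hammingDistance(instance1, instance2, length):
    # Count how many of the first `length` attributes differ.
    distance = 0
    for x in range(length):
        if instance1[x] != instance2[x]:
            distance += 1
    return distance

# Hypothetical rows: three categorical attributes followed by a class label.
row1 = ['red', 'small', 'round', 'a']
row2 = ['red', 'large', 'round', 'b']
print(hammingDistance(row1, row2, 3))  # 1, only the second attribute differs
```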

In the case of regression problems, the average of the predicted attribute may be returned. In the case of classification, the most prevalent class may be returned.
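The two summaries can be sketched as follows, assuming the values or labels of the k nearest neighbors have already been collected into a list. The helper names `summarizeRegression` and `summarizeClassification` are hypothetical.

```python
from collections import Counter

def summarizeRegression(values):
    # Regression: return the mean of the neighbors' real-valued outputs.
    return sum(values) / float(len(values))

def summarizeClassification(labels):
    # Classification: return the most common class among the neighbors.
    return Counter(labels).most_common(1)[0][0]

print(summarizeRegression([2.0, 3.0, 4.0]))        # 3.0
print(summarizeClassification(['a', 'b', 'a']))    # a
```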

## How does k-Nearest Neighbors Work

The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms.

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are retained as part of the model.

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to “win” or be most similar to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.

Finally, kNN is powerful because it does not assume anything about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.

## Classify Flowers Using Measurements

The test problem we will be using in this tutorial is iris classification.

The problem consists of 150 observations of iris flowers from three different species. There are 4 measurements for each flower: sepal length, sepal width, petal length and petal width, all in centimeters. The predicted attribute is the species, which is one of setosa, versicolor or virginica.

It is a standard dataset where the species is known for all instances. As such we can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Good classification accuracy on this problem is above 90% correct, typically 96% or better.

You can download the dataset for free as the file iris.data; see the resources section for further details.

## How to implement k-Nearest Neighbors in Python

This tutorial is broken down into the following steps:

1. **Handle Data**: Open the dataset from CSV and split into test/train datasets.
2. **Similarity**: Calculate the distance between two data instances.
3. **Neighbors**: Locate k most similar data instances.
4. **Response**: Generate a response from a set of data instances.
5. **Accuracy**: Summarize the accuracy of predictions.
6. **Main**: Tie it all together.

### 1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.

```python
import csv

# Python 2.7: 'rb'. On Python 3, use open('iris.data', 'r', newline='') instead.
with open('iris.data', 'rb') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))
```

Next we need to split the data into a training dataset that kNN can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model.

We first need to convert the flower measurements, which were loaded as strings, into numbers that we can work with. Next we need to split the dataset randomly into train and test datasets. A 67/33 train/test ratio is a commonly used split.

Pulling it all together, we can define a function called **loadDataset** that loads a CSV with the provided filename and splits it randomly into train and test datasets using the provided split ratio.

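```python
import csv
import random

def loadDataset(filename, split, trainingSet, testSet):
    # Python 2.7: 'rb'. On Python 3, use open(filename, 'r', newline='') instead.
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        # Drop empty rows, such as a trailing blank line in the file
        dataset = [row for row in lines if row]
    for x in range(len(dataset)):
        # Convert the four measurements from strings to floats
        for y in range(4):
            dataset[x][y] = float(dataset[x][y])
        # Assign each row at random to the training or the test set
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])
```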

Download the iris flowers dataset CSV file to the local directory. We can test this function out with our iris dataset, as follows:

```python
trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))
```

### 2. Similarity

In order to make predictions we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training dataset for a given member of the test dataset and in turn make a prediction.

Given that all four flower measurements are numeric and have the same units, we can directly use the Euclidean distance measure. This is defined as the square root of the sum of the squared differences between the two arrays of numbers (read that again a few times and let it sink in).
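Written out as a formula, the definition above reads as follows. The symbols $a$ and $b$ (the two instances) and $n$ (the number of attributes compared) are notation chosen here for illustration; they do not come from the tutorial's code.

```latex
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```

For example, with $a = (2, 2, 2)$ and $b = (4, 4, 4)$:

```latex
d(a, b) = \sqrt{(2-4)^2 + (2-4)^2 + (2-4)^2} = \sqrt{12} \approx 3.464
```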

Additionally, we want to control which fields are included in the distance calculation. Specifically, we only want to include the first 4 attributes. One approach is to limit the Euclidean distance to a fixed length, ignoring the final dimension (the class label).

Putting all of this together we can define the **euclideanDistance** function as follows:

```python
import math

def euclideanDistance(instance1, instance2, length):
    distance = 0
    # Sum the squared differences over the first `length` attributes
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
```

We can test this function with some sample data, as follows:

```python
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance: ' + repr(distance))
```

### 3. Neighbors

Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.

This is a straightforward process of calculating the distance to all instances and selecting the subset with the smallest distance values.

Below is the **getNeighbors** function that returns the k most similar neighbors from the training set for a given test instance (using the already defined **euclideanDistance** function):

```python
import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # Ignore the last element of the test instance (the class label)
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, closest first
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
```

We can test out this function as follows:

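```python
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
# getNeighbors ignores the last element of the test instance,
# so only the first two values are compared here
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)
```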

### 4. Response

Once we have located the most similar neighbors for a test instance, the next task is to devise a predicted response based on those neighbors.

We can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

Below is a function for getting the majority-voted response from a number of neighbors. It assumes the class is the last attribute for each neighbor.

```python
import operator

def getResponse(neighbors):
    classVotes = {}
    # Each neighbor votes for its own class (the last attribute)
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # items() works on both Python 2.7 and Python 3; sort by vote count, highest first
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
```

We can test out this function with some test neighbors, as follows:

```python
neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
response = getResponse(neighbors)
print(response)
```

This approach returns one response in the case of a draw, but you could handle such cases in a specific way, such as returning no response or selecting an unbiased random response.
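One way to handle draws is sketched below: collect every class that shares the highest vote count and pick one of them at random. The function name `getResponseWithTies` and its optional `rng` parameter are assumptions for illustration, and random selection is only one of the options mentioned above.

```python
import random

def getResponseWithTies(neighbors, rng=random):
    # Count the votes; the class is assumed to be the last attribute.
    classVotes = {}
    for neighbor in neighbors:
        label = neighbor[-1]
        classVotes[label] = classVotes.get(label, 0) + 1
    # Collect every class that reached the highest vote count.
    topCount = max(classVotes.values())
    tied = sorted(label for label, count in classVotes.items() if count == topCount)
    # With a single winner this returns it; with a draw it picks one at random.
    return rng.choice(tied)

neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
print(getResponseWithTies(neighbors))  # a
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break repeatable between runs.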

### 5. Accuracy

We have all of the pieces of the kNN algorithm in place. An important remaining concern is how to evaluate the accuracy of predictions.

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made, called the classification accuracy.

Below is the **getAccuracy** function that sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

```python
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        # Compare the class values with ==, not with the identity test `is`
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
```

We can test this function with a test dataset and predictions, as follows:

```python
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)
```

### 6. Main

We now have all the elements of the algorithm and we can tie them together with a main function.

Below is the complete example of implementing the kNN algorithm from scratch in Python.

```python
# Example of kNN implemented from scratch in Python
import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet, testSet):
    # Python 2.7: 'rb'. On Python 3, use open(filename, 'r', newline='') instead.
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        # Drop empty rows, such as a trailing blank line in the file
        dataset = [row for row in lines if row]
    for x in range(len(dataset)):
        for y in range(4):
            dataset[x][y] = float(dataset[x][y])
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('iris.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
```

Running the example, you will see the results of each prediction compared to the actual class value in the test set. At the end of the run, you will see the accuracy of the model. In this case, it is a little over 98%. Because the train/test split is random, your result will vary from run to run.

```
...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 98.0392156862745%
```

## Ideas For Extensions

This section provides you with ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.

- **Regression**: You could adapt the implementation to work for regression problems (predicting a real-valued attribute). The summarization of the closest instances could involve taking the mean or the median of the predicted attribute.
- **Normalization**: When the units of measure differ between attributes, it is possible for attributes to dominate in their contribution to the distance measure. For these types of problems, you will want to rescale all data attributes into the range 0-1 (called normalization) before calculating similarity. Update the model to support data normalization.
- **Alternative Distance Measure**: There are many distance measures available, and you can even develop your own domain-specific distance measures if you like. Implement an alternative distance measure, such as Manhattan distance or the vector dot product.
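As a starting point for two of these ideas, here is a rough sketch of min-max normalization and of the Manhattan distance. The names `normalizeDataset` and `manhattanDistance` and the sample data are hypothetical, and rescaling the rows in place is just one possible design. In practice the minimum and maximum would normally be taken from the training set and then applied to the test set as well.

```python
def normalizeDataset(dataset, length):
    # Rescale the first `length` columns of every row into the range 0-1, in place.
    for col in range(length):
        values = [row[col] for row in dataset]
        low, high = min(values), max(values)
        span = high - low
        for row in dataset:
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            row[col] = (row[col] - low) / float(span) if span else 0.0

def manhattanDistance(instance1, instance2, length):
    # Sum of absolute differences over the first `length` attributes.
    distance = 0
    for x in range(length):
        distance += abs(instance1[x] - instance2[x])
    return distance

# Hypothetical rows: two numeric attributes on very different scales, then a class.
data = [[1.0, 200.0, 'a'], [2.0, 400.0, 'b'], [3.0, 300.0, 'a']]
normalizeDataset(data, 2)
print(data)  # [[0.0, 0.0, 'a'], [0.5, 1.0, 'b'], [1.0, 0.5, 'a']]
print(manhattanDistance(data[0], data[1], 2))  # 1.5
```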

There are many more extensions to this algorithm you might like to explore. Two additional ideas include support for distance-weighted contribution for the k-most similar instances to the prediction and more advanced data tree-based structures for searching for similar instances.
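Distance-weighted voting could look something like the sketch below, in which each of the k nearest neighbors votes with a weight of one over its distance, so closer neighbors count for more. The function name `getWeightedResponse`, the inverse-distance weighting and the small constant `1e-9` (which guards against division by zero for exact matches) are all assumptions made for illustration.

```python
import math

def getWeightedResponse(trainingSet, testInstance, k, length):
    # Euclidean distance from the test instance to every training row.
    distances = []
    for row in trainingSet:
        dist = math.sqrt(sum((testInstance[x] - row[x]) ** 2 for x in range(length)))
        distances.append((dist, row[-1]))
    distances.sort(key=lambda pair: pair[0])
    # Each of the k nearest rows votes with weight 1/distance.
    votes = {}
    for dist, label in distances[:k]:
        votes[label] = votes.get(label, 0.0) + 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)

# Hypothetical data: one close 'a' neighbor outweighs two distant 'b' neighbors.
train = [[1.0, 1.0, 'a'], [4.0, 4.0, 'b'], [5.0, 5.0, 'b']]
print(getWeightedResponse(train, [1.5, 1.5], 3, 2))  # a
```

With a plain majority vote and k=3 the same data would return 'b', which shows how the weighting changes the outcome.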

## Resources To Learn More

This section will provide some resources that you can use to learn more about the k-Nearest Neighbors algorithm in terms of both theory of how and why it works and practical concerns for implementing it in code.

### Problem

### Code

This section links to open source implementations of kNN in popular machine learning libraries. Review these if you are considering implementing your own version of the method for operational use.

### Books

You may have one or more books on applied machine learning. This section highlights the sections or chapters in common applied books on machine learning that refer to k-Nearest Neighbors.

- Applied Predictive Modeling, pages 159 and 350.
- Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems), pages 76, 128 and 235.
- Machine Learning for Hackers, Chapter 10.
- Machine Learning in Action, Chapter 2.
- Programming Collective Intelligence: Building Smart Web 2.0 Applications, Chapters 2 and 8 and page 293.

## Tutorial Summary

In this tutorial you learned about the k-Nearest Neighbor algorithm, how it works and some metaphors that you can use to think about the algorithm and relate it to other algorithms. You implemented the kNN algorithm in Python from scratch in such a way that you understand every line of code and can adapt the implementation to explore extensions and to meet your own project needs.

Below are the 5 key learnings from this tutorial:

- **k-Nearest Neighbor**: A simple algorithm to understand and implement, and a powerful non-parametric method.
- **Instance-based method**: Model the problem using data instances (observations).
- **Competitive-learning**: Learning and predictive decisions are made by internal competition between model elements.
- **Lazy-learning**: A model is not constructed until it is needed in order to make a prediction.
- **Similarity Measure**: Calculating objective distance measures between data instances is a key feature of the algorithm.

Did you implement kNN using this tutorial? How did you go? What did you learn?

Jason –

I appreciate your step-by-step approach. Your explanation makes this material accessible for a wide audience.

Keep up the great contributions.

Thanks Damian!

A very interesting and clear article. I haven’t tried it out yet but will over the weekend.

Thanks.

Thanks Pete, let me know how you go.

Hey Jason, I’ve ploughed through multiple books and tutorials but your explanation helped me to finally understand what I was doing.

Looking forward to more of your tutorials.

Thanks Alan!

Hey Jason!

Thank you for the awesome article!

Clear and straightforward explanation. I finally understood the background of kNN.

p.s.

There are some code errors in the article.

1) in getResponse it should be “return sortedVote[0]” instead sortedVotes[0][0]

2) in getAccuracy it should be “testSet[x][-1] IN predictions[x]” instead of IS.

Thanks Vadim!

I think the code is right, but perhaps I misunderstood your comments.

If you change getResponse to return sortedVote[0] you will get the class and the count. We don’t want this, we just want the class.

In getAccuracy, I am interested in an equality between the class strings (is), not a set operation (in).

Does that make sense?

Thank you very much for this example!

You’re welcome Mario.

Thank you for the post on kNN implementation.

Any pointers on normalization would be greatly appreciated.

What if the set of features includes fields like name, age, DOB, ID? What are good algorithms to normalize such features?

Hey PVA, great question.

Normalization is just the rescaling of numerical attributes between 0-1. Tools like scikit-learn can do it for you if you like, here’s a recipe: http://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/

You can compute distances between strings using methods like edit distance, learn more here: http://en.wikipedia.org/wiki/Edit_distance

DOB – well the distance between two dates could be in days, hours or whatever makes sense in your domain.

ID might just be useful as some kind of indirect marker of “when the entry was added to the database” if you don’t have a “record create time”.

I hope this helps.

A million thanks !

I’ve had so many starting points for my ML journey, but few have been this clear.

Merci !

Glad to hear it Landry!

Hi,

When I run the code it shows

ValueError: could not convert string to float: ‘sepallength’

What should I do to run the program?

Please help me out as soon as possible.

Thanks in advance.

Hi kumaran,

I believe the example code still works just fine. If I copy-paste the code from the tutorial into a new file called knn.py and download iris.data into the same directory, the example runs fine for me using Python 2.7.

Did you modify the example in some way perhaps?

Hi Jason,

Thanks for your reply.

I am using Anaconda IDE 3.4.

Yes, it works well for the iris dataset. If I try some other dataset it shows a ValueError, because those datasets contain strings along with the integers.

For example, the forest fires dataset:

| X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|-------|-----|------|-----|----|-----|------|----|------|------|------|
| 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0 | 0 |
| 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18 | 33 | 0.9 | 0 | 0 |
| 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0 | 0 |
| 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9 | 8.3 | 97 | 4 | 0.2 | 0 |
| 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0 | 0 |

Is it possible to classify these datasets with your code as well?

Please point me to another classifier code example in Python if there is one.

Excellent article on knn. It made the concepts so clear.

Thanks sanksh!

I like how it is explained, simply and clear. Great job.

Thanks!

Great article Jason !! Crisp and Clear.

Nice article Jason. I am a software engineer new to ML. Your step-by-step approach made learning easy and fun. Though Python was new to me, it became very easy since I could run small snippets instead of trying to understand the entire program at once.

Appreciate your hardwork. Keep it up.

Thanks Raju.

It’s really fantastic for me. I can’t find a better one

I also faced the same problem as Kumaran. After checking, I think the “can’t convert string into float” problem arises because the first row is “sepal_length” and so on. Python can’t convert it since it is entirely a string. So just delete that row or change the code a little.

Hi,

Many thanks for this detailed article. Any clues for the extension ideas?

Thanks,

RK

Hi – I was wondering how we can have the data fed into the system without randomly shuffling as I am trying to make a prediction on the final line of data?

Do we remove:

if random.random() < split

and replace with something like:

if len(trainingSet)/len(dataset) < split

# if < 0.67 then append to the training set, otherwise append to test set

The reason I ask is that I know what data I want to predict and with this it seems that it could use the data I want to predict within the training set due to the random selection process.

I also have the same dilemma as you. I have been working by trial and error, but right now I can’t seem to work out which code should be omitted to create a prediction.

I am not a software engineer nor I have a background in computer science. I am pretty new to data science and ML as well, I just started learning Python and R but the experience is GREAT!

Thanks so much for this Jason!

This article was absolutely gorgeous. As a computational physicist grad student who has taken an interest in machine learning this was the perfect level to skim, get my hands dirty and have some fun.

Thank you so much for the article on this. I’m excited to see the rest of your site.

Thanks for the article!

I wished to write my own knn python program, and that was really helpful !

Thanks a lot for sharing this.

One thing you didn’t mention though is how you chose k=3.

To get a feeling for how sensitive the accuracy % is to k, I wrote a “screening” function that iterates over k on the training set, using leave-one-out cross-validation accuracy % as a ranking.

Would you have any other suggestions ?

This is really really helpful. Thanks man !!

An incredibly useful tutorial, Jason. Thank you for this.

Please could you show me how you would modify your code to work with a data set which comprises strings (i.e. text) and not numerical values?

I’m really keen to try this algorithm on text data but can’t seem to find a decent article online.

Your help is much appreciated.

Mark

Nice tutorial! Very helpful in explaining KNN — python is so much easier to understand than the mathematical operations. One thing though — the way the range function works for Python is that the final element is not included.

In loadDataset() you have

`for x in range(len(dataset)-1):`

This should simply be:

`for x in range(len(dataset)):`

otherwise the last row of data is omitted!

Thank you so much

great

thank very much

That’s great! I’ve tried so many books and articles to start learning ML. Your article is the first clear one! Thank you a lot! Please, keep teaching us!)

Thanks Gleb!

Hi Jason,

Thanks for this amazing introduction! I have two questions that relate to my study on this.

First is, how is optimization implemented in this code?

Second is, what is the strength of the induction this algorithm is making as explained above, and will this be a useful induction for a thinking machine?

Thank you so much!

Hi Jason,

It is a great tutorial and it helped me a lot, thanks for the great effort. But I have a question: what if I want to split the data randomly into 100 training instances and 50 test instances, and write them to separate files with their values instead of printing the total numbers? I want to test them in Hugin.

thank you so much!

Hi Jason,

It is a really great tutorial. Your article is so clear, but I have a problem.

When I run code, I see the right classification.

> predicted=’Iris-virginica’, actual=’Iris-virginica’

> predicted=’Iris-virginica’, actual=’Iris-virginica’

> predicted=’Iris-virginica’, actual=’Iris-virginica’

> predicted=’Iris-virginica’, actual=’Iris-virginica’

…

However, the accuracy is 0%. I ran the accuracy test but found no problem with the code.

How can I fix the accuracy? Where did I make a mistake?

Thanks for reply and your helps.

Hi, I solved this as follows.

Originally, in step 5, the function getAccuracy has:

```python
...
for x in range(len(testSet)):
    if testSet[x][-1] is predictions[x]:
        correct += 1
...
```

The key here is the IF statement:

`if testSet[x][-1] is predictions[x]:`

Change “IS” to “==” so that getAccuracy is now:

```python
...
for x in range(len(testSet)):
    if testSet[x][-1] == predictions[x]:
        correct += 1
...
```

That solves the problem and works OK!

I think setting the value of K plays an important role in the accuracy of the prediction. How do you determine the best value of ‘K’? Please suggest some best practices.

Dear, how would I do this for multiclass classification with data in an Excel sheet: images of digits (not handwritten), with the label of each image in the next column?

This tutorial is entirely about numeric data; it just gave me the idea for images.

Very clear explanation and step by step working make this very understandable. I am not sure why the list sortedVotes within the function getResponse is reversed, I thought getResponse is meant to return the most common key in the dictionary classVotes. If you reverse the list, doesn’t this return the least common key in the dictionary?

I do not know how to handle tied votes among the k nearest neighbours with 3 classes, for example [1,1,2,2,0]. With two classes and an odd value of k we can always find the maximum vote, but ties can happen if we have three classes.

Thanks in advance

hi

thanks for this great effort buddy

i have some basic questions:

1: I opened the “iris.data” file and it simply shows in an HTML window. How do I download it?

2: If I copy and paste from the HTML page, where do I paste it?

You can use File->Save as in your browser to save the file, or copy the text and paste it into a new file and save it as the file “iris.data” expected by the tutorial.

I hope that helps.

Jason.

This is a really simple but thorough explanation. Thanks for the effort.

Could you suggest how to draw a scatter plot for the 3 classes? It would be really great if you could upload the code. Thanks in advance!

What if we want to classify text into categories using KNN,

e.g a given paragraph of text defines {Politics,Sports,Technology}

I’m Working on a project to Classify RSS Feeds

How to download the file without using library csv at the first stage?