4 Distance Measures for Machine Learning

By Jason Brownlee on August 19, 2020 in Python Machine Learning

Distance measures play an important role in machine learning.

They provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.

Different distance measures must be chosen and used depending on the types of the data. As such, it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

In this tutorial, you will discover distance measures in machine learning.

After completing this tutorial, you will know:

The role and importance of distance measures in machine learning algorithms.
How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Let’s get started.

Distance Measures for Machine Learning
Photo by Prince Roy, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

Role of Distance Measures
Hamming Distance
Euclidean Distance
Manhattan Distance (Taxicab or City Block)
Minkowski Distance

Role of Distance Measures

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way you will encounter distance measures is when you are using a specific machine learning algorithm that uses distance measures at its core. The most famous algorithm of this type is the k-nearest neighbors algorithm, or KNN for short.

In the KNN algorithm, a classification or regression prediction is made for new examples by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k examples in the training dataset with the smallest distance are then selected and a prediction is made by averaging the outcome (mode of the class label or mean of the real value for regression).
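The prediction procedure described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name knn_predict, the tiny dataset, and the choice of Euclidean distance are all assumptions made for the example.

```python
# minimal k-nearest neighbors classification sketch (illustrative only)
from collections import Counter
from math import sqrt

# euclidean distance between two rows of numbers
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# predict a class label for new_row from the k closest training rows
def knn_predict(train_rows, train_labels, new_row, k=3):
    # distance from the new row to every row in the training dataset
    distances = [(euclidean_distance(row, new_row), label)
                 for row, label in zip(train_rows, train_labels)]
    # keep the k rows with the smallest distance
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # the prediction is the most common label (the mode) among the neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# made-up training data: two well separated groups
train_rows = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']
# classify a new example
prediction = knn_predict(train_rows, train_labels, [1.1, 1.0], k=3)
print(prediction)
```

For regression, the last step would instead return the mean of the neighbors' target values.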

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is the learning vector quantization, or LVQ, algorithm that may also be considered a type of neural network.

Related is the self-organizing map algorithm, or SOM, that also uses distance measures and can be used for supervised or unsupervised learning. Another unsupervised learning algorithm that uses distance measures at its core is the K-means clustering algorithm.

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

A short list of some of the more popular machine learning algorithms that use distance measures at their core is as follows:

K-Nearest Neighbors
Learning Vector Quantization (LVQ)
Self-Organizing Map (SOM)
K-Means Clustering

Many kernel-based methods may also be considered distance-based algorithms. Perhaps the most widely known kernel method is the support vector machine algorithm, or SVM for short.

Do you know more algorithms that use distance measures?
Let me know in the comments below.

When calculating the distance between two examples or rows of data, it is possible that different data types are used for different columns of the examples. An example might have real values, boolean values, categorical values, and ordinal values. Different distance measures may be required for each that are summed together into a single distance score.
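One way to combine per-column measures into a single score is sketched below. The column type labels, the equal weighting of columns, and the assumption that numeric columns are already scaled to the range 0 to 1 are all choices made for illustration; a real project may need different weights or measures.

```python
# sketch of a distance for rows with mixed data types (illustrative only)

# sum a per-column distance chosen by the declared type of each column
def mixed_distance(a, b, column_types):
    total = 0.0
    for x, y, kind in zip(a, b, column_types):
        if kind == 'numeric':
            # absolute difference, assuming values were already scaled to [0, 1]
            total += abs(x - y)
        elif kind == 'categorical':
            # 0 if the values match, 1 if they differ (also works for booleans)
            total += 0.0 if x == y else 1.0
        else:
            raise ValueError('unknown column type: %s' % kind)
    return total

# define data: two scaled numeric columns, a category, and a boolean
column_types = ['numeric', 'numeric', 'categorical', 'categorical']
row1 = [0.25, 0.5, 'red', True]
row2 = [0.75, 0.5, 'blue', True]
# calculate distance
dist = mixed_distance(row1, row2, column_types)
print(dist)  # 1.5
```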

Numerical values may have different scales. This can greatly impact the calculation of distance measure and it is often a good practice to normalize or standardize numerical values prior to calculating the distance measure.
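The effect of scale can be seen with a small made-up dataset of ages and incomes. The sketch below uses min-max normalization, which is one of several possible scaling choices, and the helper name min_max_scale is hypothetical.

```python
# sketch showing how column scale changes which row is nearest (illustrative only)
from math import sqrt

# euclidean distance between two rows of numbers
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# rescale every column to the range [0, 1]
def min_max_scale(rows):
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# made-up data: [age in years, income in dollars]
rows = [[25, 40000], [60, 41000], [27, 45000], [40, 90000]]
# raw distances from the first row: income dominates
raw_01 = euclidean_distance(rows[0], rows[1])
raw_02 = euclidean_distance(rows[0], rows[2])
print(raw_01, raw_02)
# scaled distances from the first row: age now contributes as well
scaled = min_max_scale(rows)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])
print(scaled_01, scaled_02)
```

On the raw values the second row looks closest to the first, even though the ages differ by 35 years. After scaling, the third row is closest.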

Numerical error in regression problems may also be considered a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance measure that can be summed or averaged over all examples in a test set to give a total distance between the expected and predicted outcomes in the dataset. The calculation of the error, such as the mean squared error or mean absolute error, may resemble a standard distance measure.
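As a small illustration of this idea, the sketch below computes the mean absolute error and mean squared error for made-up expected and predicted values. The per-example terms are the same absolute and squared differences used by the distance measures covered later in this tutorial.

```python
# sketch of regression error viewed as a distance (made-up values)

# define expected and predicted outcomes
expected = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
n = len(expected)
# sum of absolute differences, averaged over all examples
mae = sum(abs(e - p) for e, p in zip(expected, predicted)) / n
# sum of squared differences, averaged over all examples
mse = sum((e - p) ** 2 for e, p in zip(expected, predicted)) / n
print(mae)  # 0.75
print(mse)  # 0.875
```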

As we can see, distance measures play an important role in machine learning. Perhaps four of the most commonly used distance measures in machine learning are as follows:

Hamming Distance
Euclidean Distance
Manhattan Distance
Minkowski Distance

What are some other distance measures you have used or heard of?
Let me know in the comments below.

You need to know how to calculate each of these distance measures when implementing algorithms from scratch and the intuition for what is being calculated when using algorithms that make use of these distance measures.

Let’s take a closer look at each in turn.

Hamming Distance

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns of data.

For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one-hot encode each example as a bitstring with one bit for each category.

red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

For a one-hot encoded string, it might make more sense to summarize to the sum of the bit differences between the strings, which will always be 0 (the same category) or 2 (different categories, which differ in exactly two bit positions).

HammingDistance = sum for i to N abs(v1[i] – v2[i])
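The summed form can be checked on the color encoding above. This is a small sketch and the function name hamming_distance_sum is hypothetical. Two different one-hot vectors differ in exactly two bit positions, so the sum is 2 for different categories and 0 for the same category.

```python
# sketch of the summed (not averaged) hamming distance on one-hot vectors

# count the number of differing bit positions
def hamming_distance_sum(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# define one-hot encoded categories
red = [1, 0, 0]
green = [0, 1, 0]
# different categories
print(hamming_distance_sum(red, green))  # 2
# same category
print(hamming_distance_sum(red, red))  # 0
```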

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a Hamming distance score between 0 (identical) and 1 (all different).

HammingDistance = (sum for i to N abs(v1[i] – v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings, listed below.

# calculating hamming distance between bit strings

# calculate hamming distance
def hamming_distance(a, b):
	return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)

Running the example reports the Hamming distance between the two bitstrings.

We can see that there are two differences between the strings, or 2 out of 6 bit positions different, which averaged (2/6) is about 1/3 or 0.333.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

# calculating hamming distance between bit strings
from scipy.spatial.distance import hamming
# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333

Euclidean Distance

Euclidean distance calculates the distance between two real-valued vectors.

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.

Although there are other possible choices, most instance-based learners use Euclidean distance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

EuclideanDistance = sqrt(sum for i to N (v1[i] – v2[i])^2)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores will no longer be in the same proportions, but they will keep the same rank ordering, so they can still be used effectively within a machine learning algorithm for finding the most similar examples.

SquaredEuclideanDistance = sum for i to N (v1[i] – v2[i])^2

This calculation is related to the L2 vector norm. Without the square root it is equivalent to the sum squared error, and with the square root it is equivalent to the root sum squared error.
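A quick way to check that dropping the square root does not change which examples are closest is to rank a few candidates both ways. The query and candidate rows below are made up for illustration.

```python
# sketch comparing neighbor rankings with and without the square root
from math import sqrt

# sum of squared differences, with no square root
def squared_euclidean_distance(a, b):
    return sum((e1 - e2) ** 2 for e1, e2 in zip(a, b))

# define data
query = [10, 20, 15, 10, 5]
candidates = [[12, 24, 18, 8, 7], [10, 21, 15, 10, 5], [30, 5, 0, 10, 9]]
indexes = range(len(candidates))
# rank candidates by squared distance
rank_squared = sorted(indexes, key=lambda i: squared_euclidean_distance(query, candidates[i]))
# rank candidates by true euclidean distance
rank_root = sorted(indexes, key=lambda i: sqrt(squared_euclidean_distance(query, candidates[i])))
print(rank_squared)  # [1, 0, 2]
print(rank_root)  # [1, 0, 2]
```

The square root is a monotonically increasing function, so it never changes the order of non-negative values.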

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

# calculating euclidean distance between vectors
from math import sqrt

# calculate euclidean distance
def euclidean_distance(a, b):
	return sqrt(sum((e1-e2)**2 for e1, e2 in zip(a,b)))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean_distance(row1, row2)
print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example is listed below.

# calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = euclidean(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

6.082762530298219

Manhattan Distance (Taxicab or City Block Distance)

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.
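The taxicab intuition can be seen with two made-up grid coordinates. A taxi that can only drive along the grid travels the Manhattan distance, which is never shorter than the straight-line Euclidean distance.

```python
# sketch contrasting taxicab and straight-line distance on a grid
from math import sqrt

# sum of absolute differences along each axis
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# straight-line distance
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# define two grid coordinates: 3 blocks east and 4 blocks north apart
start = [0, 0]
end = [3, 4]
print(manhattan_distance(start, end))  # 7
print(euclidean_distance(start, end))  # 5.0
```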

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

ManhattanDistance = sum for i to N |v1[i] – v2[i]|

The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

# calculating manhattan distance between vectors

# calculate manhattan distance
def manhattan_distance(a, b):
	return sum(abs(e1-e2) for e1, e2 in zip(a,b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = manhattan_distance(row1, row2)
print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = cityblock(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

13

Minkowski Distance

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “order” or “p”, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

MinkowskiDistance = (sum for i to N (abs(v1[i] – v2[i]))^p)^(1/p)

Where “p” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

p=1: Manhattan distance.
p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “p” that can be tuned.
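To give a feel for how the order parameter behaves, the sketch below evaluates the Minkowski distance for a few values of p on the same pair of rows. The particular values of p are arbitrary choices for illustration. In practice p would be chosen by evaluating the model that uses the distance, not by inspecting the distances themselves.

```python
# sketch of the minkowski distance for several values of p (illustrative only)

# calculate minkowski distance of order p
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# evaluate the distance for manhattan (1), an intermediate value, and euclidean (2)
results = {}
for p in [1, 1.5, 2]:
    results[p] = minkowski_distance(row1, row2, p)
    print(p, results[p])
```

For a fixed pair of vectors the distance does not increase as p grows, so the intermediate value falls between the Manhattan and Euclidean results.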

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

# calculating minkowski distance between vectors

# calculate minkowski distance
def minkowski_distance(a, b, p):
	return sum(abs(e1-e2)**p for e1, e2 in zip(a,b))**(1/p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example first calculates and prints the Minkowski distance with p set to 1 to give the Manhattan distance, then with p set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

13.0
6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

# calculating minkowski distance between vectors
from scipy.spatial import minkowski_distance
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

13.0
6.082762530298219

Summary

In this tutorial, you discovered distance measures in machine learning.

Specifically, you learned:

The role and importance of distance measures in machine learning algorithms.
How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

40 Responses to 4 Distance Measures for Machine Learning

Martin March 26, 2020 at 5:47 pm #

Why didn’t you write about Mahalanobis distance? It’s much better than Euclidean, if we consider different measure scales of variables and correlations between them.

Reply
- Mel March 27, 2020 at 5:38 am #
  
  Upvote for covering Mahalanobis distance!
  
  Reply
  - Jason Brownlee March 27, 2020 at 6:21 am #
    
    Thanks Mel!
    
    Reply
    - Subrata Manna June 11, 2022 at 4:20 pm #
      
      Can you suggest me the books for calculating the distances for data science
      
      Reply
      - James Carmichael June 12, 2022 at 9:27 am #
        
        Hi Subrata…The following resource may be of interest:
        
        https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
  - Dawn September 20, 2020 at 11:35 am #
    
    Another upvote for Mahalanobis
    
    Reply
    - Jason Brownlee September 20, 2020 at 1:34 pm #
      
      Thanks, it’s on my TODO.
      
      Reply
- Jason Brownlee March 27, 2020 at 6:04 am #
  
  Great suggestion, thanks!
  
  Reply
Furkan BAĞCI March 30, 2020 at 4:52 am #

Agree with the comment above. Furthermore, the difference between mahalanobis and eucliden distance metric could be explained by using unsupervised support vector clustering algorithm that uses euclidean distance and unsupervised ellipsoidal support vector clustering algorithm that uses mahalanobis distance metric.

Reply
- Jason Brownlee March 30, 2020 at 5:39 am #
  
  Thanks.
  
  Reply
Joe April 2, 2020 at 10:18 am #

No love for vector cosine similarity?

Reply
- Jason Brownlee April 2, 2020 at 1:30 pm #
  
  Great suggestion, thanks.
  
  Reply
ming April 5, 2020 at 6:52 pm #

How about EMD distance?

Reply
- Jason Brownlee April 6, 2020 at 6:03 am #
  
  What is EMD distance?
  
  Reply
  - ming April 18, 2020 at 2:30 pm #
    
    Earth Mover’s Distance
    
    Reply
    - Jason Brownlee April 19, 2020 at 5:50 am #
      
      Thanks.
      
      Reply
Grzegorz Kępisty May 22, 2020 at 4:53 pm #

It is worth mention that in some advance cases the default metric option are not enough (for example metric options available for KNN in sklearn). I am working currently on the project in which KNN distance is defined using both categorical columns ( having various distance weight in case of value difference ) and numerical columns (having distance proportional to absolute value difference). Final distance is a sum of distances over columns.
Don’t be afraid of custom metrics!
Regards!

Reply
- Jason Brownlee May 23, 2020 at 6:13 am #
  
  Great tip!
  
  Reply
Anshul Verma July 17, 2020 at 4:10 am #

Whats the difference between , similarity and distance ?

Also , difference between :
1 Cosine distance and Euclidean distance ?
2 Cosine similarity and Euclidean similarity ?

Reply
- Jason Brownlee July 17, 2020 at 6:25 am #
  
  Not a lot, in this context they mean the same thing.
  
  Reply
Blacky December 7, 2020 at 5:10 pm #

Hi, im still learning bout this distance measurement. can i ask you a question sir?

how did the rows data in euclidean work and how to obtain the data? is it a random numerical value?

i hope this question didnt too much for you sir. thank you

Reply
- Jason Brownlee December 8, 2020 at 7:40 am #
  
  You would collect data from your domain, each row of data would be one observation.
  
  Reply
  - Blacky December 8, 2020 at 12:33 pm #
    
    in my case, im doing a project to measure the similarity for images. so can i used the coordinates of the image as my data?
    
    Reply
    - Jason Brownlee December 8, 2020 at 1:32 pm #
      
      I believe there are specific measures used for comparing the similarity between images (matrix of pixels). I recommend checking the literature.
      
      Reply
  - Sumanth June 22, 2021 at 7:31 pm #
    
    Cosine model in distance algorithm is missing, update it with cosine model that is 5 the model of distance measure
    
    Reply
    - Jason Brownlee June 23, 2021 at 5:36 am #
      
      Thanks for the suggestion.
      
      Reply
Tim December 11, 2020 at 12:17 pm #

New to Distance Measuring; For an unsupervised learning K-Clustering Analysis is there a preferred method. My variables relate to shopping and trying to identify groups of customers with same shopping habits, i have customer information (age, income, education level) and products they purchase. Thanks.

Reply
- Jason Brownlee December 11, 2020 at 1:31 pm #
  
  Yes, there are specific metrics for clustering:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-evaluate-a-clustering-algorithm
  
  Reply
Kidist June 10, 2021 at 8:23 pm #

Cosin Similarity Distance

Reply
- Jason Brownlee June 11, 2021 at 5:15 am #
  
  Great suggestion!
  
  Reply
Zeenat Fatima Lina August 25, 2021 at 2:26 am #

I love the subject very much. Friday i has a exam on clustering, ACO, perceptron. I can every math and i love to do the math. I have a wish to work with Data science in PhD.

Reply
ghizlane April 11, 2022 at 9:56 am #

Hi Jason, thank for this post
In the same context, i have tuned the distance metric of KKNN algorithm in caret and it gives me three values 1, 2 and 3 using the random search. What does this mean?

Reply
- James Carmichael April 14, 2022 at 3:40 am #
  
  Hi Ghizlane…You may find the following of interest:
  
  https://www.analyticsvidhya.com/blog/2021/08/how-knn-uses-distance-measures/
  
  Reply
Ghizlane April 17, 2022 at 9:50 am #

Thanks James for the reply, I will check the link

Reply
Dani August 29, 2022 at 4:23 pm #

Hi Jason very nice post, does PCA can be considered a measure based algorithm? When the projection is computed it uses a distance measure at its core, wright?

Reply
- James Carmichael August 30, 2022 at 6:40 am #
  
  Hi Dani…You may find the following of interest:
  
  https://www.keboola.com/blog/pca-machine-learning
  
  Reply
Renato October 24, 2023 at 2:00 am #

Many thanks for the contribution.
Please could you tell me how to initialize the K-Means process with Manhattan distance method?
Same as Euclidean distance? Choice of Cluster, Centroids calculation, iteration..etc?
Thank you

Reply
- James Carmichael October 24, 2023 at 10:10 am #
  
  You are very welcome Renato! The following resource may be of interest:
  
  https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
  
  Reply
farid November 14, 2023 at 6:51 am #

Hello
can you help me about a better way than wesserstain statistical distance for metric in machin learning?

Reply
- James Carmichael November 14, 2023 at 10:23 am #
  
  Hi farid…Please elaborate on the criteria you consider when comparing algorithms. In other words, please elaborate on the issue or concern you have regarding Wasserstein.
  
  Alternatives can be found here:
  
  https://towardsdatascience.com/statistical-tests-wont-help-you-to-compare-distributions-d829eefe418
  
  Reply
