How to Calculate the 5-Number Summary for Your Data in Python

By Jason Brownlee on August 8, 2019 in Statistics

Data summarization provides a convenient way to describe all of the values in a data sample with just a few statistical values.

The mean and standard deviation are used to summarize data with a Gaussian distribution, but may not be meaningful, or could even be misleading, if your data sample has a non-Gaussian distribution.

In this tutorial, you will discover the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

After completing this tutorial, you will know:

Data summarization statistics, such as the mean and standard deviation, are only meaningful for the Gaussian distribution.
The five-number summary can be used to describe a data sample with any distribution.
How to calculate the five-number summary in Python.


Let’s get started.


Tutorial Overview

This tutorial is divided into 4 parts; they are:

Nonparametric Data Summarization
Five-Number Summary
How to Calculate the Five-Number Summary
Use of the Five-Number Summary


Nonparametric Data Summarization

Data summarization techniques provide a way to describe the distribution of data using a few key measurements.

The most common example of data summarization is the calculation of the mean and standard deviation for data that has a Gaussian distribution. With these two parameters alone, you can understand and re-create the distribution of the data. The data summary can compress as few as tens or as many as millions of individual observations.

The problem is that the mean and standard deviation are not a useful summary of data that does not have a Gaussian distribution. Technically, you can calculate these quantities, but they do not summarize the data distribution; in fact, they can be very misleading.

In the case of data that does not have a Gaussian distribution, you can summarize the data sample using the five-number summary.
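
To make the point about misleading summaries concrete, the sketch below compares the mean and standard deviation with the median on a skewed sample. The exponential distribution, the seed, and the sample size are assumptions chosen for illustration; they are not part of the tutorial's own example.

```python
# compare mean/std with the median on a skewed (non-Gaussian) sample
from numpy import mean, median, std
from numpy.random import exponential, seed

# seed the generator so the run is repeatable (the seed value is arbitrary)
seed(1)
# draw a right-skewed sample from an exponential distribution (illustrative choice)
data = exponential(scale=1.0, size=1000)
# Gaussian-style summary
data_mean, data_std = mean(data), std(data)
# fraction of observations that fall below the mean
below_mean = (data < data_mean).mean()
print('Mean: %.3f, Std: %.3f' % (data_mean, data_std))
print('Median: %.3f' % median(data))
print('Fraction below the mean: %.3f' % below_mean)
# a Gaussian with this mean and std would place observations well below zero,
# but the exponential sample contains no negative values
print('Mean - 2*Std: %.3f, Min: %.3f' % (data_mean - 2 * data_std, data.min()))
```

On a skewed sample like this, the mean sits above the median and more than half of the observations fall below it, so reading the mean as "the middle" of the data would be misleading.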

Five-Number Summary

The five-number summary, or 5-number summary for short, is a non-parametric data summarization technique.

It is sometimes called the Tukey 5-number summary because it was recommended by John Tukey. It can be used to describe the distribution of data samples for data with any distribution.

As a standard summary for general use, the 5-number summary provides about the right amount of detail.

— Page 37, Understanding Robust and Exploratory Data Analysis, 2000.

The five-number summary involves the calculation of 5 summary statistical quantities, namely:

Median: The middle value in the sample, also called the 50th percentile or the 2nd quartile.
1st Quartile: The 25th percentile.
3rd Quartile: The 75th percentile.
Minimum: The smallest observation in the sample.
Maximum: The largest observation in the sample.

A quartile is an observed value at a point that aids in splitting the ordered data sample into four equally sized parts. The median, or 2nd Quartile, splits the ordered data sample into two parts, and the 1st and 3rd quartiles split each of those halves into quarters.

A percentile is an observed value at a point that aids in splitting the ordered data sample into 100 equally sized portions. Quartiles are often also expressed as percentiles.

Both the quartile and percentile values are examples of rank statistics that can be calculated on a data sample with any distribution. They are used to quickly summarize how much of the data in the distribution is behind or in front of a given observed value. For example, half of the observations are behind the median of a distribution and half are in front of it.
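
As a rough check of this idea, the sketch below counts how many observations sit at or below each quartile of a sample. The uniform sample, seed, and variable names are assumptions for illustration only.

```python
# count how many observations fall at or below each quartile
from numpy import percentile
from numpy.random import rand, seed

# seed the generator so the run is repeatable (the seed value is arbitrary)
seed(1)
data = rand(1000)
# calculate the three quartiles
q1, q2, q3 = percentile(data, [25, 50, 75])
counts = {}
for name, q in [('Q1', q1), ('Median', q2), ('Q3', q3)]:
    # number of observations that are behind (at or below) this value
    counts[name] = int((data <= q).sum())
    pct = 100.0 * counts[name] / len(data)
    print('%s: %.3f, observations at or below: %d (%.1f%%)' % (name, q, counts[name], pct))
```

Roughly a quarter, a half, and three quarters of the observations should fall at or below the 1st quartile, the median, and the 3rd quartile respectively, whatever the shape of the distribution.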

Note that quartiles are also calculated in the box and whisker plot, a nonparametric method to graphically summarize the distribution of a data sample.

How to Calculate the Five-Number Summary

Calculating the five-number summary involves finding the observations for each quartile as well as the minimum and maximum observed values from the data sample.

If there is no specific value in the ordered data sample for the quartile, such as if there are an even number of observations and we are trying to find the median, then we can calculate the mean of the two closest values, such as the two middle values.
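
The rule for the median can be sketched in plain Python as follows. The function name manual_median is hypothetical and used only for this illustration.

```python
# manual median: take the middle value, or the mean of the two middle values
def manual_median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd number of observations: the middle value exists in the sample
        return ordered[mid]
    # even number of observations: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print(manual_median([7, 1, 3]))     # odd sample, middle value is 3
print(manual_median([7, 1, 3, 5]))  # even sample, mean of 3 and 5 is 4.0
```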

We can calculate arbitrary percentile values in Python using the percentile() NumPy function. We can use this function to calculate the 1st, 2nd (median), and 3rd quartile values. The function takes both an array of observations and a floating point value to specify the percentile to calculate in the range of 0 to 100. It can also take a list of percentile values to calculate multiple percentiles; for example:

quartiles = percentile(data, [25, 50, 75])


By default, the function will calculate a linear interpolation between observations if needed. In the case of calculating the median on a sample with an even number of values, this is the average of the two middle values.
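
A small sample makes the default interpolation easy to follow by hand. The four-value array below is an assumption chosen for illustration; with the default settings, the percentile p falls at position (p / 100) * (n - 1) in the ordered sample, and the result is interpolated between the two neighboring observations.

```python
# default linear interpolation of percentile() on a tiny sample
from numpy import array, percentile

data = array([1.0, 2.0, 3.0, 4.0])
# positions in the ordered sample: 0.75, 1.5 and 2.25
quartiles = percentile(data, [25, 50, 75])
print(quartiles)
# the median lies halfway between the two middle values, 2.0 and 3.0
print('Median: %.2f' % quartiles[1])
```

The 25th percentile lands three quarters of the way from 1.0 to 2.0 (1.75), the median halfway between 2.0 and 3.0 (2.5), and the 75th percentile a quarter of the way from 3.0 to 4.0 (3.25), so none of the three is an observed value in this sample.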

The NumPy functions min() and max() can be used to return the smallest and largest values in the data sample; for example:

data_min, data_max = data.min(), data.max()
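
As an alternative sketch, the minimum and maximum can also be requested from percentile() as the 0th and 100th percentiles, giving all five values in one call. The seed and variable names here are assumptions for illustration.

```python
# get all five values from a single percentile() call
from numpy import percentile
from numpy.random import rand, seed

# seed the generator so the run is repeatable (the seed value is arbitrary)
seed(1)
data = rand(1000)
# the 0th and 100th percentiles correspond to the min and max of the sample
summary = percentile(data, [0, 25, 50, 75, 100])
print('Min: %.3f (data.min() gives %.3f)' % (summary[0], data.min()))
print('Max: %.3f (data.max() gives %.3f)' % (summary[4], data.max()))
```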


We can put all of this together.

The example below generates a data sample drawn from a uniform distribution between 0 and 1 and summarizes it using the five-number summary.

# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand
# generate data sample
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)


Running the example generates the data sample and calculates the five-number summary to describe the sample distribution.

We can see that the spread of observations is close to our expectations, showing 0.277 for the 25th percentile, 0.532 for the 50th percentile, and 0.766 for the 75th percentile, close to the idealized values of 0.25, 0.50, and 0.75 respectively. Your specific results will vary because the data sample is randomly generated.

Min: 0.000
Q1: 0.277
Median: 0.532
Q3: 0.766
Max: 1.000


Use of the Five-Number Summary

The five-number summary can be calculated for a data sample with any distribution.

This includes data that has a known distribution, such as a Gaussian or Gaussian-like distribution.

I would recommend always calculating the five-number summary, and only moving on to distribution-specific summaries, such as the mean and standard deviation for the Gaussian, in the case that you can identify the distribution to which the data belongs.
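
One way to follow this advice is sketched below: a small helper computes the five-number summary, and the mean and standard deviation are reported alongside it for a Gaussian sample. The helper name five_number_summary, the seed, and the sample parameters (mean 50, standard deviation 5) are assumptions for illustration.

```python
# five-number summary alongside the mean and standard deviation
from numpy import asarray, mean, percentile, std
from numpy.random import randn, seed

def five_number_summary(sample):
    """Return (min, Q1, median, Q3, max) for a 1-D data sample."""
    sample = asarray(sample, dtype=float)
    q1, med, q3 = percentile(sample, [25, 50, 75])
    return sample.min(), q1, med, q3, sample.max()

# seed the generator so the run is repeatable (the seed value is arbitrary)
seed(1)
# Gaussian sample with a mean of 50 and a standard deviation of 5
data = 5 * randn(1000) + 50
summary = five_number_summary(data)
for name, value in zip(['Min', 'Q1', 'Median', 'Q3', 'Max'], summary):
    print('%s: %.3f' % (name, value))
# distribution-specific summary, appropriate here because the data is Gaussian
print('Mean: %.3f, Std: %.3f' % (mean(data), std(data)))
```

For a Gaussian sample, the median should sit close to the mean and the quartiles should be roughly symmetric around it, so the two summaries tell a consistent story.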

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

Describe three examples in a machine learning project where a five-number summary could be calculated.
Generate a data sample with a Gaussian distribution and calculate the five-number summary.
Write a function to calculate a 5-number summary for any data sample.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

Specifically, you learned:

Data summarization statistics, such as the mean and standard deviation, are only meaningful for the Gaussian distribution.
The five-number summary can be used to describe a data sample with any distribution.
How to calculate the five-number summary in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

21 Responses to How to Calculate the 5-Number Summary for Your Data in Python

£aique Merlin June 13, 2018 at 4:08 pm #

Great post as usual. Very important concepts to know before further exploring your data. Keep it up Jason!

- Jason Brownlee June 14, 2018 at 5:58 am #
  
  Thanks, I’m glad it helped.
  
Aanchal Iyer June 14, 2018 at 11:12 pm #

Well written article, Python has the ability to manipulate some statistical data and calculate results of various statistical operations.

- Jason Brownlee June 15, 2018 at 6:44 am #
  
  Thanks.
  
afees June 17, 2018 at 7:18 pm #

I am happy to get knowledge here

- Jason Brownlee June 18, 2018 at 6:41 am #
  
  I’m glad the material is helpful to you.
  
Sailaja June 19, 2018 at 11:00 am #

I would like to know how to check( in Python) which distribution data has ( Gaussian or Non-Gaussian), Could you please provide example.

Thanks in Advance

- Jason Brownlee June 19, 2018 at 2:46 pm #
  
  Yes, see here:
  https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
  
Chris Pfeifer July 5, 2018 at 2:58 pm #

I purchased your pdf book (Statistical Methods for Machine Learning) – it is great, I am learning
lots. My question is: when summarizing a dataset how can I get the count (number of observation) in the 25th percentile.

- Jason Brownlee July 5, 2018 at 3:10 pm #
  
  Thanks Chris.
  
  You can use the numpy.percentile() function:
  https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
  
  I give an example of using this function in the chapter on confidence intervals with the bootstrap.
  
Nikos Vilanakis November 30, 2018 at 7:54 pm #

You’re doing a great job in here Jason! Thank you once again!

- Jason Brownlee December 1, 2018 at 6:47 am #
  
  Thanks.
  
Sumedha Khatter March 2, 2020 at 11:20 am #

Thanks Jason for so useful and helpful tutorials.
Here is the line of code which I tried to find quartiles, and apparently we can use 0 and 100 to find the min and max of the data

quartiles = percentile(data, [0, 25, 50, 75, 100])

- Jason Brownlee March 2, 2020 at 1:16 pm #
  
  Nice.
  
Muawiya September 2, 2020 at 2:36 am #

Hello Jason Brownlee,

Jason what if in the quantile I modified some and there were some outlier and skewness .However, the median is not there how can I get the median?

Sincerely,

Muawiya

- Jason Brownlee September 2, 2020 at 6:32 am #
  
  Perhaps calculate the median directly.
  
Ken December 31, 2020 at 9:43 am #

I think numpy.percentile isn’t calculating the quartiles correctly as they are actually picking out the numbers from the list instead of calculating the actual values. Related link: https://stackoverflow.com/a/53551756

- Jason Brownlee December 31, 2020 at 9:52 am #
  
  Thanks for sharing, this goes against the documentation for the function.
  
Søren Fyhn August 1, 2023 at 5:18 pm #

Hi Jason, one doubt about the Five Number Summary for data with a discrete variable.

I understand from your introduction to Probability that the mode would give the expected value in such case. Can I interpret it so that the 5 Number Summary for a discrete variable will be based on occurences for each discrete value?
So 50th quartile is the mode, Min is the value with the least amount of occurences, Max with the most and so on?

- James Carmichael August 2, 2023 at 9:15 am #
  
  Hi Soren… Please clarify what is meant by “can I interpret it…” This will enable us to better assist you.
  
  - Søren Fyhn August 13, 2023 at 7:03 pm #
    
    Hi James,
    
    Sure, basically I am asking how the 5 Number Summary should be calculated for discrete variables. The way I understand it is that, since the 50th quartile is the mode in such case, the other quartiles are calculated similarly.
    That is what I wanted you to confirm.
    
    Btw, I didn’t receive an email notification that a reply was made to my question. That would be a nice if that was the case.
    
