How to Calculate the 5-Number Summary for Your Data in Python

Data summarization provides a convenient way to describe all of the values in a data sample with just a few statistical values.

The mean and standard deviation are used to summarize data with a Gaussian distribution, but may not be meaningful, or could even be misleading, if your data sample has a non-Gaussian distribution.

In this tutorial, you will discover the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

After completing this tutorial, you will know:

  • Data summaries, such as the mean and standard deviation, are only meaningful for data with a Gaussian distribution.
  • The five-number summary can be used to describe a data sample with any distribution.
  • How to calculate the five-number summary in Python.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Nonparametric Data Summarization
  2. Five-Number Summary
  3. How to Calculate the Five-Number Summary
  4. Use of the Five-Number Summary

Nonparametric Data Summarization

Data summarization techniques provide a way to describe the distribution of data using a few key measurements.

The most common example of data summarization is the calculation of the mean and standard deviation for data that has a Gaussian distribution. With these two parameters alone, you can understand and re-create the distribution of the data. The data summary can compress as few as tens or as many as millions of individual observations.

The problem is that the mean and standard deviation are not meaningful summaries of data that does not have a Gaussian distribution. You can still calculate these quantities, but they do not summarize the data distribution; in fact, they can be very misleading.

In the case of data that does not have a Gaussian distribution, you can summarize the data sample using the five-number summary.
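To make the point concrete, here is a minimal sketch using a small made-up sample with one extreme observation. The values are illustrative assumptions, not data from this tutorial. It uses Python's standard statistics module to compare the mean and standard deviation with the median.

```python
# compare the mean and standard deviation with the median on a skewed sample
from statistics import mean, median, stdev

# hypothetical sample: seven similar values and one extreme value
data = [20, 22, 25, 27, 30, 32, 35, 400]

data_mean = mean(data)
data_stdev = stdev(data)
data_median = median(data)

print('Mean: %.3f' % data_mean)
print('Standard deviation: %.3f' % data_stdev)
print('Median: %.3f' % data_median)
```

The single large value pulls the mean above seven of the eight observations, whereas the median stays within the bulk of the sample.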

Five-Number Summary

The five-number summary, or 5-number summary for short, is a non-parametric data summarization technique.

It is sometimes called the Tukey 5-number summary because it was recommended by John Tukey. It can be used to describe the distribution of data samples for data with any distribution.

As a standard summary for general use, the 5-number summary provides about the right amount of detail.

— Page 37, Understanding Robust and Exploratory Data Analysis, 2000.

The five-number summary involves the calculation of five summary statistics, namely:

  • Median: The middle value in the sample, also called the 50th percentile or the 2nd quartile.
  • 1st Quartile: The 25th percentile.
  • 3rd Quartile: The 75th percentile.
  • Minimum: The smallest observation in the sample.
  • Maximum: The largest observation in the sample.

A quartile is an observed value at a point that aids in splitting the ordered data sample into four equally sized parts. The median, or 2nd Quartile, splits the ordered data sample into two parts, and the 1st and 3rd quartiles split each of those halves into quarters.
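The halving description above can be sketched in a few lines. This is one simple convention, assumed here for illustration, and it only handles a sample with an even number of observations. It is not how NumPy computes percentiles by default, so its results can differ from those of percentile(). The sample values are made up.

```python
# find quartiles by splitting an ordered sample into halves
from statistics import median

# hypothetical sample with an even number of observations
data = [7, 1, 5, 3, 8, 2, 6, 4]
ordered = sorted(data)

# the median splits the ordered sample into two halves
half = len(ordered) // 2
lower, upper = ordered[:half], ordered[half:]

q1 = median(lower)    # middle of the lower half
q2 = median(ordered)  # median of the whole sample
q3 = median(upper)    # middle of the upper half
print(q1, q2, q3)
```

Each of the four parts (below q1, between q1 and q2, between q2 and q3, above q3) contains two of the eight observations.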

A percentile is an observed value at a point that aids in splitting the ordered data sample into 100 equally sized portions. Quartiles are often also expressed as percentiles.

Both the quartile and percentile values are examples of rank statistics that can be calculated on a data sample with any distribution. They are used to quickly summarize how much of the data in the distribution is behind or in front of a given observed value. For example, half of the observations are behind the median of a distribution and half are in front of it.
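As a small illustration of this idea, the sketch below counts how many observations fall on each side of the median. The sample values are made up for the example.

```python
# count the observations on each side of the median
from statistics import median

# hypothetical unordered sample
data = [3, 8, 1, 9, 4, 7, 2, 6]

mid = median(data)
behind = sum(1 for x in data if x < mid)    # observations below the median
in_front = sum(1 for x in data if x > mid)  # observations above the median
print('Median: %.1f, behind: %d, in front: %d' % (mid, behind, in_front))
```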

Note that quartiles are also calculated in the box and whisker plot, a nonparametric method to graphically summarize the distribution of a data sample.

How to Calculate the Five-Number Summary

Calculating the five-number summary involves finding the observations for each quartile as well as the minimum and maximum observed values from the data sample.

If there is no specific value in the ordered data sample for the quartile, such as when there is an even number of observations and we are trying to find the median, then we can calculate the mean of the two closest values, such as the two middle values.

We can calculate arbitrary percentile values in Python using the percentile() NumPy function. We can use this function to calculate the 1st, 2nd (median), and 3rd quartile values. The function takes both an array of observations and a floating point value to specify the percentile to calculate in the range of 0 to 100. It can also take a list of percentile values to calculate multiple percentiles in a single call.
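A minimal sketch of such a call, assuming a small made-up array named data:

```python
# calculate quartiles with a single call to percentile()
from numpy import array, percentile

# hypothetical sample of nine observations
data = array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# request the 25th, 50th and 75th percentiles together
quartiles = percentile(data, [25, 50, 75])
print(quartiles)
```

The result is a NumPy array with one value per requested percentile, in the same order as the list that was passed in.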

By default, the function linearly interpolates between the two nearest observations if needed. In the case of the median of a sample with an even number of values, this is the average of the two middle values.
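The default behavior can be seen on a tiny sample with an even number of values. This is a sketch with made-up data; it relies only on the default settings of percentile().

```python
# show the default interpolation of percentile() on an even-sized sample
from numpy import array, percentile

# hypothetical sample with four observations
data = array([1, 2, 3, 4])

# neither percentile lands exactly on an observation
median = percentile(data, 50)
q1 = percentile(data, 25)
print('Median: %.2f' % median)
print('Q1: %.2f' % q1)
```

The median comes out as 2.50, the average of the two middle values 2 and 3, and the 25th percentile as 1.75, three quarters of the way from 1 to 2.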

The NumPy functions min() and max() can be used to return the smallest and largest values in the data sample.
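A minimal sketch, assuming a small made-up array named data; here min() and max() are called as methods on the NumPy array:

```python
# find the smallest and largest observations in a sample
from numpy import array

# hypothetical sample
data = array([0.42, 0.07, 0.91, 0.35, 0.68])

data_min, data_max = data.min(), data.max()
print('Min: %.2f, Max: %.2f' % (data_min, data_max))
```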

We can put all of this together.

The example below generates a data sample drawn from a uniform distribution between 0 and 1 and summarizes it using the five-number summary.
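A minimal sketch of such an example is given here, assuming the rand() function from numpy.random to draw the sample. The sample size of 1,000 and the seed value are assumptions made for this sketch, so the numbers it prints need not match those quoted in the text.

```python
# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand, seed

# seed the random number generator so the run is repeatable (assumed value)
seed(1)
# generate a sample of 1,000 values drawn uniformly from [0, 1) (assumed size)
data = rand(1000)
# calculate the quartiles
q1, q2, q3 = percentile(data, [25, 50, 75])
# calculate the minimum and maximum
data_min, data_max = data.min(), data.max()
# print the 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % q1)
print('Median: %.3f' % q2)
print('Q3: %.3f' % q3)
print('Max: %.3f' % data_max)
```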

Running the example generates the data sample and calculates the five-number summary to describe the sample distribution.

We can see that the spread of observations is close to our expectations, showing 0.27 for the 25th percentile, 0.53 for the 50th percentile, and 0.76 for the 75th percentile, close to the idealized values of 0.25, 0.50, and 0.75 respectively.

Use of the Five-Number Summary

The five-number summary can be calculated for a data sample with any distribution.

This includes data that has a known distribution, such as a Gaussian or Gaussian-like distribution.

I would recommend always calculating the five-number summary, and only moving on to distribution-specific summaries, such as the mean and standard deviation for the Gaussian, once you can identify the distribution to which the data belongs.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Describe three examples in a machine learning project where a five-number summary could be calculated.
  • Generate a data sample with a Gaussian distribution and calculate the five-number summary.
  • Write a function to calculate a 5-number summary for any data sample.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

Specifically, you learned:

  • Data summaries, such as the mean and standard deviation, are only meaningful for data with a Gaussian distribution.
  • The five-number summary can be used to describe a data sample with any distribution.
  • How to calculate the five-number summary in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

21 Responses to How to Calculate the 5-Number Summary for Your Data in Python

  1. £aique Merlin June 13, 2018 at 4:08 pm

    Great post as usual. Very important concepts to know before further exploring your data. Keep it up Jason!

  2. Aanchal Iyer June 14, 2018 at 11:12 pm

    Well written article, Python has the ability to manipulate some statistical data and calculate results of various statistical operations.

  3. afees June 17, 2018 at 7:18 pm

    I am happy to get knowledge here

  4. Sailaja June 19, 2018 at 11:00 am

    I would like to know how to check( in Python) which distribution data has ( Gaussian or Non-Gaussian), Could you please provide example.

    Thanks in Advance

  5. Chris Pfeifer July 5, 2018 at 2:58 pm

    I purchased your pdf book (Statistical Methods for Machine Learning) – it is great, I am learning
    lots. My question is: when summarizing a dataset how can I get the count (number of observation) in the 25th percentile.

  6. Nikos Vilanakis November 30, 2018 at 7:54 pm

    You’re doing a great job in here Jason! Thank you once again!

  7. Sumedha Khatter March 2, 2020 at 11:20 am

    Thanks Jason for so useful and helpful tutorials.
    Here is the line of code which I tried to find quartiles, and apparently we can use 0 and 100 to find the min and max of the data

    quartiles = percentile(data, [0, 25, 50, 75, 100])

  8. Muawiya September 2, 2020 at 2:36 am

    Hello Jason Brownlee,

    Jason what if in the quantile I modified some and there were some outlier and skewness .However, the median is not there how can I get the median?

    Sincerely,

    Muawiya

  9. Ken December 31, 2020 at 9:43 am

    I think numpy.percentile isn’t calculating the quartiles correctly as they are actually picking out the numbers from the list instead of calculating the actual values. Related link: https://stackoverflow.com/a/53551756

    • Jason Brownlee December 31, 2020 at 9:52 am

      Thanks for sharing, this goes against the documentation for the function.

  10. Søren Fyhn August 1, 2023 at 5:18 pm

    Hi Jason, one doubt about the Five Number Summary for data with a discrete variable.

    I understand from your introduction to Probability that the mode would give the expected value in such case. Can I interpret it so that the 5 Number Summary for a discrete variable will be based on occurences for each discrete value?
    So 50th quartile is the mode, Min is the value with the least amount of occurences, Max with the most and so on?

    • James Carmichael August 2, 2023 at 9:15 am

      Hi Soren…Please clarify what is meant by “can I interpret it…” This enable us to better assist you.

      • Søren Fyhn August 13, 2023 at 7:03 pm

        Hi James,

        Sure, basically I am asking how the 5 Number Summary should be calculated for discrete variables. The way I understand it is that, since the 50th quartile is the mode in such case, the other quartiles are calculated similarly.
        That is what I wanted you to confirm.

        Btw, I didn’t receive an email notification that a reply was made to my question. That would be a nice if that was the case.
