What is Statistics (and why is it important in machine learning)?

Statistics is a collection of tools that you can use to get answers to important questions about data.

You can use descriptive statistical methods to transform raw observations into information that you can understand and share. You can use inferential statistical methods to reason from small samples of data to whole domains.

In this post, you will discover why statistics is important, both in general and for machine learning, and the types of methods that are available.

After reading this post, you will know:

  • Statistics is generally considered a prerequisite to the field of applied machine learning.
  • We need statistics to help transform observations into information and to answer questions about samples of observations.
  • Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Let’s get started.

A Gentle Introduction to Statistics

Photo by Mike Sutherland, some rights reserved.

Statistics is a Required Prerequisite

Machine learning and statistics are two tightly related fields of study, so much so that some statisticians refer to machine learning as “applied statistics” or “statistical learning” rather than by the computer-science-centric name.

Machine learning is almost universally presented to beginners assuming that the reader has some background in statistics. We can make this concrete with a few cherry-picked examples.

Take a look at this quote from the beginning of a popular applied machine learning book titled “Applied Predictive Modeling”:

… the reader should have some knowledge of basic statistics, including variance, correlation, simple linear regression, and basic hypothesis testing (e.g. p-values and test statistics).

— Page vii, Applied Predictive Modeling, 2013

Here’s another example from the popular “Introduction to Statistical Learning” book:

We expect that the reader will have had at least one elementary course in statistics.

— Page 9, An Introduction to Statistical Learning with Applications in R, 2013.

Even when statistics is not a prerequisite, some basic prior knowledge is helpful, as can be seen in this quote from the widely read “Programming Collective Intelligence”:

… this book does not assume you have any prior knowledge of […] or statistics. […] but having some knowledge of trigonometry and basic statistics will help you understand the algorithms.

— Page xiii, Programming Collective Intelligence: Building Smart Web 2.0 Applications, 2007.

In order to be able to understand machine learning, some basic understanding of statistics is required.

To see why this is the case, we must first understand why we need the field of statistics in the first place.

Why Learn Statistics?

Raw observations alone are data, but they are not information or knowledge.

Data raises questions, such as:

  • What is the most common or expected observation?
  • What are the limits on the observations?
  • What does the data look like?

Although they appear simple, these questions must be answered in order to turn raw observations into information that we can use and share.

Beyond raw data, we may design experiments in order to collect observations. From these experimental results we may have more sophisticated questions, such as:

  • What variables are most relevant?
  • What is the difference in an outcome between two experiments?
  • Are the differences real or the result of noise in the data?

Questions of this type are important. The results matter to the project, to stakeholders, and to effective decision making.

Statistical methods are required to find answers to the questions that we have about data.

We can see that statistical methods are required both to understand the data used to train a machine learning model and to interpret the results of testing different machine learning models.

This is just the tip of the iceberg as each step in a predictive modeling project will require the use of a statistical method.

What is Statistics?

Statistics is a subfield of mathematics.

It refers to a collection of methods for working with data and using data to answer questions.

Statistics is the art of making numerical conjectures about puzzling questions. […] The methods were developed over several hundred years by people who were looking for answers to their questions.

— Page xiii, Statistics, Fourth Edition, 2007.

Because the field comprises a grab bag of methods for working with data, it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study. Often a technique can be both a classical method from statistics and a modern algorithm used for feature selection or modeling.

Although a working knowledge of statistics does not require deep theoretical knowledge, some important and easy-to-digest theorems from the relationship between statistics and probability can provide a valuable foundation.

Two examples include the law of large numbers and the central limit theorem. The first aids in understanding why bigger samples are often better, and the second provides a foundation for how we can compare the expected values between samples (e.g. mean values).
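Both theorems can be seen in a small simulation. The sketch below is illustrative only: it assumes the population is a uniform distribution on [0, 1) with a true mean of 0.5, and the helper name sample_mean, the seed, and the sample sizes are arbitrary choices made for this example.

```python
import random
import statistics

random.seed(1)


def sample_mean(n):
    """Mean of n draws from a uniform distribution on [0, 1) (true mean 0.5)."""
    return statistics.fmean(random.random() for _ in range(n))


# Law of large numbers: the sample mean tends toward 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 4))

# Central limit theorem: the means of many samples of size 30 are roughly
# Gaussian, centered on 0.5, with a spread close to sigma / sqrt(30).
means = [sample_mean(30) for _ in range(2_000)]
mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)
expected_spread = (1 / 12) ** 0.5 / 30 ** 0.5  # sigma of uniform is sqrt(1/12)
print(round(mean_of_means, 3), round(spread_of_means, 3), round(expected_spread, 3))
```

Running it with different seeds should show the same pattern: larger samples give means closer to 0.5, and the spread of the sample means stays close to the value predicted from the population standard deviation.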

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data and inferential statistics for drawing conclusions from samples of data.

Statistics allow researchers to collect information, or data, from a large number of people and then summarize their typical experience. […] Statistics are also used to reach conclusions about general differences between groups. […] Statistics can also be used to see if scores on two variables are related and to make predictions.

Pages ix-x, Statistics in Plain English, Third Edition, 2010.

Descriptive Statistics

Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.

Commonly, we think of descriptive statistics as the calculation of statistical values on samples of data in order to summarize properties of the sample, such as the common or expected value (e.g. the mean or median) and the spread of the data (e.g. the variance or standard deviation).
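As a rough illustration, these summary values can be calculated with Python's standard statistics module. The observations below are made up for this example and do not come from any real dataset.

```python
import statistics

# Hypothetical sample of observations, made up for illustration.
sample = [2.1, 2.5, 2.5, 3.0, 3.2, 3.8, 4.1, 9.0]

summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "variance": statistics.variance(sample),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(sample),        # sample standard deviation
    "min": min(sample),
    "max": max(sample),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the single large value (9.0) pulls the mean above the median, which is one reason it can help to report both.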

Descriptive statistics may also cover graphical methods that can be used to visualize samples of data. Charts and graphics can provide a useful qualitative understanding of both the shape or distribution of observations and how variables may relate to each other.

Inferential Statistics

Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

Commonly, we think of inferential statistics as the estimation of quantities from the population distribution, such as the expected value or the amount of spread.
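A minimal sketch of this idea is shown below. It assumes a simulated Gaussian population (mean 50, standard deviation 10) that we would not normally be able to see, and it uses a Gaussian approximation for the interval around the estimate. The population, sample size, and seed are all hypothetical choices for illustration.

```python
import math
import random
import statistics

random.seed(7)

# Hypothetical population: 100,000 values from a Gaussian (mean 50, sd 10).
population = [random.gauss(50, 10) for _ in range(100_000)]

# In practice we only get to see a small sample from the population.
sample = random.sample(population, 200)

# Point estimate of the population mean and its standard error.
estimate = statistics.fmean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% interval using the Gaussian critical value (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
lower, upper = estimate - z * std_error, estimate + z * std_error

print(f"sample mean: {estimate:.2f}")
print(f"approx. 95% interval: [{lower:.2f}, {upper:.2f}]")
print(f"population mean: {statistics.fmean(population):.2f}")
```

The interval expresses the uncertainty in the estimate. For small samples, a Student's t critical value would typically be used in place of the Gaussian one.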

More sophisticated statistical inference tools can be used to quantify the likelihood of observing data samples given an assumption. These are often referred to as tools for statistical hypothesis testing, where the base assumption of a test is called the null hypothesis.
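One simple way to make this concrete is a permutation test, sketched below with the standard library only. This is just one of many possible tests, chosen here because it needs few assumptions. The accuracy scores for the two models are made up for illustration, and the number of permutations is an arbitrary choice.

```python
import random
import statistics

random.seed(3)

# Hypothetical accuracy scores for two models, made up for illustration.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85]
model_b = [0.76, 0.78, 0.75, 0.79, 0.77, 0.74, 0.80, 0.76]

observed = statistics.fmean(model_a) - statistics.fmean(model_b)

# Null hypothesis: both sets of scores come from the same distribution,
# so the group labels are interchangeable.
pooled = model_a + model_b
n_a = len(model_a)
n_permutations = 10_000
at_least_as_extreme = 0
for _ in range(n_permutations):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        at_least_as_extreme += 1

# Add-one correction so the estimated p-value is never exactly zero.
p_value = (at_least_as_extreme + 1) / (n_permutations + 1)

print(f"observed difference: {observed:.4f}")
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would be unlikely if the null hypothesis were true. It does not prove that one model is better.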

There are many examples of inferential statistical methods given the range of hypotheses we may assume and the constraints we may impose on the data in order to increase the power or likelihood that the finding of the test is correct.

Summary

In this post, you discovered why statistics is important, both in general and for machine learning, and the types of methods that are available.

Specifically, you learned:

  • Statistics is generally considered a prerequisite to the field of applied machine learning.
  • We need statistics to help transform observations into information and to answer questions about samples of observations.
  • Statistics is a collection of tools developed over hundreds of years for summarizing data and quantifying properties of a domain given a sample of observations.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

23 Responses to What is Statistics (and why is it important in machine learning)?

  1. Khan June 29, 2018 at 6:58 am

    Hi
    If a dataset is tall, then how do we sample it? I mean, what methods are used for sample selection?
    Regards

    • Jason Brownlee June 29, 2018 at 3:23 pm

      By tall, I guess you mean many rows. You can randomly select rows as a sub-sample.

  2. Khan June 29, 2018 at 7:01 pm

    Yes, I mean a large number of rows. But how can I get good-quality samples that represent the majority of the data? Which method should be used?

    • Jason Brownlee June 30, 2018 at 6:06 am

      Often descriptive statistics can be used to confirm that a data sample is representative of the population.

      Hypothesis tests can confirm these findings.

      • Sufi December 14, 2018 at 11:07 pm

        I think Khan can use the Central Limit Theorem by taking the means of different samples of rows…

  3. Arsh July 3, 2018 at 8:36 pm

    How can we combine these statistics skills with programming and apply them to solve real-world problems, particularly machine learning and AI problems?

  4. Ajay August 8, 2018 at 7:55 pm

    Hi,

    1) Are descriptive statistics and EDA the same?
    2) How are descriptive statistics used in applied machine learning?
    3) How are inferential statistics used in applied machine learning?

    Thank you

    • Jason Brownlee August 9, 2018 at 7:39 am

      EDA is a process that can use descriptive stats.

      Descriptive stats can inform how to better prepare data for modeling, perhaps.

      ML is applied inference. We are building inductive models.

  5. Cynthia June 25, 2019 at 3:24 am

    What is a normal distribution?
    How is it related to sample size and a representative sample?

      • Cynthia June 27, 2019 at 5:42 am

        Thank you.
        Is it safe to say, a normal distribution shows a representative sample of the population?

        • Jason Brownlee June 27, 2019 at 8:05 am

          No. A sample may or may not be normal and may or may not be representative.

          • Cynthia June 28, 2019 at 3:32 am

            I understand a sample may or may not be normal or representative, but if it is normal, would it be representative?

            Thank you.

          • Jason Brownlee June 28, 2019 at 6:10 am

            I would not make that claim. It could be normal, but underpowered and therefore not representative.

  6. Hamza October 29, 2019 at 3:11 am

    Do classifiers depend on the mode, mean, and median? If yes, then how and why? How do these statistics help us in selecting a classifier?

  7. Hamza November 1, 2019 at 2:11 am

    If a dataset has four columns, each column has its own mean value… how do we get just one mean for the whole dataset?

    • Jason Brownlee November 1, 2019 at 5:40 am

      You don’t.

      What are you trying to achieve exactly?

      • Hamza November 2, 2019 at 3:23 am

        If my dataset is arranged in 4 columns… all I want is one overall mean value for the whole dataset, not 4 mean values for the four columns.

        • Jason Brownlee November 2, 2019 at 6:49 am

          If all columns measure the same thing, then perhaps stack them into one column and calculate the mean.

          If not, calculating the mean across columns is invalid and would not have any meaning.

  8. Sri December 17, 2019 at 8:26 pm

    Please do not make me enter email more than once. Though you are in business, please make it professional. Make it clean and avoid junk.

    • Jason Brownlee December 18, 2019 at 6:02 am

      I offer many (17+) different mini-courses on a range of topics. I need some gate on each so you don’t get overwhelmed.
