
Statistics and machine learning are two very closely related fields.

In fact, the line between the two can be very fuzzy at times. Nevertheless, there are methods that clearly belong to the field of statistics that are not only useful, but invaluable when working on a machine learning project.

It would be fair to say that statistical methods are required to effectively work through a machine learning predictive modeling project.

In this post, you will discover specific examples of statistical methods that are useful and required at key steps in a predictive modeling problem.

After completing this post, you will know:

- That exploratory data analysis, data summarization, and data visualization can be used to help frame your predictive modeling problem and better understand the data.
- That statistical methods can be used to clean and prepare data ready for modeling.
- That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions from final models.


Let’s get started.

## Overview

In this post, we are going to look at 10 examples of where statistical methods are used in an applied machine learning project.

This will demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.

- Problem Framing
- Data Understanding
- Data Cleaning
- Data Selection
- Data Preparation
- Model Evaluation
- Model Configuration
- Model Selection
- Model Presentation
- Model Predictions

## 1. Problem Framing

Perhaps the point of biggest leverage in a predictive modeling problem is the framing of the problem.

This is the selection of the type of problem, e.g. regression or classification, and perhaps the structure and types of the inputs and outputs for the problem.

The framing of the problem is not always obvious. For newcomers to a domain, it may require significant exploration of the observations in the domain.

Domain experts, who may be stuck seeing the issues from a conventional perspective, may also benefit from considering the data from multiple perspectives.

Statistical methods that can aid in the exploration of the data during the framing of a problem include:

- **Exploratory Data Analysis**. Summarization and visualization in order to explore ad hoc views of the data.
- **Data Mining**. Automatic discovery of structured relationships and patterns in the data.

## 2. Data Understanding

Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.

Some of this knowledge may come from domain expertise, or require domain expertise in order to interpret. Nevertheless, both experts and novices in a field of study will benefit from actually handling real observations from the domain.

Two large branches of statistical methods are used to aid in understanding data; they are:

- **Summary Statistics**. Methods used to summarize the distribution and relationships between variables using statistical quantities.
- **Data Visualization**. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
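As a minimal sketch of the summary statistics idea, the example below describes the distribution of one variable and the linear relationship between two variables. The data, the `summarize()` helper, and the `pearson()` helper are hypothetical and exist only for illustration; the code uses only the Python standard library.

```python
import statistics

# Hypothetical observations: hours studied and exam score for eight students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0, 79.0, 82.0]


def summarize(values):
    """Return a few quantities that describe the distribution of one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


summary = summarize(scores)
r = pearson(hours, scores)
print(summary)
print(f"correlation between hours and scores: {r:.3f}")
```

A correlation close to 1 suggests a strong positive linear relationship, which is worth confirming with a scatter plot before relying on it.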

## 3. Data Cleaning

Observations from a domain are often not pristine.

Although the data is digital, it may be subjected to processes that can damage the fidelity of the data, and in turn any downstream processes or models that make use of the data.

Some examples include:

- Data corruption.
- Data errors.
- Data loss.

The process of identifying and repairing issues with the data is called data cleaning.

Statistical methods are used for data cleaning; for example:

- **Outlier detection**. Methods for identifying observations that are far from the expected value in a distribution.
- **Imputation**. Methods for repairing or filling in corrupt or missing values in observations.
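The two methods above can be sketched together as follows. This is one possible approach, not the only one: it assumes the interquartile range (IQR) rule for flagging outliers and median imputation for filling gaps, and the sensor readings are made up. Whether a flagged outlier should be replaced, removed, or kept depends on the domain.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value, 250.0 is a suspect spike.
readings = [20.1, 19.8, 20.5, None, 21.0, 250.0, 20.3, 19.9, None, 20.7]

observed = [v for v in readings if v is not None]

# Outlier detection with the IQR rule: flag values more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < lower or v > upper]

# Imputation: replace missing values and flagged outliers with the median
# of the remaining (inlier) observations.
inliers = [v for v in observed if lower <= v <= upper]
fill = statistics.median(inliers)
cleaned = [fill if (v is None or v < lower or v > upper) else v for v in readings]

print("outliers:", outliers)
print("cleaned:", cleaned)
```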

## 4. Data Selection

Not all observations or all variables may be relevant when modeling.

The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection.

Two types of statistical methods that are used for data selection include:

- **Data Sampling**. Methods to systematically create smaller representative samples from larger datasets.
- **Feature Selection**. Methods to automatically identify those variables that are most relevant to the outcome variable.
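Here is a small sketch of both ideas, assuming a simple random sample for data sampling and a ranking by absolute Pearson correlation for feature selection. The synthetic dataset and the variable names (`x1`, `x2`, `x3`) are hypothetical. Note that correlation only measures pairwise linear relationships, so a feature that is useful only in combination with others can be ranked low.

```python
import random
import statistics

random.seed(1)  # fixed seed so the hypothetical data and the sample are reproducible

# Hypothetical dataset: the target depends strongly on x1, weakly on x2, not on x3.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [3.0 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_u, mean_v = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


# Data sampling: draw a simple random sample of row indices without replacement.
sample_idx = random.sample(range(n), k=50)

# Feature selection: rank features by absolute correlation with the target.
features = {"x1": x1, "x2": x2, "x3": x3}
relevance = {name: abs(pearson(col, y)) for name, col in features.items()}
ranked = sorted(relevance, key=relevance.get, reverse=True)

print("sampled rows:", len(sample_idx))
print("features ranked by relevance:", ranked)
```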

## 5. Data Preparation

Data often cannot be used directly for modeling.

Some transformation is often required in order to change the shape or structure of the data to make it more suitable for the chosen framing of the problem or learning algorithms.

Data preparation is performed using statistical methods. Some common examples include:

- **Scaling**. Methods such as standardization and normalization.
- **Encoding**. Methods such as integer encoding and one hot encoding.
- **Transforms**. Methods such as power transforms like the Box-Cox method.
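The three kinds of preparation listed above can be sketched in a few lines each. The data below is hypothetical, and the Box-Cox transform is shown with a fixed lambda for clarity; in practice lambda is usually estimated from the data rather than chosen by hand.

```python
import math
import statistics

values = [1.0, 2.0, 4.0, 8.0, 16.0]         # hypothetical positive, right-skewed feature
colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature

# Scaling (normalization): rescale values to the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Scaling (standardization): shift to zero mean and unit sample standard deviation.
mu, sigma = statistics.mean(values), statistics.stdev(values)
standardized = [(v - mu) / sigma for v in values]

# Encoding (integer): map each category to an integer, in sorted order.
categories = sorted(set(colors))
to_int = {c: i for i, c in enumerate(categories)}
integer_encoded = [to_int[c] for c in colors]

# Encoding (one hot): one binary column per category.
one_hot = [[1 if i == code else 0 for i in range(len(categories))]
           for code in integer_encoded]


def box_cox(x, lam):
    """Box-Cox power transform of a positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    # lambda = 0 is defined as the natural log transform.
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


# Transform: lambda = 0 turns the doubling sequence into evenly spaced values.
log_transformed = [box_cox(v, 0) for v in values]

print(normalized)
print(integer_encoded)
print(one_hot)
print(log_transformed)
```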

## 6. Model Evaluation

A crucial part of a predictive modeling problem is evaluating a learning method.

This often requires the estimation of the skill of the model when making predictions on data not seen during the training of the model.

Generally, the planning of this process of training and evaluating a predictive model is called experimental design. This is a whole subfield of statistical methods.

**Experimental Design**. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.

As part of implementing an experimental design, resampling methods are used to make economical use of the available data when estimating the skill of the model. These methods are themselves a subfield of statistics.

**Resampling Methods**. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
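One widely used resampling method is k-fold cross-validation. The sketch below only produces the train/test index splits; the function name `k_fold_indices()` and its parameters are hypothetical, and fitting and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the held-out one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each observation appears in exactly one test fold, so every row is used for both training and evaluation across the k runs.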

## 7. Model Configuration

A given machine learning algorithm often has a suite of hyperparameters that allow the learning method to be tailored to a specific problem.

The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model.

The interpretation and comparison of the results between different hyperparameter configurations is made using one of two subfields of statistics, namely:

- **Statistical Hypothesis Tests**. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- **Estimation Statistics**. Methods that quantify the uncertainty of a result using confidence intervals.

## 8. Model Selection

Any one of many machine learning algorithms may be appropriate for a given predictive modeling problem.

The process of selecting one method as the solution is called model selection.

This may involve a suite of criteria both from stakeholders in the project and the careful interpretation of the estimated skill of the methods evaluated for the problem.

As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection. They are:

- **Statistical Hypothesis Tests**. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- **Estimation Statistics**. Methods that quantify the uncertainty of a result using confidence intervals.
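To make the hypothesis test idea concrete, here is a sketch of one possible test for comparing two models: an exact paired sign-flip permutation test on per-fold scores. The accuracy values are made up, and this particular test is an illustrative choice rather than the required one. Scores from cross-validation folds are not fully independent, so the p-value is best treated as a rough guide.

```python
import itertools

# Hypothetical accuracy of two models on the same 10 cross-validation folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.81, 0.80]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.79, 0.77, 0.81, 0.79, 0.78]

# Paired differences: both models were scored on identical folds.
diffs = [a - b for a, b in zip(model_a, model_b)]
observed = abs(sum(diffs))

# Null hypothesis: the models are equally skillful, so each paired difference
# is equally likely to be positive or negative. Enumerate all 2**n sign
# assignments and count those at least as extreme as the observed result.
extreme = 0
for signs in itertools.product((1, -1), repeat=len(diffs)):
    statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
    if statistic >= observed - 1e-12:  # small tolerance for float rounding
        extreme += 1

p_value = extreme / 2 ** len(diffs)
print(f"two-sided p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference in skill is unlikely under the assumption that the two models perform the same.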

## 9. Model Presentation

Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data.

A part of presenting a final model involves presenting the estimated skill of the model.

Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.

**Estimation Statistics**. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
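As a sketch, a confidence interval for classification accuracy can be approximated with the normal approximation to the binomial distribution. The function name and the example figures (88% accuracy on 500 held-out examples) are hypothetical, and the approximation assumes a reasonably large, independent test set.

```python
import math
from statistics import NormalDist


def accuracy_confidence_interval(accuracy, n_samples, confidence=0.95):
    """Normal-approximation confidence interval for a classification accuracy."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    # Accuracy is a proportion, so clip the interval to [0, 1].
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


lower, upper = accuracy_confidence_interval(accuracy=0.88, n_samples=500)
print(f"accuracy: 0.88, 95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the point estimate shows stakeholders how much the estimated skill could vary with a different test sample.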

## 10. Model Predictions

Finally, it will come time to start using a final model to make predictions for new data where we do not know the real outcome.

As part of making predictions, it is important to quantify the confidence of the prediction.

Just like with the process of model presentation, we can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.

**Estimation Statistics**. Methods that quantify the uncertainty for a prediction via prediction intervals.
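For a regression model, a rough prediction interval can be sketched from the spread of errors on held-out data. This simplified version assumes the errors are approximately Gaussian with a mean near zero and ignores uncertainty in the model itself; the data and the `prediction_interval()` helper are hypothetical.

```python
import statistics
from statistics import NormalDist

# Hypothetical actual values and model predictions on a held-out dataset.
actual = [10.2, 11.8, 9.5, 12.1, 10.9, 11.4, 9.9, 12.6]
predicted = [10.0, 12.0, 10.0, 12.0, 11.0, 11.0, 10.0, 12.0]

# The spread of the residuals estimates how far predictions tend to miss.
residuals = [a - p for a, p in zip(actual, predicted)]
residual_std = statistics.stdev(residuals)


def prediction_interval(y_hat, confidence=0.95):
    """Approximate interval expected to contain the real outcome for y_hat."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return y_hat - z * residual_std, y_hat + z * residual_std


lower, upper = prediction_interval(11.5)
print(f"prediction: 11.5, 95% prediction interval: [{lower:.2f}, {upper:.2f}]")
```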

## Summary

In this post, you discovered the importance of statistical methods throughout the process of working through a predictive modeling project.

Specifically, you learned:

- That exploratory data analysis, data summarization, and data visualization can be used to help frame your predictive modeling problem and better understand the data.
- That statistical methods can be used to clean and prepare data ready for modeling.
- That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions from final models.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Saying that statistical methods are useful in the machine learning field is like saying that woodworking methods are useful for a carpenter. Putting data management aside, the whole meaning of machine learning is applying statistical methods to data.

Perhaps what makes me wonder is the following question – is machine learning possible without the use of statistics?

Thanks.

Sure, the stats guys never considered the actual model methods as statistical, e.g. CART, Neural Nets, SVM, etc. They considered and still consider them algorithms from comp sci.

Too much statistics knowledge is also not good for creating a machine learning model. Sometimes a statistician ignores a feature because he thinks it does not affect the dependent variable much. But in fact, in prediction, a combination of features can create very good predictive power.

It can happen.

Statistical knowledge is very important and useful, but it is “only” domain knowledge and thus only one tool in the toolbox. We solve this (sometimes) by involving multiple divisions in this process. These people have a different perspective on the problem to be solved. Statistical methods provide the basis, complement and verify. Then take a look at the differences and it will get interesting. Crunch time 😉

When plotting the distribution of a feature, if it is skewed to the left or right, what is the next step to follow? Must the feature have a normal distribution? How are confidence intervals and hypothesis testing used in model building?

You can use a power transform to fix a skew.

Some algorithms prefer data to have a Gaussian distribution.

A confidence interval is used in the presentation of model skill. A hypothesis test is used to check whether the difference between models is real.

Great summary of usage of stats in machine learning. Particularly usage of inferential statistics in ML.

Thanks.

In order to make an ML model that can predict the labels, is it compulsory to use these statistical methods?

To develop a robust and skilful model, I think yes.

Is this more towards supervised learning?

Yes, the focus of this blog and this post in particular is supervised learning.

Hello, this may not be the right post for this query, so I apologize in advance.

I have a database that I intend to analyze in order to make some predictions. However, the data are not numerical (so to speak) but words. To give a little context, the data are telecommunications equipment alarms. These alarms are categorized by a priority level, and there are other parameters that describe characteristics of the equipment in question.

How would you deal with this case? I had thought about binarizing my data, keeping level 1 for what I want to predict and level 0 for everything else, but I think that would lose intrinsic characteristics of the system. Is there any method that allows for the treatment of these situations?

Thank you very much for your work Jason, it has been very helpful.

I would recommend encoding the words or text. An integer encoding, bag of words, or one hot encoding might be a good place to start. More advanced encodings may follow.

Perhaps start with some techniques described here:

https://machinelearningmastery.com/start-here/#nlp