The post A Gentle Introduction to Exponential Smoothing for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

]]>It is a powerful forecasting method that may be used as an alternative to the popular Box-Jenkins ARIMA family of methods.

In this tutorial, you will discover the exponential smoothing method for univariate time series forecasting.

After completing this tutorial, you will know:

- What exponential smoothing is and how it is different from other forecasting methods.
- The three main types of exponential smoothing and how to configure them.
- How to implement exponential smoothing in Python.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- What Is Exponential Smoothing?
- Types of Exponential Smoothing
- How to Configure Exponential Smoothing
- Exponential Smoothing in Python

Exponential smoothing is a time series forecasting method for univariate data.

Time series methods like the Box-Jenkins ARIMA family of methods develop a model where the prediction is a weighted linear sum of recent past observations or lags.

Exponential smoothing forecasting methods are similar in that a prediction is a weighted sum of past observations, but the model explicitly uses an exponentially decreasing weight for past observations.

Specifically, past observations are weighted with a geometrically decreasing ratio.

Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight.

— Page 171, Forecasting: principles and practice, 2013.

Exponential smoothing methods may be considered as peers and an alternative to the popular Box-Jenkins ARIMA class of methods for time series forecasting.

Collectively, the methods are sometimes referred to as ETS models, referring to the explicit modeling of Error, Trend and Seasonality.

There are three main types of exponential smoothing time series forecasting methods.

A simple method that assumes no systematic structure, an extension that explicitly handles trends, and the most advanced approach that add support for seasonality.

Single Exponential Smoothing, SES for short, also called Simple Exponential Smoothing, is a time series forecasting method for univariate data without a trend or seasonality.

It requires a single parameter, called *alpha* (*a*), also called the smoothing factor or smoothing coefficient.

This parameter controls the rate at which the influence of the observations at prior time steps decay exponentially. Alpha is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.

A value close to 1 indicates fast learning (that is, only the most recent values influence the forecasts), whereas a value close to 0 indicates slow learning (past observations have a large influence on forecasts).

— Page 89, Practical Time Series Forecasting with R, 2016.

Hyperparameters:

**Alpha**: Smoothing factor for the level.

Double Exponential Smoothing is an extension to Exponential Smoothing that explicitly adds support for trends in the univariate time series.

In addition to the *alpha* parameter for controlling smoothing factor for the level, an additional smoothing factor is added to control the decay of the influence of the change in trend called *beta* (*b*).

The method supports trends that change in different ways: an additive and a multiplicative, depending on whether the trend is linear or exponential respectively.

Double Exponential Smoothing with an additive trend is classically referred to as Holt’s linear trend model, named for the developer of the method Charles Holt.

**Additive Trend**: Double Exponential Smoothing with a linear trend.**Multiplicative Trend**: Double Exponential Smoothing with an exponential trend.

For longer range (multi-step) forecasts, the trend may continue on unrealistically. As such, it can be useful to dampen the trend over time.

Dampening means reducing the size of the trend over future time steps down to a straight line (no trend).

The forecasts generated by Holt’s linear method display a constant trend (increasing or decreasing) indecently into the future. Even more extreme are the forecasts generated by the exponential trend method […] Motivated by this observation […] introduced a parameter that “dampens” the trend to a flat line some time in the future.

— Page 183, Forecasting: principles and practice, 2013.

As with modeling the trend itself, we can use the same principles in dampening the trend, specifically additively or multiplicatively for a linear or exponential dampening effect. A damping coefficient *Phi* (*p*) is used to control the rate of dampening.

**Additive Dampening**: Dampen a trend linearly.**Multiplicative Dampening**: Dampen the trend exponentially.

Hyperparameters:

**Alpha**: Smoothing factor for the level.**Beta**: Smoothing factor for the trend.**Trend Type**: Additive or multiplicative.**Dampen Type**: Additive or multiplicative.**Phi**: Damping coefficient.

Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds support for seasonality to the univariate time series.

This method is sometimes called Holt-Winters Exponential Smoothing, named for two contributors to the method: Charles Holt and Peter Winters.

In addition to the alpha and beta smoothing factors, a new parameter is added called *gamma* (*g*) that controls the influence on the seasonal component.

As with the trend, the seasonality may be modeled as either an additive or multiplicative process for a linear or exponential change in the seasonality.

**Additive Seasonality**: Triple Exponential Smoothing with a linear seasonality.**Multiplicative Seasonality**: Triple Exponential Smoothing with an exponential seasonality.

Triple exponential smoothing is the most advanced variation of exponential smoothing and through configuration, it can also develop double and single exponential smoothing models.

Being an adaptive method, Holt-Winter’s exponential smoothing allows the level, trend and seasonality patterns to change over time.

— Page 95, Practical Time Series Forecasting with R, 2016.

Additionally, to ensure that the seasonality is modeled correctly, the number of time steps in a seasonal period (*Period*) must be specified. For example, if the series was monthly data and the seasonal period repeated each year, then the Period=12.

Hyperparameters:

**Alpha**: Smoothing factor for the level.**Beta**: Smoothing factor for the trend.**Gamma**: Smoothing factor for the seasonality.**Trend Type**: Additive or multiplicative.**Dampen Type**: Additive or multiplicative.**Phi**: Damping coefficient.**Seasonality Type**: Additive or multiplicative.**Period**: Time steps in seasonal period.

All of the model hyperparameters can be specified explicitly.

This can be challenging for experts and beginners alike.

Instead, it is common to use numerical optimization to search for and fund the smoothing coefficients (*alpha*, *beta*, *gamma*, and *phi*) for the model that result in the lowest error.

[…] a more robust and objective way to obtain values for the unknown parameters included in any exponential smoothing method is to estimate them from the observed data. […] the unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the SSE [sum of the squared errors].

— Page 177, Forecasting: principles and practice, 2013.

The parameters that specify the type of change in the trend and seasonality, such as weather they are additive or multiplicative and whether they should be dampened, must be specified explicitly.

This section looks at how to implement exponential smoothing in Python.

The implementations of Exponential Smoothing in Python are provided in the Statsmodels Python library.

The implementations are based on the description of the method in Rob Hyndman and George Athanasopoulos’ excellent book “Forecasting: Principles and Practice,” 2013 and their R implementations in their “forecast” package.

Single Exponential Smoothing or simple smoothing can be implemented in Python via the SimpleExpSmoothing Statsmodels class.

First, an instance of the *SimpleExpSmoothing* class must be instantiated and passed the training data. The *fit()* function is then called providing the fit configuration, specifically the *alpha* value called *smoothing_level*. If this is not provided or set to *None*, the model will automatically optimize the value.

This *fit()* function returns an instance of the *HoltWintersResults* class that contains the learned coefficients. The *forecast()* or the *predict()* function on the result object can be called to make a forecast.

For example:

# single exponential smoothing ... from statsmodels.tsa.holtwinters import SimpleExpSmoothing # prepare data data = ... # create class model = SimpleExpSmoothing(data) # fit model model_fit = model.fit(...) # make prediction yhat = model_fit.predict(...)

Single, Double and Triple Exponential Smoothing can be implemented in Python using the ExponentialSmoothing Statsmodels class.

First, an instance of the ExponentialSmoothing class must be instantiated, specifying both the training data and some configuration for the model.

Specifically, you must specify the following configuration parameters:

**trend**: The type of trend component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the trend can be disabled by setting it to None.**damped**: Whether or not the trend component should be damped, either*True*or*False*.**seasonal**: The type of seasonal component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the seasonal component can be disabled by setting it to None.**seasonal_periods**: The number of time steps in a seasonal period, e.g. 12 for 12 months in a yearly seasonal structure (more here).

The model can then be fit on the training data by calling the *fit()* function.

This function allows you to either specify the smoothing coefficients of the exponential smoothing model or have them optimized. By default, they are optimized (e.g. *optimized=True*). These coefficients include:

**smoothing_level**(*alpha*): the smoothing coefficient for the level.**smoothing_slope**(*beta*): the smoothing coefficient for the trend.**smoothing_seasonal**(*gamma*): the smoothing coefficient for the seasonal component.**damping_slope**(*phi*): the coefficient for the damped trend.

Additionally, the fit function can perform basic data preparation prior to modeling; specifically:

**use_boxcox**: Whether or not to perform a power transform of the series (True/False) or specify the lambda for the transform.

The *fit()* function will return an instance of the *HoltWintersResults* class that contains the learned coefficients. The *forecast()* or the *predict()* function on the result object can be called to make a forecast.

# double or triple exponential smoothing ... from statsmodels.tsa.holtwinters import ExponentialSmoothing # prepare data data = ... # create class model = ExponentialSmoothing(data, ...) # fit model model_fit = model.fit(...) # make prediction yhat = model_fit.predict(...)

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Exponential smoothing, Forecasting: principles and practice, 2013.
- Section 6.4. Introduction to Time Series Analysis, Engineering Statistics Handbook, 2012.
- Practical Time Series Forecasting with R, 2016.

- Statsmodels Time Series analysis tsa
- statsmodels.tsa.holtwinters.SimpleExpSmoothing API
- statsmodels.tsa.holtwinters.ExponentialSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- forecast: Forecasting Functions for Time Series and Linear Models R package

In this tutorial, you discovered the exponential smoothing method for univariate time series forecasting.

Specifically, you learned:

- What exponential smoothing is and how it is different from other forecast methods.
- The three main types of exponential smoothing and how to configure them.
- How to implement exponential smoothing in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Exponential Smoothing for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to SARIMA for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

]]>Although the method can handle data with a trend, it does not support time series with a seasonal component.

An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA.

In this tutorial, you will discover the Seasonal Autoregressive Integrated Moving Average, or SARIMA, method for time series forecasting with univariate data containing trends and seasonality.

After completing this tutorial, you will know:

- The limitations of ARIMA when it comes to seasonal data.
- The SARIMA extension of ARIMA that explicitly models the seasonal element in univariate data.
- How to implement the SARIMA method in Python using the Statsmodels library.

Let’s get started.

This tutorial is divided into four parts; they are:

- What’s Wrong with ARIMA
- What Is SARIMA?
- How to Configure SARIMA
- How to use SARIMA in Python

Autoregressive Integrated Moving Average, or ARIMA, is a forecasting method for univariate time series data.

As its name suggests, it supports both an autoregressive and moving average elements. The integrated element refers to differencing allowing the method to support time series data with a trend.

A problem with ARIMA is that it does not support seasonal data. That is a time series with a repeating cycle.

ARIMA expects data that is either not seasonal or has the seasonal component removed, e.g. seasonally adjusted via methods such as seasonal differencing.

For more on ARIMA, see the post:

An alternative is to use SARIMA.

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.

It adds three new hyperparameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.

A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA […] The seasonal part of the model consists of terms that are very similar to the non-seasonal components of the model, but they involve backshifts of the seasonal period.

— Page 242, Forecasting: principles and practice, 2013.

Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

There are three trend elements that require configuration.

They are the same as the ARIMA model; specifically:

**p**: Trend autoregression order.**d**: Trend difference order.**q**: Trend moving average order.

There are four seasonal elements that are not part of ARIMA that must be configured; they are:

**P**: Seasonal autoregressive order.**D**: Seasonal difference order.**Q**: Seasonal moving average order.**m**: The number of time steps for a single seasonal period.

Together, the notation for an SARIMA model is specified as:

SARIMA(p,d,q)(P,D,Q)m

Where the specifically chosen hyperparameters for a model are specified; for example:

SARIMA(3,1,0)(1,1,0)12

Importantly, the *m* parameter influences the *P*, *D*, and *Q* parameters. For example, an m of 12 for monthly data suggests a yearly seasonal cycle.

A *P*=1 would make use of the first seasonally offset observation in the model, e.g. t-(m*1) or t-12. A *P*=2, would use the last two seasonally offset observations t-(m * 1), t-(m * 2).

Similarly, a *D* of 1 would calculate a first order seasonal difference and a *Q*=1 would use a first order errors in the model (e.g. moving average).

A seasonal ARIMA model uses differencing at a lag equal to the number of seasons (s) to remove additive seasonal effects. As with lag 1 differencing to remove a trend, the lag s differencing introduces a moving average term. The seasonal ARIMA model includes autoregressive and moving average terms at lag s.

— Page 142, Introductory Time Series with R, 2009.

The trend elements can be chosen through careful analysis of ACF and PACF plots looking at the correlations of recent time steps (e.g. 1, 2, 3).

Similarly, ACF and PACF plots can be analyzed to specify values for the seasonal model by looking at correlation at seasonal lag time steps.

For more on interpreting ACF/PACF plots, see the post:

Seasonal ARIMA models can potentially have a large number of parameters and combinations of terms. Therefore, it is appropriate to try out a wide range of models when fitting to data and choose a best fitting model using an appropriate criterion …

— Pages 143-144, Introductory Time Series with R, 2009.

Alternately, a grid search can be used across the trend and seasonal hyperparameters.

For more on grid searching ARIMA parameters, see the post:

The SARIMA time series forecasting method is supported in Python via the Statsmodels library.

To use SARIMA there are three steps, they are:

- Define the model.
- Fit the defined model.
- Make a prediction with the fit model.

Let’s look at each step in turn.

An instance of the SARIMAX class can be created by providing the training data and a host of model configuration parameters.

# specify training data data = ... # define model model = SARIMAX(data, ...)

The implementation is called SARIMAX instead of SARIMA because the “X” addition to the method name means that the implementation also supports exogenous variables.

These are parallel time series variates that are not modeled directly via AR, I, or MA processes, but are made available as a weighted input to the model.

Exogenous variables are optional can be specified via the “*exog*” argument.

# specify training data data = ... # specify additional data other_data = ... # define model model = SARIMAX(data, exog=other_data, ...)

The trend and seasonal hyperparameters are specified as 3 and 4 element tuples respectively to the “*order*” and “*seasonal_order*” arguments.

These elements must be specified.

# specify training data data = ... # define model configuration my_order = (1, 1, 1) my_seasonal_order = (1, 1, 1, 12) # define model model = SARIMAX(data, order=my_order, seasonal_order=my_seasonal_order, ...)

These are the main configuration elements.

There are other fine tuning parameters you may want to configure. Learn more in the full API:

Once the model is created, it can be fit on the training data.

The model is fit by calling the fit() function.

Fitting the model returns an instance of the *SARIMAXResults* class. This object contains the details of the fit, such as the data and coefficients, as well as functions that can be used to make use of the model.

# specify training data data = ... # define model model = SARIMAX(data, order=..., seasonal_order=...) # fit model model_fit = model.fit()

Many elements of the fitting process can be configured, and it is worth reading the API to review these options once you are comfortable with the implementation.

Once fit, the model can be used to make a forecast.

A forecast can be made by calling the *forecast()* or the *predict()* functions on the *SARIMAXResults* object returned from calling fit.

The forecast() function takes a single parameter that specifies the number of out of sample time steps to forecast, or assumes a one step forecast if no arguments are provided.

# specify training data data = ... # define model model = SARIMAX(data, order=..., seasonal_order=...) # fit model model_fit = model.fit() # one step forecast yhat = model_fit.forecast()

The *predict()* function requires a start and end date or index to be specified.

Additionally, if exogenous variables were provided when defining the model, they too must be provided for the forecast period to the *predict()* function.

# specify training data data = ... # define model model = SARIMAX(data, order=..., seasonal_order=...) # fit model model_fit = model.fit() # one step forecast yhat = model_fit.predict(start=len(data), end=len(data))

This section provides more resources on the topic if you are looking to go deeper.

- How to Create an ARIMA Model for Time Series Forecasting with Python
- How to Grid Search ARIMA Model Hyperparameters with Python
- A Gentle Introduction to Autocorrelation and Partial Autocorrelation

- Chapter 8 ARIMA models, Forecasting: principles and practice, 2013.
- Chapter 7, Non-stationary Models, Introductory Time Series with R, 2009.

- Statsmodels Time Series Analysis by State Space Methods
- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Statsmodels SARIMAX Notebook

In this tutorial, you discovered the Seasonal Autoregressive Integrated Moving Average, or SARIMA, method for time series forecasting with univariate data containing trends and seasonality.

Specifically, you learned:

- The limitations of ARIMA when it comes to seasonal data.
- The SARIMA extension of ARIMA that explicitly models the seasonal element in univariate data.
- How to implement the SARIMA method in Python using the Statsmodels library.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to SARIMA for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

]]>The post 15 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

- The name of the test.
- What the test is checking.
- The key assumptions of the test.
- How the test result is interpreted.
- Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.

Let’s get started.

This tutorial is divided into four parts; they are:

- Normality Tests
- Correlation Tests
- Parametric Statistical Hypothesis Tests
- Nonparametric Statistical Hypothesis Tests

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import shapiro data1 = .... stat, p = shapiro(data)

More Information

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import normaltest data1 = .... stat, p = normaltest(data)

More Information

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import anderson data1 = .... result = anderson(data)

More Information

This section lists statistical tests that you can use to check if two samples are related.

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import pearsonr data1, data2 = ... corr, p = pearsonr(data1, data2)

More Information

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import spearmanr data1, data2 = ... corr, p = spearmanr(data1, data2)

More Information

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import kendalltau data1, data2 = ... corr, p = kendalltau(data1, data2)

More Information

Tests whether two categorical variables are related or independent.

Assumptions

- Observations used in the calculation of the contingency table are independent.
- 25 or more examples in each cell of the contingency table.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import chi2_contingency table = ... stat, p, dof, expected = chi2_contingency(table)

More Information

This section lists statistical tests that you can use to compare data samples.

Tests whether the means of two independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

from scipy.stats import ttest_ind data1, data2 = ... stat, p = ttest_ind(data1, data2)

More Information

Tests whether the means of two paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

from scipy.stats import ttest_rel data1, data2 = ... stat, p = ttest_rel(data1, data2)

More Information

Tests whether the means of two or more independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

from scipy.stats import f_oneway data1, data2, ... = ... stat, p = f_oneway(data1, data2, ...)

More Information

Tests whether the means of two or more paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

Currently not supported in Python.

More Information

Tests whether the distributions of two independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

from scipy.stats import mannwhitneyu data1, data2 = ... stat, p = mannwhitneyu(data1, data2)

More Information

Tests whether the distributions of two paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

from scipy.stats import wilcoxon data1, data2 = ... stat, p = wilcoxon(data1, data2)

More Information

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

from scipy.stats import kruskal data1, data2, ... = ... stat, p = kruskal(data1, data2, ...)

More Information

Tests whether the distributions of two or more paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

from scipy.stats import friedmanchisquare data1, data2, ... = ... stat, p = friedmanchisquare(data1, data2, ...)

More Information

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Normality Tests in Python
- How to Use Correlation to Understand the Relationship Between Variables
- How to Use Parametric Statistical Significance Tests in Python
- A Gentle Introduction to Statistical Hypothesis Tests

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

- The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
- The key assumptions for each test and how to interpret the test result.
- How to implement the test using the Python API.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Did I miss an important statistical test or key assumption for one of the listed tests?

Let me know in the comments below.

The post 15 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>The post How to Reduce Variance in a Final Machine Learning Model appeared first on Machine Learning Mastery.

]]>A problem with most final models is that they suffer variance in their predictions.

This means that each time you fit a model, you get a slightly different set of parameters that in turn will make slightly different predictions. Sometimes more and sometimes less skillful than what you expected.

This can be frustrating, especially when you are looking to deploy a model into an operational environment.

In this post, you will discover how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

After reading this post, you will know:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Let’s get started.

Once you have discovered which model and model hyperparameters result in the best skill on your dataset, you’re ready to prepare a final model.

A final model is trained on all available data, e.g. the training and the test sets.

It is the model that you will use to make predictions on new data were you do not know the outcome.

The final model is the outcome of your applied machine learning project.

To learn more about preparing a final model, see the post:

The bias-variance trade-off is a conceptual idea in applied machine learning to help understand the sources of error in models.

**Bias**refers to assumptions in the learning algorithm that narrow the scope of what can be learned. This is useful as it can accelerate learning and lead to stable results, at the cost of the assumption differing from reality.**Variance**refers to the sensitivity of the learning algorithm to the specifics of the training data, e.g. the noise and specific observations. This is good as the model will be specialized to the data at the cost of learning random noise and varying each time it is trained on different data.

The bias-variance tradeoff is a conceptual tool to think about these sources of error and how they are always kept in balance.

More bias in an algorithm means that there is less variance, and the reverse is also true.

You can learn more about the bias-variance tradeoff in this post:

You can control this balance.

Many machine learning algorithms have hyperparameters that directly or indirectly allow you to control the bias-variance tradeoff.

For example, the *k* in *k*-nearest neighbors is one example. A small *k* results in predictions with high variance and low bias. A large *k* results in predictions with a small variance and a large bias.

Most final models have a problem: they suffer from variance.

Each time a model is trained by an algorithm with high variance, you will get a slightly different result.

The slightly different model in turn will make slightly different predictions, for better or worse.

This is a problem with training a final model as we are required to use the model to make predictions on real data where we do not know the answer and we want those predictions to as good as possible.

We want to the best possible version of the model that we can get.

We want the variance to play out in our favor.

If we can’t achieve that, at least we want the variance to not fall against us when making predictions.

There are two common sources of variance in a final model:

- The noise in the training data.
- The use of randomness in the machine learning algorithm.

The first type we introduced above.

The second type impacts those algorithms that harness randomness during learning.

Three common examples include:

- Choice of random split points in random forest.
- Random weight initialization in neural networks.
- Shuffling training data in stochastic gradient descent.

You can measure both types of variance in your specific model using your training data.

**Measure Algorithm Variance**: The variance introduced by the stochastic nature of the algorithm can be measured by repeating the evaluation of the algorithm on the same training dataset and calculating the variance or standard deviation of the model skill.**Measure Training Data Variance**: The variance introduced by the training data can be measured by repeating the evaluation of the algorithm on different samples of training data, but keeping the seed for the pseudorandom number generator fixed then calculating the variance or standard deviation of the model skill.

Often, the combined variance is estimated by running repeated k-fold cross-validation on a training dataset then calculating the variance or standard deviation of the model skill.

If we want to reduce the amount of variance in a prediction, we must add bias.

Consider the case of a simple statistical estimate of a population parameter, such as estimating the mean from a small random sample of data.

A single estimate of the mean will have high variance and low bias.

This is intuitive because if we repeated this process 30 times and calculated the standard deviation of the estimated mean values, we would see a large spread.

The solutions for reducing the variance are also intuitive.

Repeat the estimate on many different small samples of data from the domain and calculate the mean of the estimates, leaning on the central limit theorem.

The mean of the estimated means will have a lower variance. We have increased the bias by assuming that the average of the estimates will be a more accurate estimate than a single estimate.

Another approach would be to dramatically increase the size of the data sample on which we estimate the population mean, leaning on the law of large numbers.

The principles used to reduce the variance for a population statistic can also be used to reduce the variance of a final model.

We must add bias.

Depending on the specific form of the final model (e.g. tree, weights, etc.) you can get creative with this idea.

Below are three approaches that you may want to try.

If possible, I recommend designing a test harness to experiment and discover an approach that works best or makes the most sense for your specific data set and machine learning algorithm.

Instead of fitting a single final model, you can fit multiple final models.

Together, the group of final models may be used as an ensemble.

For a given input, each model in the ensemble makes a prediction and the final output prediction is taken as the average of the predictions of the models.

A sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.

As above, multiple final models can be created instead of a single final model.

Instead of calculating the mean of the predictions from the final models, a single final model can be constructed as an ensemble of the parameters of the group of final models.

This would only make sense in cases where each model has the same number of parameters, such as neural network weights or regression coefficients.

For example, consider a linear regression model with three coefficients [b0, b1, b2]. We could fit a group of linear regression models and calculate a final b0 as the average of b0 parameters in each model, and repeat this process for b1 and b2.

Again, a sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.

Leaning on the law of large numbers, perhaps the simplest approach to reduce the model variance is to fit the model on more training data.

In those cases where more data is not readily available, perhaps data augmentation methods can be used instead.

A sensitivity analysis of training dataset size to prediction variance is recommended to find the point of diminishing returns.

There are approaches to preparing a final model that aim to get the variance in the final model to work for you rather than against you.

The commonality in these approaches is that they seek a single best final model.

Two examples include:

**Why not fix the random seed?**You could fix the random seed when fitting the final model. This will constrain the variance introduced by the stochastic nature of the algorithm.**Why not use early stopping?**You could check the skill of the model against a holdout set during training and stop training when the skill of the model on the hold set starts to degrade.

I would argue that these approaches and others like them are fragile.

Perhaps you can gamble and aim for the variance to play-out in your favor. This might be a good approach for machine learning competitions where there is no real downside to losing the gamble.

I won’t.

I think it’s safer to aim for the best average performance and limit the downside.

I think that the trick with navigating the bias-variance tradeoff for a final model is to think in samples, not in terms of single models. To optimize for average model performance.

This section provides more resources on the topic if you are looking to go deeper.

- How to Train a Final Machine Learning Model
- Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning
- Bias-Variance Tradeoff on Wikipedia
- Checkpoint Ensembles: Ensemble Methods from a Single Training Process, 2017.

In this post, you discovered how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

Specifically, you learned:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Reduce Variance in a Final Machine Learning Model appeared first on Machine Learning Mastery.

]]>The post How to Develop a Skillful Machine Learning Time Series Forecasting Model appeared first on Machine Learning Mastery.

]]>*What do you do?*

This is a common situation; far more common than most people think.

- Perhaps you are sent a CSV file.
- Perhaps you are given access to a database.
- Perhaps you are starting a competition.

The problem can be reasonably well defined:

- You have or can access historical time series data.
- You know or can find out what needs to be forecasted.
- You know or can find out how what is most important in evaluating a candidate model.

So how do you tackle this problem?

Unless you have been through this trial by fire, you may struggle.

- You may struggle because you are new to the fields of machine learning and time series.
- You may struggle even if you have machine learning experience because time series data is different.
- You may struggle even if you have a background in time series forecasting because machine learning methods may outperform the classical approaches on your data.

In all of these cases, you will benefit from working through the problem carefully and systematically.

In this post, I want to give you a specific and actionable procedure that you can use to work through your time series forecasting problem.

Let’s get started.

The goal of this process is to get a “*good enough*” forecast model as fast as possible.

This process may or may not deliver the best possible model, but it will deliver a good model: a model that is better than a baseline prediction, if such a model exists.

Typically, this process will deliver a model that is 80% to 90% of what can be achieved on the problem.

The process is fast. As such, it focuses on automation. Hyperparameters are searched rather than specified based on careful analysis. You are encouraged to test suites of models in parallel, rapidly getting an idea of what works and what doesn’t.

Nevertheless, the process is flexible, allowing you to circle back or go as deep as you like on a given step if you have the time and resources.

This process is divided into four parts; they are:

- Define Problem
- Design Test Harness
- Test Models
- Finalize Model

You will notice that the process is different from a classical linear work-through of a predictive modeling problem. This is because it is designed to get a working forecast model fast and then slow down and see if you can get a better model.

What is your process for working through a new time series forecasting problem?

Share it below in the comments.

The biggest mistake is skipping steps.

For example, the mistake that almost all beginners make is going straight to modeling without a strong idea of what problem is being solved or how to robustly evaluate candidate solutions. This almost always results in a lot of wasted time.

Slow down, follow the process, and complete each step.

I recommend having separate code for each experiment that can be re-run at any time.

This is important so that you can circle back when you discover a bug, fix the code, and re-run an experiment. You are running experiments and iterating quickly, but if you are sloppy, then you cannot trust any of your results. This is especially important when it comes to the design of your test harness for evaluating candidate models.

Let’s take a closer look at each step of the process.

Define your time series problem.

Some topics to consider and motivating questions within each topic are as follows:

- Inputs vs. Outputs
- What are the inputs and outputs for a forecast?

- Endogenous vs. Exogenous
- What are the endogenous and exogenous variables?

- Unstructured vs. Structured
- Are the time series variables unstructured or structured?

- Regression vs. Classification
- Are you working on a regression or classification predictive modeling problem?
- What are some alternate ways to frame your time series forecasting problem?

- Univariate vs. Multivariate
- Are you working on a univariate or multivariate time series problem?

- Single-step vs. Multi-step
- Do you require a single-step or a multi-step forecast?

- Static vs. Dynamic
- Do you require a static or a dynamically updated model?

Answer each question even if you have to estimate or guess.

Some useful tools to help get answers include:

- Data visualizations (e.g. line plots, etc.).
- Statistical analysis (e.g. ACF/PACF plots, etc.).
- Domain experts.
- Project stakeholders.

Update your answers to these questions as you learn more.

Design a test harness that you can use to evaluate candidate models.

This includes both the method used to estimate model skill and the metric used to evaluate predictions.

Below is a common time series forecasting model evaluation scheme if you are looking for ideas:

- Split the dataset into a train and test set.
- Fit a candidate approach on the training dataset.
- Make predictions on the test set directly or using walk-forward validation.
- Calculate a metric that compares the predictions to the expected values.

The test harness must be robust and you must have complete trust in the results it provides.

An important consideration is to ensure that any coefficients used for data preparation are estimated from the training dataset only and then applied on the test set. This might include mean and standard deviation in the case of data standardization.

Test many models using your test harness.

I recommend carefully designing experiments to test a suite of configurations for standard models and letting them run. Each experiment can record results to a file, to allow you to quickly discover the top three to five most skilful configurations from each run.

Some common classes of methods that you can design experiments around include the following:

- Baseline.
- Persistence (grid search the lag observation that is persisted)
- Rolling moving average.
- …

- Autoregression.
- ARMA for stationary data.
- ARIMA for data with a trend.
- SARIMA for data with seasonality.
- …

- Exponential Smoothing.
- Simple Smoothing
- Holt Winters Smoothing
- …

- Linear Machine Learning.
- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
- ….

- Nonlinear Machine Learning.
- k-Nearest Neighbors
- Classification and Regression Trees
- Support Vector Regression
- …

- Ensemble Machine Learning.
- Bagging
- Boosting
- Random Forest
- Gradient Boosting
- …

- Deep Learning.
- MLP
- CNN
- LSTM
- Hybrids
- …

This list is based on a univariate time series forecasting problem, but you can adapt it for the specifics of your problem, e.g. use VAR/VARMA/etc. in the case of multivariate time series forecasting.

Slot in more of your favorite classical time series forecasting methods and machine learning methods as you see fit.

Order here is important and is structured in increasing complexity from classical to modern methods. Early approaches are simple and give good results fast; later approaches are slower and more complex, but also have a higher bar to clear to be skillful.

The resulting model skill can be used in a ratchet. For example, the skill of the best persistence configuration provide a baseline skill that all other models must outperform. If an autoregression model does better than persistence, it becomes the new level to outperform in order for a method to be considered skilful.

Ideally, you want to exhaust each level before moving on to the next. E.g. get the most out of Autoregression methods and use the results as a new baseline to define “skilful” before moving on to Exponential Smoothing methods.

I put deep learning at the end as generally neural networks are poor at time series forecasting, but there is still a lot of room for improvement and experimentation in this area.

The more time and resources that you have, the more configurations that you can evaluate.

For example, with more time and resources, you could:

- Search model configurations at a finer resolution around a configuration known to already perform well.
- Search more model hyperparameter configurations.
- Use analysis to set better bounds on model hyperparameters to be searched.
- Use domain knowledge to better prepare data or engineer input features.
- Explore different potentially more complex methods.
- Explore ensembles of well performing base models.

I also encourage you to include data preparation schemes as hyperparameters for model runs.

Some methods will perform some basic data preparation, such as differencing in ARIMA, nevertheless, it is often unclear exactly what data preparation schemes or combinations of schemes are required to best present a dataset to a modeling algorithm. Rather than guess, grid search and decide based on real results.

Some data preparation schemes to consider include:

- Differencing to remove a trend.
- Seasonal differencing to remove seasonality.
- Standardize to center.
- Normalize to rescale.
- Power Transform to make normal.

So much searching can be slow.

Some ideas to speed up the evaluation of models include:

- Use multiple machines in parallel via cloud hardware (such as Amazon EC2).
- Reduce the size of the train or test dataset to make the evaluation process faster.
- Use a more coarse grid of hyperparameters and circle back if you have time later.
- Perhaps do not refit a model for each step in walk-forward validation.

At the end of the previous time step, you know whether your time series is predictable.

If it is predictable, you will have a list of the top 5 to 10 candidate models that are skillful on the problem.

You can pick one or multiple models and finalize them. This involves training a new final model on all available historical data (train and test).

The model is ready for use; for example:

- Make a prediction for the future.
- Save the model to file for later use in making predictions.
- Incorporate the model into software for making predictions.

If you have time, you can always circle back to the previous step and see if you can further improve upon the final model.

This may be required periodically if the data changes significantly over time.

This section provides more resources on the topic if you are looking to go deeper.

- What Is Time Series Forecasting?
- How to Work Through a Time Series Forecast Project
- How to Train a Final Machine Learning Model

In this post, you discovered a simple four-step process that you can use to quickly discover a skilful predictive model for your time series forecasting problem.

Did you find this process useful?

Let me know below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Skillful Machine Learning Time Series Forecasting Model appeared first on Machine Learning Mastery.

]]>The post Taxonomy of Time Series Forecasting Problems appeared first on Machine Learning Mastery.

]]>The choice that you make directly impacts each step of the project from the design of a test harness to evaluate forecast models to the fundamental difficulty of the forecast problem that you are working on.

It is possible to very quickly narrow down the options by working through a series of questions about your time series forecasting problem. By considering a few themes and questions within each theme, you narrow down the type of problem, test harness, and even choice of algorithms for your project.

In this post, you will discover a framework that you can use to quickly understand and frame your time series forecasting problem.

Let’s get started.

Time series forecasting involves developing and using a predictive model on data where there is an ordered relationship between observations.

You can learn more about what time series forecasting is in this post:

Before you get started on your project, you can answer a few questions and greatly improve your understanding of the structure of your forecast problem, the structure of the model requires, and how to evaluate it.

The framework presented in this post is divided into seven parts; they are:

- Inputs vs. Outputs
- Endogenous vs. Exogenous
- Unstructured vs. Structured
- Regression vs. Classification
- Univariate vs. Multivariate
- Single-step vs. Multi-step
- Static vs. Dynamic
- Contiguous vs. Discontiguous

I recommend working through this framework before starting any time series forecasting project.

Your answers may not be crisp on the first time through and the questions may require to you study the data, the domain, and talk to experts and stakeholders.

Update your answers as you learn more as it will help to keep you on track, avoid distractions, and develop the actual model that you need for your project.

Generally, a prediction problem involves using past observations to predict or forecast one or more possible future observations.

The goal is to guess about what might happen in the future.

When you are required to make a forecast, it is critical to think about the data that you will have available to make the forecast and what you will be guessing about the future.

We can summarize this as what are the inputs and outputs of the model when making a single forecast.

**Inputs**: Historical data provided to the model in order to make a single forecast.**Outputs**: Prediction or forecast for a future time step beyond the data provided as input.

The input data is not the data used to train the model. We are not at that point yet. It is the data used to make one forecast, for example the last seven days of sales data to forecast the next one day of sales data.

Defining the inputs and outputs of the model forces you to think about what exactly is or may be required to make a forecast.

You may not be able to be specific when it comes to input data. For example, you may not know whether one or multiple prior time steps are required to make a forecast. But you will be able to identify the variables that could be used to make a forecast.

What are the inputs and outputs for a forecast?

The input data can be further subdivided in order to better understand its relationship to the output variable.

An input variable is endogenous if it is affected by other variables in the system and the output variable depends on it.

In a time series, the observations for an input variable depend upon one another. For example, the observation at time t is dependent upon the observation at t-1; t-1 may depend on t-2, and so on.

An input variable is an exogenous variable if it is independent of other variables in the system and the output variable depends upon it.

Put simply, endogenous variables are influenced by other variables in the system (including themselves) whereas as exogenous variables are not and are considered as outside the system.

**Endogenous**: Input variables that are influenced by other variables in the system and on which the output variable depends.**Exogenous**: Input variables that are not influenced by other variables in the system and on which the output variable depends.

Typically, a time series forecasting problem has endogenous variables (e.g. the output is a function of some number of prior time steps) and may or may not have exogenous variables.

Often, exogenous variables are ignored given the strong focus on the time series. Explicitly thinking about both variable types may help to identify easily overlooked exogenous data or even engineered features that may improve the model.

What are the endogenous and exogenous variables?

Regression predictive modeling problems are those where a quantity is predicted.

A quantity is a numerical value; for example a price, a count, a volume, and so on. A time series forecasting problem in which you want to predict one or more future numerical values is a regression type predictive modeling problem.

Classification predictive modeling problems are those where a category is predicted.

A category is a label from a small well-defined set of labels; for example {“hot”, “cold”}, {“up”, “down”}, and {“buy”, “sell”} are categories.

A time series forecasting problem in which you want to classify input time series data is a classification type predictive modeling problem.

**Regression**: Forecast a numerical quantity.**Classification**: Classify as one of two or more labels.

Are you working on a regression or classification predictive modeling problem?

There is some flexibility between these types.

For example, a regression problem can be reframed as classification and a classification problem can be reframed as regression. Some problems, like predicting an ordinal value, can be framed as either classification and regression. It is possible that a reframing of your time series forecasting problem may simplify it.

What are some alternate ways to frame your time series forecasting problem?

It is useful to plot each variable in a time series and inspect the plot looking for possible patterns.

A time series for a single variable may not have any obvious pattern.

We can think of a series with no pattern as unstructured, as in there is no discernible time-dependent structure.

Alternately, a time series may have obvious patterns, such as a trend or seasonal cycles as structured.

We can often simplify the modeling process by identifying and removing the obvious structures from the data, such as an increasing trend or repeating cycle. Some classical methods even allow you to specify parameters to handle these systematic structures directly.

**Unstructured**: No obvious systematic time-dependent pattern in a time series variable.**Structured**: Systematic time-dependent patterns in a time series variable (e.g. trend and/or seasonality).

Are the time series variables unstructured or structured?

A single variable measured over time is referred to as a univariate time series. Univariate means one variate or one variable.

Multiple variables measured over time is referred to as a multivariate time series: multiple variates or multiple variables.

**Univariate**: One variable measured over time.**Multivariate**: Multiple variables measured over time.

Are you working on a univariate or multivariate time series problem?

Considering this question with regard to inputs and outputs may add a further distinction. The number of variables may differ between the inputs and outputs, e.g. the data may not be symmetrical.

For example, you may have multiple variables as input to the model and only be interested in predicting one of the variables as output. In this case, there is an assumption in the model that the multiple input variables aid and are required in predicting the single output variable.

**Univariate and Multivariate Inputs**: One or multiple input variables measured over time.**Univariate and Multivariate Outputs**: One or multiple output variables to be predicted.

A forecast problem that requires a prediction of the next time step is called a one-step forecast model.

Whereas a forecast problem that requires a prediction of more than one time step is called a multi-step forecast model.

The more time steps to be projected into the future, the more challenging the problem given the compounding nature of the uncertainty on each forecasted time step.

**One-Step**: Forecast the next time step.**Multi-Step**: Forecast more than one future time steps.

Do you require a single-step or a multi-step forecast?

It is possible to develop a model once and use it repeatedly to make predictions.

Given that the model is not updated or changed between forecasts, we can think of this model as being static.

Conversely, we may receive new observations prior to making a subsequent forecast that could be used to create a new model or update the existing model. We can think of developing a new or updated model prior to each forecasts as a dynamic problem.

For example, it if the problem requires a forecast at the beginning of the week for the week ahead, we may receive the true observation at the end of the week that we can use to update the model prior to making next weeks forecast. This would be a dynamic model. If we do not get a true observation at the end of the week or we do and choose to not re-fit the model, this would be a static model.

We may prefer a dynamic model, but the constraints of the domain or limitations of a chosen algorithm may impose constraints that make this intractable.

**Static**. A forecast model is fit once and used to make predictions.**Dynamic**. A forecast model is fit on newly available data prior to each prediction.

Do you require a static or a dynamically updated model?

A time series where the observations are uniform over time may be described as contiguous.

Many time series problems have contiguous observations, such as one observation each hour, day, month or year.

A time series where the observations are not uniform over time may be described as discontiguous.

The lack of uniformity of the observations may be caused by missing or corrupt values. It may also be a feature of the problem where observations are only made available sporadically or at increasingly or decreasingly spaced time intervals.

In the case of non-uniform observations, specific data formatting may be required when fitting some models to make the observations uniform over time.

**Contiguous**. Observations are made uniform over time.**Discontiguous**. Observations are not uniform over time.

Are your observations contiguous or discontiguous?

This section lists some resources for further reading.

To review, the themes and questions you can ask about your problem are as follows:

- Inputs vs. Outputs
- What are the inputs and outputs for a forecast?

- Endogenous vs. Exogenous
- What are the endogenous and exogenous variables?

- Unstructured vs. Structured
- Are the time series variables unstructured or structured?

- Regression vs. Classification
- Are you working on a regression or classification predictive modeling problem?
- What are some alternate ways to frame your time series forecasting problem?

- Univariate vs. Multivariate
- Are you working on a univariate or multivariate time series problem?

- Single-step vs. Multi-step
- Do you require a single-step or a multi-step forecast?

- Static vs. Dynamic
- Do you require a static or a dynamically updated model?

- Contiguous vs. Discontiguous
- Are your observations contiguous or discontiguous?

Did you find this framework useful for your time series forecasting problem?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Taxonomy of Time Series Forecasting Problems appeared first on Machine Learning Mastery.

]]>The post 11 Classical Time Series Forecasting Methods in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>Before exploring machine learning methods for time series, it is a good idea to ensure you have exhausted classical linear time series forecasting methods. Classical time series forecasting methods may be focused on linear relationships, nevertheless, they are sophisticated and perform well on a wide range of problems, assuming that your data is suitably prepared and the method is well configured.

In this post, will you will discover a suite of classical methods for time series forecasting that you can test on your forecasting problem prior to exploring to machine learning methods.

The post is structured as a cheat sheet to give you just enough information on each method to get started with a working code example and where to look to get more information on the method.

All code examples are in Python and use the Statsmodels library. The APIs for this library can be tricky for beginners (trust me!), so having a working code example as a starting point will greatly accelerate your progress.

This is a large post; you may want to bookmark it.

Let’s get started.

This cheat sheet demonstrates 11 different classical time series forecasting methods; they are:

- Autoregression (AR)
- Moving Average (MA)
- Autoregressive Moving Average (ARMA)
- Autoregressive Integrated Moving Average (ARIMA)
- Seasonal Autoregressive Integrated Moving-Average (SARIMA)
- Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
- Vector Autoregression (VAR)
- Vector Autoregression Moving-Average (VARMA)
- Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
- Simple Exponential Smoothing (SES)
- Holt Winter’s Exponential Smoothing (HWES)

Did I miss your favorite classical time series forecasting method?

Let me know in the comments below.

Each method is presented in a consistent manner.

This includes:

**Description**. A short and precise description of the technique.**Python Code**. A short working example of fitting the model and making a prediction in Python.**More Information**. References for the API and the algorithm.

Each code example is demonstrated on a simple contrived dataset that may or may not be appropriate for the method. Replace the contrived dataset with your data in order to test the method.

Remember: each method will require tuning to your specific problem. In many cases, I have examples of how to configure and even grid search parameters on the blog already, try the search function.

If you find this cheat sheet useful, please let me know in the comments below.

The autoregression (AR) method models the next step in the sequence as a linear function of the observations at prior time steps.

The notation for the model involves specifying the order of the model p as a parameter to the AR function, e.g. AR(p). For example, AR(1) is a first-order autoregression model.

The method is suitable for univariate time series without trend and seasonal components.

# AR example from statsmodels.tsa.ar_model import AR from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = AR(data) model_fit = model.fit() # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.ar_model.AR API
- statsmodels.tsa.ar_model.ARResults API
- Autoregressive model on Wikipedia

The moving average (MA) method models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.

A moving average model is different from calculating the moving average of the time series.

The notation for the model involves specifying the order of the model q as a parameter to the MA function, e.g. MA(q). For example, MA(1) is a first-order moving average model.

The method is suitable for univariate time series without trend and seasonal components.

We can use the ARMA class to create an MA model and setting a zeroth-order AR model. We must specify the order of the MA model in the order argument.

# MA example from statsmodels.tsa.arima_model import ARMA from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = ARMA(data, order=(0, 1)) model_fit = model.fit(disp=False) # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.arima_model.ARMA API
- statsmodels.tsa.arima_model.ARMAResults API
- Moving-average model on Wikipedia

The Autoregressive Moving Average (ARMA) method models the next step in the sequence as a linear function of the observations and resiudal errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to an ARMA function, e.g. ARMA(p, q). An ARIMA model can be used to develop AR or MA models.

The method is suitable for univariate time series without trend and seasonal components.

# ARMA example from statsmodels.tsa.arima_model import ARMA from random import random # contrived dataset data = [random() for x in range(1, 100)] # fit model model = ARMA(data, order=(2, 1)) model_fit = model.fit(disp=False) # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.arima_model.ARMA API
- statsmodels.tsa.arima_model.ARMAResults API
- Autoregressive–moving-average model on Wikipedia

The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence to make the sequence stationary, called integration (I).

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function, e.g. ARIMA(p, d, q). An ARIMA model can also be used to develop AR, MA, and ARMA models.

The method is suitable for univariate time series with trend and without seasonal components.

# ARIMA example from statsmodels.tsa.arima_model import ARIMA from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = ARIMA(data, order=(1, 1, 1)) model_fit = model.fit(disp=False) # make prediction yhat = model_fit.predict(len(data), len(data), typ='levels') print(yhat)

- statsmodels.tsa.arima_model.ARIMA API
- statsmodels.tsa.arima_model.ARIMAResults API
- Autoregressive integrated moving average on Wikipedia

The Seasonal Autoregressive Integrated Moving Average (SARIMA) method models the next step in the sequence as a linear function of the differenced observations, errors, differenced seasonal observations, and seasonal errors at prior time steps.

It combines the ARIMA model with the ability to perform the same autoregression, differencing, and moving average modeling at the seasonal level.

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function and AR(P), I(D), MA(Q) and m parameters at the seasonal level, e.g. SARIMA(p, d, q)(P, D, Q)m where “m” is the number of time steps in each season (the seasonal period). A SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.

The method is suitable for univariate time series with trend and/or seasonal components.

# SARIMA example from statsmodels.tsa.statespace.sarimax import SARIMAX from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 1)) model_fit = model.fit(disp=False) # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Autoregressive integrated moving average on Wikipedia

The Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of the SARIMA model that also includes the modeling of exogenous variables.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series may be referred to as endogenous data to contrast it from the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The SARIMAX method can also be used to model the subsumed models with exogenous variables, such as ARX, MAX, ARMAX, and ARIMAX.

The method is suitable for univariate time series with trend and/or seasonal components and exogenous variables.

# SARIMAX example from statsmodels.tsa.statespace.sarimax import SARIMAX from random import random # contrived dataset data1 = [x + random() for x in range(1, 100)] data2 = [x + random() for x in range(101, 200)] # fit model model = SARIMAX(data1, exog=data2, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0)) model_fit = model.fit(disp=False) # make prediction exog2 = [200 + random()] yhat = model_fit.predict(len(data1), len(data1), exog=[exog2]) print(yhat)

- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Autoregressive integrated moving average on Wikipedia

The Vector Autoregression (VAR) method models the next step in each time series using an AR model. It is the generalization of AR to multiple parallel time series, e.g. multivariate time series.

The notation for the model involves specifying the order for the AR(p) model as parameters to a VAR function, e.g. VAR(p).

The method is suitable for multivariate time series without trend and seasonal components.

# VAR example from statsmodels.tsa.vector_ar.var_model import VAR from random import random # contrived dataset with dependency data = list() for i in range(100): v1 = i + random() v2 = v1 + random() row = [v1, v2] data.append(row) # fit model model = VAR(data) model_fit = model.fit() # make prediction yhat = model_fit.forecast(model_fit.y, steps=1) print(yhat)

- statsmodels.tsa.vector_ar.var_model.VAR API
- statsmodels.tsa.vector_ar.var_model.VARResults API
- Vector autoregression on Wikipedia

The Vector Autoregression Moving-Average (VARMA) method models the next step in each time series using an ARMA model. It is the generalization of ARMA to multiple parallel time series, e.g. multivariate time series.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to a VARMA function, e.g. VARMA(p, q). A VARMA model can also be used to develop VAR or VMA models.

The method is suitable for multivariate time series without trend and seasonal components.

# VARMA example from statsmodels.tsa.statespace.varmax import VARMAX from random import random # contrived dataset with dependency data = list() for i in range(100): v1 = random() v2 = v1 + random() row = [v1, v2] data.append(row) # fit model model = VARMAX(data, order=(1, 1)) model_fit = model.fit(disp=False) # make prediction yhat = model_fit.forecast() print(yhat)

- statsmodels.tsa.statespace.varmax.VARMAX API
- statsmodels.tsa.statespace.varmax.VARMAXResults
- Vector autoregression on Wikipedia

The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) is an extension of the VARMA model that also includes the modeling of exogenous variables. It is a multivariate version of the ARMAX method.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series(es) are referred to as endogenous data to contrast it from the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The VARMAX method can also be used to model the subsumed models with exogenous variables, such as VARX and VMAX.

The method is suitable for univariate time series without trend and seasonal components and exogenous variables.

# VARMAX example from statsmodels.tsa.statespace.varmax import VARMAX from random import random # contrived dataset with dependency data = list() for i in range(100): v1 = random() v2 = v1 + random() row = [v1, v2] data.append(row) data_exog = [x + random() for x in range(100)] # fit model model = VARMAX(data, exog=data_exog, order=(1, 1)) model_fit = model.fit(disp=False) # make prediction data_exog2 = [[100]] yhat = model_fit.forecast(exog=data_exog2) print(yhat)

- statsmodels.tsa.statespace.varmax.VARMAX API
- statsmodels.tsa.statespace.varmax.VARMAXResults
- Vector autoregression on Wikipedia

The Simple Exponential Smoothing (SES) method models the next time step as an exponentially weighted linear function of observations at prior time steps.

The method is suitable for univariate time series without trend and seasonal components.

# SES example from statsmodels.tsa.holtwinters import SimpleExpSmoothing from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = SimpleExpSmoothing(data) model_fit = model.fit() # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.holtwinters.SimpleExpSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- Exponential smoothing on Wikipedia

The Holt Winter’s Exponential Smoothing (HWES) also called the Triple Exponential Smoothing method models the next time step as an exponentially weighted linear function of observations at prior time steps, taking trends and seasonality into account.

The method is suitable for univariate time series with trend and/or seasonal components.

# HWES example from statsmodels.tsa.holtwinters import ExponentialSmoothing from random import random # contrived dataset data = [x + random() for x in range(1, 100)] # fit model model = ExponentialSmoothing(data) model_fit = model.fit() # make prediction yhat = model_fit.predict(len(data), len(data)) print(yhat)

- statsmodels.tsa.holtwinters.ExponentialSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- Exponential smoothing on Wikipedia

This section provides more resources on the topic if you are looking to go deeper.

In this post, you discovered a suite of classical time series forecasting methods that you can test and tune on your time series dataset.

Did I miss your favorite classical time series forecasting method?

Let me know in the comments below.

Did you try any of these methods on your dataset?

Let me know about your findings in the comments.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 11 Classical Time Series Forecasting Methods in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>Statistics is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although statistics is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what statistics is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement statistical methods used in machine learning with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You want to learn statistics to deepen your understanding and application of machine learning.

You do NOT need to know:

- You do not need to be a math wiz!
- You do not need to be a machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of statistical methods.

Note: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with statistics for machine learning in Python:

**Lesson 01**: Statistics and Machine Learning**Lesson 02**: Introduction to Statistics**Lesson 03**: Gaussian Distribution and Descriptive Stats**Lesson 04**: Correlation Between Variables**Lesson 05**: Statistical Hypothesis Tests**Lesson 06**: Estimation Statistics**Lesson 07**: Nonparametric Statistics

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the statistical methods and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Statistical Methods for Machine Learning.”

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this lesson, you will discover the five reasons why a machine learning practitioner should deepen their understanding of statistics.

Statistical methods are required in the preparation of train and test data for your machine learning model.

This includes techniques for:

- Outlier detection.
- Missing value imputation.
- Data sampling.
- Data scaling.
- Variable encoding.

And much more.

A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.

Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

This includes techniques for:

- Data sampling.
- Data resampling.
- Experimental design.

Resampling techniques such as k-fold cross-validation are often well understood by machine learning practitioners, but the rationale for why this method is required is not.

Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

These include techniques for:

- Checking for a significant difference between results.
- Quantifying the size of the difference between results.

This might include the use of statistical hypothesis tests.

Statistical methods are required when presenting the skill of a final model to stakeholders.

This includes techniques for:

- Summarizing the expected skill of the model on average.
- Quantifying the expected variability of the skill of the model in practice.

This might include estimation statistics such as confidence intervals.

Statistical methods are required when making a prediction with a finalized model on new data.

This includes techniques for:

- Quantifying the expected variability for the prediction.

This might include estimation statistics such as prediction intervals.

For this lesson, you must list three reasons why you personally want to learn statistics.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of statistics.

In this lesson, you will discover a concise definition of statistics.

Statistics is a required prerequisite for most books and courses on applied machine learning. But what exactly is statistics?

Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

It is because the field is comprised of a grab bag of methods for working with data that it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study.

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

**Descriptive Statistics**: Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.**Inferential Statistics**: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

For this lesson, you must list three methods that can be used for each descriptive and inferential statistics.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover the Gaussian distribution and how to calculate summary statistics.

In this lesson, you will discover the Gaussian distribution for data and how to calculate simple descriptive statistics.

A sample of data is a snapshot from a broader population of all possible observations that could be taken from a domain or generated by a process.

Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar with.

A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data.

Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution, can be summarized with just two parameters:

**Mean**. The central tendency or most likely value in the distribution (the top of the bell).**Variance**. The average difference that observations have from the mean value in the distribution (the spread).

The units of the mean are the same as the units of the distribution, although the units of the variance are squared, and therefore harder to interpret. A popular alternative to the variance parameter is the **standard deviation**, which is simply the square root of the variance, returning the units to be the same as those of the distribution.

The mean, variance, and standard deviation can be calculated directly on data samples in NumPy.

The example below generates a sample of 100 random numbers drawn from a Gaussian distribution with a known mean of 50 and a standard deviation of 5 and calculates the summary statistics.

# calculate summary stats from numpy.random import seed from numpy.random import randn from numpy import mean from numpy import var from numpy import std # seed the random number generator seed(1) # generate univariate observations data = 5 * randn(10000) + 50 # calculate statistics print('Mean: %.3f' % mean(data)) print('Variance: %.3f' % var(data)) print('Standard Deviation: %.3f' % std(data))

Run the example and compare the estimated mean and standard deviation from the expected values.

For this lesson, you must implement the calculation of one descriptive statistic from scratch in Python, such as the calculation of a sample mean.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to quantify the relationship between two variables.

In this lesson, you will discover how to calculate a correlation coefficient to quantify the relationship between two variables.

Variables in a dataset may be related for lots of reasons.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variables’ values decrease.

**Positive Correlation**: Both variables change in the same direction.**Neutral Correlation**: No relationship in the change of the variables.**Negative Correlation**: Variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We can quantify the relationship between samples of two variables using a statistical method called Pearson’s correlation coefficient, named for the developer of the method, Karl Pearson.

The *pearsonr()* NumPy function can be used to calculate the Pearson’s correlation coefficient for samples of two variables.

The complete example is listed below showing the calculation where one variable is dependent upon the second.

# calculate correlation coefficient from numpy.random import seed from numpy.random import randn from scipy.stats import pearsonr # seed random number generator seed(1) # prepare data data1 = 20 * randn(1000) + 100 data2 = data1 + (10 * randn(1000) + 50) # calculate Pearson's correlation corr, p = pearsonr(data1, data2) # display the correlation print('Pearsons correlation: %.3f' % corr)

Run the example and review the calculated correlation coefficient.

For this lesson, you must load a standard machine learning dataset and calculate the correlation between each pair of numerical variables.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover statistical hypothesis tests.

In this lesson, you will discover statistical hypothesis tests and how to compare two samples.

Data must be interpreted in order to add meaning. We can interpret data by assuming a specific structure our outcome and use statistical methods to confirm or reject the assumption.

The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

The assumption of a statistical test is called the null hypothesis, or hypothesis zero (H0 for short). It is often called the default assumption, or the assumption that nothing has changed. A violation of the test’s assumption is often called the first hypothesis, hypothesis one, or H1 for short.

**Hypothesis 0 (H0)**: Assumption of the test holds and is failed to be rejected.**Hypothesis 1 (H1)**: Assumption of the test does not hold and is rejected at some level of significance.

We can interpret the result of a statistical hypothesis test using a p-value.

The p-value is the probability of observing the data, given the null hypothesis is true.

A large probability means that the H0 or default assumption is likely. A small value, such as below 5% (o.05) suggests that it is not likely and that we can reject H0 in favor of H1, or that something is likely to be different (e.g. a significant result).

A widely used statistical hypothesis test is the Student’s t-test for comparing the mean values from two independent samples.

The default assumption is that there is no difference between the samples, whereas a rejection of this assumption suggests some significant difference. The tests assumes that both samples were drawn from a Gaussian distribution and have the same variance.

The Student’s t-test can be implemented in Python via the *ttest_ind()* SciPy function.

Below is an example of calculating and interpreting the Student’s t-test for two data samples that are known to be different.

# student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distributions (fail to reject H0)') else: print('Different distributions (reject H0)')

Run the code and review the calculated statistic and interpretation of the p-value.

For this lesson, you must list three other statistical hypothesis tests that can be used to check for differences between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover estimation statistics as an alternative to statistical hypothesis testing.

In this lesson, you will discover estimation statistics that may be used as an alternative to statistical hypothesis tests.

Statistical hypothesis tests can be used to indicate whether the difference between two samples is due to random chance, but cannot comment on the size of the difference.

A group of methods referred to as “*new statistics*” are seeing increased use instead of or in addition to p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated values. This group of statistical methods is referred to as estimation statistics.

Estimation statistics is a term to describe three main classes of methods. The three main

classes of methods include:

**Effect Size**. Methods for quantifying the size of an effect given a treatment or intervention.**Interval Estimation**. Methods for quantifying the amount of uncertainty in a value.**Meta-Analysis**. Methods for quantifying the findings across multiple similar studies.

Of the three, perhaps the most useful methods in applied machine learning are interval estimation.

There are three main types of intervals. They are:

**Tolerance Interval**: The bounds or coverage of a proportion of a distribution with a specific level of confidence.**Confidence Interval**: The bounds on the estimate of a population parameter.**Prediction Interval**: The bounds on a single observation.

A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.

This can be implemented in Python using the *confint()* Statsmodels function.

The function takes the count of successes (or failures), the total number of trials, and the significance level as arguments and returns the lower and upper bound of the confidence interval.

The example below demonstrates this function in a hypothetical case where a model made 88 correct predictions out of a dataset with 100 instances and we are interested in the 95% confidence interval (provided to the function as a significance of 0.05).

# calculate the confidence interval from statsmodels.stats.proportion import proportion_confint # calculate the interval lower, upper = proportion_confint(88, 100, 0.05) print('lower=%.3f, upper=%.3f' % (lower, upper))

Run the example and review the confidence interval on the estimated accuracy.

For this lesson, you must list two methods for calculating the effect size in applied machine learning and when they might be useful.

As a hint, consider one for the relationship between variables and one for the difference between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover nonparametric statistical methods.

In this lesson, you will discover statistical methods that may be used when your data does not come from a Gaussian distribution.

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Data in which the distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

Before a nonparametric statistical method can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its name suggests.

The procedure is as follows:

- Sort all data in the sample in ascending order.
- Assign an integer rank from 1 to N for each unique value in the data sample.

A widely used nonparametric statistical hypothesis test for checking for a difference between two independent samples is the Mann-Whitney U test, named for Henry Mann and Donald Whitney.

It is the nonparametric equivalent of the Student’s t-test but does not assume that the data is drawn from a Gaussian distribution.

The test can be implemented in Python via the *mannwhitneyu()* SciPy function.

The example below demonstrates the test on two data samples drawn from a uniform distribution known to be different.

# example of the mann-whitney u test from numpy.random import seed from numpy.random import rand from scipy.stats import mannwhitneyu # seed the random number generator seed(1) # generate two independent samples data1 = 50 + (rand(100) * 10) data2 = 51 + (rand(100) * 10) # compare samples stat, p = mannwhitneyu(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distribution (fail to reject H0)') else: print('Different distribution (reject H0)')

Run the example and review the calculated statistics and interpretation of the p-value.

For this lesson, you must list three additional nonparametric statistical methods.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of statistics in applied machine learning.
- A concise definition of statistics and a division of methods into two main types.
- The Gaussian distribution and how to describe data with this distribution using statistics.
- How to quantify the relationship between the samples of two variables.
- How to check for the difference between two samples using statistical hypothesis tests.
- An alternative to statistical hypothesis tests called estimation statistics.
- Nonparametric methods that can be used when data is not drawn from the Gaussian distribution.

This is just the beginning of your journey with statistics for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Statistical Methods for Machine Learning.

How did you do with the mini-course?

Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>The post Why Initialize a Neural Network with Random Weights? appeared first on Machine Learning Mastery.

]]>This is because this is an expectation of the stochastic optimization algorithm used to train the model, called stochastic gradient descent.

To understand this approach to problem solving, you must first understand the role of nondeterministic and randomized algorithms as well as the need for stochastic optimization algorithms to harness randomness in their search process.

In this post, you will discover the full background as to why neural network weights must be randomly initialized.

After reading this post, you will know:

- About the need for nondeterministic and randomized algorithms for challenging problems.
- The use of randomness during initialization and search in stochastic optimization algorithms.
- That stochastic gradient descent is a stochastic optimization algorithm and requires the random initialization of network weights.

Let’s get started.

This post is divided into 4 parts; they are:

- Deterministic and Non-Deterministic Algorithms
- Stochastic Search Algorithms
- Random Initialization in Neural Networks
- Initialization Methods

Classical algorithms are deterministic.

An example is an algorithm to sort a list.

Given an unsorted list, the sorting algorithm, say bubble sort or quick sort, will systematically sort the list until you have an ordered result. Deterministic means that each time the algorithm is given the same list, it will execute in exactly the same way. It will make the same moves at each step of the procedure.

Deterministic algorithms are great as they can make guarantees about best, worst, and average running time. The problem is, they are not suitable for all problems.

Some problems are hard for computers. Perhaps because of the number of combinations; perhaps because of the size of data. They are so hard because a deterministic algorithm cannot be used to solve them efficiently. The algorithm may run, but will continue running until the heat death of the universe.

An alternate solution is to use nondeterministic algorithms. These are algorithms that use elements of randomness when making decisions during the execution of the algorithm. This means that a different order of steps will be followed when the same algorithm is rerun on the same data.

They can rapidly speed up the process of getting a solution, but the solution will be approximate, or “*good*,” but often not the “*best*.” Nondeterministic algorithms often cannot make strong guarantees about running time or the quality of the solution found.

This is often fine as the problems are so hard that any good solution will often be satisfactory.

Search problems are often very challenging and require the use of nondeterministic algorithms that make heavy use of randomness.

The algorithms are not random per se; instead they make careful use of randomness. They are random within a bound and are referred to as stochastic algorithms.

The incremental, or step-wise, nature of the search often means the process and the algorithms are referred to as an optimization from an initial state or position to a final state or position. For example, stochastic optimization problem or a stochastic optimization algorithm.

Some examples include the genetic algorithm, simulated annealing, and stochastic gradient descent.

The search process is incremental from a starting point in the space of possible solutions toward some good enough solution.

They share common features in their use of randomness, such as:

- Use of randomness during initialization.
- Use of randomness during the progression of the search.

We know nothing about the structure of the search space. Therefore, to remove bias from the search process, we start from a randomly chosen position.

As the search process unfolds, there is a risk that we are stuck in an unfavorable area of the search space. Using randomness during the search process gives some likelihood of getting unstuck and finding a better final candidate solution.

The idea of getting stuck and returning a less-good solution is referred to as getting stuck in a local optima.

These two elements of random initialization and randomness during the search work together.

They work together better if we consider any solution found by the search as provisional, or a candidate, and that the search process can be performed multiple times.

This gives the stochastic search process multiple opportunities to start and traverse the space of candidate solutions in search of a better candidate solution–a so-called global optima.

The navigation of the space of candidate solutions is often described using the analogy of a one- or two-landscape of mountains and valleys (e.g. like a fitness landscape). If we are maximizing a score during the search, we can think of small hills in the landscape as a local optima and the largest hills as the global optima.

This is a fascinating area of research, an area where I have some background. For example, see my book:

Artificial neural networks are trained using a stochastic optimization algorithm called stochastic gradient descent.

The algorithm uses randomness in order to find a good enough set of weights for the specific mapping function from inputs to outputs in your data that is being learned. It means that your specific network on your specific training data will fit a different network with a different model skill each time the training algorithm is run.

This is a feature, not a bug.

I write about this issue more in the post:

As described in the previous section, stochastic optimization algorithms such as stochastic gradient descent use randomness in selecting a starting point for the search and in the progression of the search.

Specifically, stochastic gradient descent requires that the weights of the network are initialized to small random values (random, but close to zero, such as in [0.0, 0.1]). Randomness is also used during the search process in the shuffling of the training dataset prior to each epoch, which in turn results in differences in the gradient estimate for each batch.

You can learn more about stochastic gradient descent in this post:

The progression of the search or learning of a neural network is referred to as convergence. The discovering of a sub-optimal solution or local optima is referred to as premature convergence.

Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization.

— Page 301, Deep Learning, 2016.

The most effective way to evaluate the skill of a neural network configuration is to repeat the search process multiple times and report the average performance of the model over those repeats. This gives the configuration the best chance to search the space from multiple different sets of initial conditions. Sometimes this is called a multiple restart or multiple-restart search.

You can learn more about the effective evaluation of neural networks in this post:

We can use the same set of weights each time we train the network; for example, you could use the values of 0.0 for all weights.

In this case, the equations of the learning algorithm would fail to make any changes to the network weights, and the model will be stuck. It is important to note that the bias weight in each neuron is set to zero by default, not a small random value.

Specifically, nodes that are side-by-side in a hidden layer connected to the same inputs must have different weights for the learning algorithm to update the weights.

This is often referred to as the need to break symmetry during training.

Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way.

— Page 301, Deep Learning, 2016.

We could use the same set of random numbers each time the network is trained.

This would not be helpful when evaluating network configurations.

It may be helpful in order to train the same final set of network weights given a training dataset in the case where a model is being used in a production environment.

You can learn more about fixing the random seed for neural networks developed with Keras in this post:

Traditionally, the weights of a neural network were set to small random numbers.

The initialization of the weights of neural networks is a whole field of study as the careful initialization of the network can speed up the learning process.

Modern deep learning libraries, such as Keras, offer a host of network initialization methods, all are variations of initializing the weights with small random numbers.

For example, the current methods are available in Keras at the time of writing for all network types:

**Zeros**: Initializer that generates tensors initialized to 0.**Ones**: Initializer that generates tensors initialized to 1.**Constant**: Initializer that generates tensors initialized to a constant value.**RandomNormal**: Initializer that generates tensors with a normal distribution.**RandomUniform**: Initializer that generates tensors with a uniform distribution.**TruncatedNormal**: Initializer that generates a truncated normal distribution.**VarianceScaling**: Initializer capable of adapting its scale to the shape of weights.**Orthogonal**: Initializer that generates a random orthogonal matrix.**Identity**: Initializer that generates the identity matrix.**lecun_uniform**: LeCun uniform initializer.**glorot_normal**: Glorot normal initializer, also called Xavier normal initializer.**glorot_uniform**: Glorot uniform initializer, also called Xavier uniform initializer.**he_normal**: He normal initializer.**lecun_normal**: LeCun normal initializer.**he_uniform**: He uniform variance scaling initializer.

See the documentation for more details.

Out of interest, the default initializers chosen by Keras developers for different layer types are as follows:

**Dense**(e.g. MLP):*glorot_uniform***LSTM**:*glorot_uniform***CNN**:*glorot_uniform*

You can learn more about “*glorot_uniform*“, also called “*Xavier normal*“, named for the developer of the method Xavier Glorot, in the paper:

There is no single best way to initialize the weights of a neural network.

Modern initialization strategies are simple and heuristic. Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood. […] Our understanding of how the initial point affects generalization is especially primitive, offering little to no guidance for how to select the initial point.

— Page 301, Deep Learning, 2016.

It is one more hyperparameter for you to explore and test and experiment with on your specific predictive modeling problem.

Do you have a favorite method for weight initialization?

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.

- Nondeterministic algorithm on Wikipedia
- Randomized algorithm on Wikipedia
- Stochastic optimization on Wikipedia
- Stochastic gradient descent on Wikipedia
- Fitness landscape on Wikipedia
- Neural Network FAQ
- Keras Weight Initialization
- Understanding the difficulty of training deep feedforward neural networks, 2010.

- What are good initial weights in a neural network?
- Why should weights of Neural Networks be initialized to random numbers?
- What are good initial weights in a neural network?

In this post, you discovered why neural network weights must be randomly initialized.

Specifically, you learned:

- About the need for nondeterministic and randomized algorithms for challenging problems.
- The use of randomness during initialization and search in stochastic optimization algorithms.
- That stochastic gradient descent is a stochastic optimization algorithm and requires the random initialization of network weights.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Why Initialize a Neural Network with Random Weights? appeared first on Machine Learning Mastery.

]]>The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>Because you may use this test yourself someday, it is important to have a deep understanding of how the test works. As a developer, this understanding is best achieved by implementing the hypothesis test yourself from scratch.

In this tutorial, you will discover how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

After completing this tutorial, you will know:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Student’s t-Test
- Student’s t-Test for Independent Samples
- Student’s t-Test for Dependent Samples

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Student’s t-Test is a statistical hypothesis test for testing whether two samples are expected to have been drawn from the same population.

It is named for the pseudonym “*Student*” used by William Gosset, who developed the test.

The test works by checking the means from two samples to see if they are significantly different from each other. It does this by calculating the standard error in the difference between means, which can be interpreted to see how likely the difference is, if the two samples have the same mean (the null hypothesis).

The t statistic calculated by the test can be interpreted by comparing it to critical values from the t-distribution. The critical value can be calculated using the degrees of freedom and a significance level with the percent point function (PPF).

We can interpret the statistic value in a two-tailed test, meaning that if we reject the null hypothesis, it could be because the first mean is smaller or greater than the second mean. To do this, we can calculate the absolute value of the test statistic and compare it to the positive (right tailed) critical value, as follows:

**If abs(t-statistic) <= critical value**: Accept null hypothesis that the means are equal.**If abs(t-statistic) > critical value**: Reject the null hypothesis that the means are equal.

We can also retrieve the cumulative probability of observing the absolute value of the t-statistic using the cumulative distribution function (CDF) of the t-distribution in order to calculate a p-value. The p-value can then be compared to a chosen significance level (alpha) such as 0.05 to determine if the null hypothesis can be rejected:

**If p > alpha**: Accept null hypothesis that the means are equal.**If p <= alpha**: Reject null hypothesis that the means are equal.

In working with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. The test also assumes that the samples have the same variance, and the same size, although there are corrections to the test if these assumptions do not hold. For example, see Welch’s t-test.

There are two main versions of Student’s t-test:

**Independent Samples**. The case where the two samples are unrelated.**Dependent Samples**. The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

Both the independent and the dependent Student’s t-tests are available in Python via the ttest_ind() and ttest_rel() SciPy functions respectively.

**Note**: I recommend using these SciPy functions to calculate the Student’s t-test for your applications, if they are suitable. The library implementations will be faster and less prone to bugs. I would only recommend implementing the test yourself for learning purposes or in the case where you require a modified version of the test.

We will use the SciPy functions to confirm the results from our own version of the tests.

Note, for reference, all calculations presented in this tutorial are taken directly from Chapter 9 “*t Tests*” in “Statistics in Plain English“, Third Edition, 2010. I mention this because you may see the equations with different forms, depending on the reference text that you use.

We’ll start with the most common form of the Student’s t-test: the case where we are comparing the means of two independent samples.

The calculation of the t-statistic for two independent samples is as follows:

t = observed difference between sample means / standard error of the difference between the means

or

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

The standard error of the difference between the means can be calculated as follows:

sed = sqrt(se1^2 + se2^2)

Where *se1* and *se2* are the standard errors for the first and second datasets.

The standard error of a sample can be calculated as:

se = std / sqrt(n)

Where *se* is the standard error of the sample, *std* is the sample standard deviation, and *n* is the number of observations in the sample.

These calculations make the following assumptions:

- The samples are drawn from a Gaussian distribution.
- The size of each sample is approximately equal.
- The samples have the same variance.

We can implement these equations easily using functions from the Python standard library, NumPy and SciPy.

Let’s assume that our two data samples are stored in the variables *data1* and *data2*.

We can start off by calculating the mean for these samples as follows:

# calculate means mean1, mean2 = mean(data1), mean(data2)

We’re halfway there.

Now we need to calculate the standard error.

We can do this manually, first by calculating the sample standard deviations:

# calculate sample standard deviations std1, std2 = std(data1, ddof=1), std(data2, ddof=1)

And then the standard errors:

# calculate standard errors n1, n2 = len(data1), len(data2) se1, se2 = std1/sqrt(n1), std2/sqrt(n2)

Alternately, we can use the *sem()* SciPy function to calculate the standard error directly.

# calculate standard errors se1, se2 = sem(data1), sem(data2)

We can use the standard errors of the samples to calculate the “*standard error on the difference between the samples*“:

# standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0)

We can now calculate the t statistic:

# calculate the t statistic t_stat = (mean1 - mean2) / sed

We can also calculate some other values to help interpret and present the statistic.

The number of degrees of freedom for the test is calculated as the sum of the observations in both samples, minus two.

# degrees of freedom df = n1 + n2 - 2

The critical value can be calculated using the percent point function (PPF) for a given significance level, such as 0.05 (95% confidence).

This function is available for the t distribution in SciPy, as follows:

# calculate the critical value alpha = 0.05 cv = t.ppf(1.0 - alpha, df)

The p-value can be calculated using the cumulative distribution function on the t-distribution, again in SciPy.

# calculate the p-value p = (1 - t.cdf(abs(t_stat), df)) * 2

Here, we assume a two-tailed distribution, where the rejection of the null hypothesis could be interpreted as the first mean is either smaller or larger than the second mean.

We can tie all of these pieces together into a simple function for calculating the t-test for two independent samples:

# function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section we will calculate the t-test on some synthetic data samples.

First, let’s generate two samples of 100 Gaussian random numbers with the same variance of 5 and differing means of 50 and 51 respectively. We will expect the test to reject the null hypothesis and find a significant difference between the samples:

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

We can calculate the t-test on these samples using the built in SciPy function *ttest_ind()*. This will give us a t-statistic value and a p-value to compare to, to ensure that we have implemented the test correctly.

The complete example is listed below.

# Student's t-test for independent samples from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('t=%.3f, p=%.3f' % (stat, p))

Running the example, we can see a t-statistic value and p value.

We will use these as our expected values for the test on these data.

t=-2.262, p=0.025

We can now apply our own implementation on the same data, using the function defined in the previous section.

The function will return a t-statistic value and a critical value. We can use the critical value to interpret the t statistic to see if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

The function also returns a p-value. We can interpret the p-value using an alpha, such as 0.05 to determine if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

We expect that both interpretations will always match.

The complete example is listed below.

# t-test for independent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import sem from scipy.stats import t # function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = independent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example first calculates the test.

The results of the test are printed, including the t-statistic, the degrees of freedom, the critical value, and the p-value.

We can see that both the t-statistic and p-value match the outputs of the SciPy function. The test appears to be implemented correctly.

The t-statistic and the p-value are then used to interpret the results of the test. We find that as we expect, there is sufficient evidence to reject the null hypothesis, finding that the sample means are likely different.

t=-2.262, df=198, cv=1.653, p=0.025 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

We can now look at the case of calculating the Student’s t-test for dependent samples.

This is the case where we collect some observations on a sample from the population, then apply some treatment, and then collect observations from the same sample.

The result is two samples of the same size where the observations in each sample are related or paired.

The t-test for dependent samples is referred to as the paired Student’s t-test.

The calculation of the paired Student’s t-test is similar to the case with independent samples.

The main difference is in the calculation of the denominator.

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

Here, *sed* is calculated as:

sed = sd / sqrt(n)

Where *sd* is the standard deviation of the difference between the dependent sample means and *n* is the total number of paired observations (e.g. the size of each sample).

The calculation of *sd* first requires the calculation of the sum of the squared differences between the samples:

d1 = sum (X1[i] - X2[i])^2 for i in n

It also requires the sum of the (non squared) differences between the samples:

d2 = sum (X1[i] - X2[i]) for i in n

We can then calculate sd as:

sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

That’s it.

We can implement the calculation of the paired Student’s t-test directly in Python.

The first step is to calculate the means of each sample.

# calculate means mean1, mean2 = mean(data1), mean(data2)

Next, we will require the number of pairs (*n*). We will use this in a few different calculations.

# number of paired samples n = len(data1)

Next, we must calculate the sum of the squared differences between the samples, as well as the sum differences.

# sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)])

We can now calculate the standard deviation of the difference between means.

# standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

This is then used to calculate the standard error of the difference between the means.

# standard error of the difference between the means sed = sd / sqrt(n)

Finally, we have everything we need to calculate the t statistic.

# calculate the t statistic t_stat = (mean1 - mean2) / sed

The only other key difference between this implementation and the implementation for independent samples is the calculation of the number of degrees of freedom.

# degrees of freedom df = n - 1

As before, we can tie all of this together into a reusable function. The function will take two paired samples and a significance level (alpha) and calculate the t-statistic, number of degrees of freedom, critical value, and p-value.

The complete function is listed below.

# function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section, we will use the same dataset in the worked example as we did for the independent Student’s t-test.

The data samples are not paired, but we will pretend they are. We expect the test to reject the null hypothesis and find a significant difference between the samples.

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

As before, we can evaluate the test problem with the SciPy function for calculating a paired t-test. In this case, the *ttest_rel()* function.

The complete example is listed below.

# Paired Student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_rel # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_rel(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p))

Running the example calculates and prints the t-statistic and the p-value.

We will use these values to validate the calculation of our own paired t-test function.

Statistics=-2.372, p=0.020

We can now test our own implementation of the paired Student’s t-test.

The complete example, including the developed function and interpretation of the results of the function, is listed below.

# t-test for dependent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import t # function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples (pretend they are dependent) data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = dependent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example calculates the paired t-test on the sample problem.

The calculated t-statistic and p-value match what we expect from the SciPy library implementation. This suggests that the implementation is correct.

The interpretation of the t-test statistic with the critical value, and the p-value with the significance level both find a significant result, rejecting the null hypothesis that the means are equal.

t=-2.372, df=99, cv=1.660, p=0.020 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Apply each test to your own contrived sample problem.
- Update the independent test and add the correction for samples with different variances and sample sizes.
- Perform a code review of one of the tests implemented in the SciPy library and summarize the differences in the implementation details.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Statistics in Plain English, Third Edition, 2010.

In this tutorial, you discovered how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

Specifically, you learned:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>