The post Hypothesis Test for Comparing Machine Learning Algorithms appeared first on Machine Learning Mastery.

]]>The algorithm with the best mean performance is expected to be better than those algorithms with worse mean performance. But what if the difference in the mean performance is caused by a statistical fluke?

The solution is to use a **statistical hypothesis test** to evaluate whether the difference in the mean performance between any two algorithms is real or not.

In this tutorial, you will discover how to use statistical hypothesis tests for comparing machine learning algorithms.

After completing this tutorial, you will know:

- Performing model selection based on the mean model performance can be misleading.
- The five repeats of two-fold cross-validation with a modified Student’s t-Test is a good practice for comparing machine learning algorithms.
- How to use the MLxtend machine learning to compare algorithms using a statistical hypothesis test.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Hypothesis Test for Comparing Algorithms
- 5×2 Procedure With MLxtend
- Comparing Classifier Algorithms

Model selection involves evaluating a suite of different machine learning algorithms or modeling pipelines and comparing them based on their performance.

The model or modeling pipeline that achieves the best performance according to your performance metric is then selected as your final model that you can then use to start making predictions on new data.

This applies to regression and classification predictive modeling tasks with classical machine learning algorithms and deep learning. It’s always the same process.

The problem is, how do you know the difference between two models is real and not just a statistical fluke?

This problem can be addressed using a statistical hypothesis test.

One approach is to evaluate each model on the same k-fold cross-validation split of the data (e.g. using the same random number seed to split the data in each case) and calculate a score for each split. This would give a sample of 10 scores for 10-fold cross-validation. The scores can then be compared using a paired statistical hypothesis test because the same treatment (rows of data) was used for each algorithm to come up with each score. The Paired Student’s t-Test could be used.

A problem with using the Paired Student’s t-Test, in this case, is that each evaluation of the model is not independent. This is because the same rows of data are used to train the data multiple times — actually, each time, except for the time a row of data is used in the hold-out test fold. This lack of independence in the evaluation means that the Paired Student’s t-Test is optimistically biased.

This statistical test can be adjusted to take the lack of independence into account. Additionally, the number of folds and repeats of the procedure can be configured to achieve a good sampling of model performance that generalizes well to a wide range of problems and algorithms. Specifically two-fold cross-validation with five repeats, so-called 5×2-fold cross-validation.

This approach was proposed by Thomas Dietterich in his 1998 paper titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.”

For more on this topic, see the tutorial:

Thankfully, we don’t need to implement this procedure ourselves.

The MLxtend library by Sebastian Raschka provides an implementation via the *paired_ttest_5x2cv()* function.

First, you must install the mlxtend library, for example:

sudo pip install mlxtend

To use the evaluation, you must first load your dataset, then define the two models that you wish to compare.

... # load data X, y = .... # define models model1 = ... model2 = ...

You can then call the *paired_ttest_5x2cv()* function and pass in your data and models and it will report the t-statistic value and the p-value as to whether the difference in the performance of the two algorithms is significant or not.

... # compare algorithms t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y)

The p-value must be interpreted using an alpha value, which is the significance level that you are willing to accept.

If the p-value is less or equal to the chosen alpha, we reject the null hypothesis that the models have the same mean performance, which means the difference is probably real. If the p-value is greater than alpha, we fail to reject the null hypothesis that the models have the same mean performance and any observed difference in the mean accuracies is probability a statistical fluke.

The smaller the alpha value, the better, and a common value is 5 percent (0.05).

... # interpret the result if p <= 0.05: print('Difference between mean performance is probably real') else: print('Algorithms probably have the same performance')

Now that we are familiar with the way to use a hypothesis test to compare algorithms, let’s look at some examples.

In this section, let’s compare the performance of two machine learning algorithms on a binary classification task, then check if the observed difference is statistically significant or not.

First, we can use the make_classification() function to create a synthetic dataset with 1,000 samples and 20 input variables.

The example below creates the dataset and summarizes its shape.

# create classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the number of rows and columns, confirming our expectations.

We can use this data as the basis for comparing two algorithms.

(1000, 10) (1000,)

We will compare the performance of two linear algorithms on this dataset. Specifically, a logistic regression algorithm and a linear discriminant analysis (LDA) algorithm.

The procedure I like is to use repeated stratified k-fold cross-validation with 10 folds and three repeats. We will use this procedure to evaluate each algorithm and return and report the mean classification accuracy.

The complete example is listed below.

# compare logistic regression and lda for binary classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # evaluate model 1 model1 = LogisticRegression() cv1 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1) print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores1), std(scores1))) # evaluate model 2 model2 = LinearDiscriminantAnalysis() cv2 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1) print('LinearDiscriminantAnalysis Mean Accuracy: %.3f (%.3f)' % (mean(scores2), std(scores2))) # plot the results pyplot.boxplot([scores1, scores2], labels=['LR', 'LDA'], showmeans=True) pyplot.show()

Running the example first reports the mean classification accuracy for each algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that LDA has better performance if we just look at the mean scores: 89.2 percent for logistic regression and 89.3 percent for LDA.

LogisticRegression Mean Accuracy: 0.892 (0.036) LinearDiscriminantAnalysis Mean Accuracy: 0.893 (0.033)

A box and whisker plot is also created summarizing the distribution of accuracy scores.

This plot would support my decision in choosing LDA over LR.

Now we can use a hypothesis test to see if the observed results are statistically significant.

First, we will use the 5×2 procedure to evaluate the algorithms and calculate a p-value and test statistic value.

... # check if difference between algorithms is real t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1) # summarize print('P-value: %.3f, t-Statistic: %.3f' % (p, t))

We can then interpret the p-value using an alpha of 5 percent.

... # interpret the result if p <= 0.05: print('Difference between mean performance is probably real') else: print('Algorithms probably have the same performance')

Tying this together, the complete example is listed below.

# use 5x2 statistical hypothesis testing procedure to compare two machine learning algorithms from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from mlxtend.evaluate import paired_ttest_5x2cv # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # evaluate model 1 model1 = LogisticRegression() cv1 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1) print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores1), std(scores1))) # evaluate model 2 model2 = LinearDiscriminantAnalysis() cv2 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1) print('LinearDiscriminantAnalysis Mean Accuracy: %.3f (%.3f)' % (mean(scores2), std(scores2))) # check if difference between algorithms is real t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1) # summarize print('P-value: %.3f, t-Statistic: %.3f' % (p, t)) # interpret the result if p <= 0.05: print('Difference between mean performance is probably real') else: print('Algorithms probably have the same performance')

Running the example, we first evaluate the algorithms before, then report on the result of the statistical hypothesis test.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the p-value is about 0.3, which is much larger than 0.05. This leads us to fail to reject the null hypothesis, suggesting that any observed difference between the algorithms is probably not real.

We could just as easily choose logistic regression or LDA and both would perform about the same on average.

This highlights that performing model selection based only on the mean performance may not be sufficient.

LogisticRegression Mean Accuracy: 0.892 (0.036) LinearDiscriminantAnalysis Mean Accuracy: 0.893 (0.033) P-value: 0.328, t-Statistic: 1.085 Algorithms probably have the same performance

Recall that we are reporting performance using a different procedure (3×10 CV) than the procedure used to estimate the performance in the statistical test (5×2 CV). Perhaps results would be different if we looked at scores using five repeats of two-fold cross-validation?

The example below is updated to report classification accuracy for each algorithm using 5×2 CV.

# use 5x2 statistical hypothesis testing procedure to compare two machine learning algorithms from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from mlxtend.evaluate import paired_ttest_5x2cv # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # evaluate model 1 model1 = LogisticRegression() cv1 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1) scores1 = cross_val_score(model1, X, y, scoring='accuracy', cv=cv1, n_jobs=-1) print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores1), std(scores1))) # evaluate model 2 model2 = LinearDiscriminantAnalysis() cv2 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1) scores2 = cross_val_score(model2, X, y, scoring='accuracy', cv=cv2, n_jobs=-1) print('LinearDiscriminantAnalysis Mean Accuracy: %.3f (%.3f)' % (mean(scores2), std(scores2))) # check if difference between algorithms is real t, p = paired_ttest_5x2cv(estimator1=model1, estimator2=model2, X=X, y=y, scoring='accuracy', random_seed=1) # summarize print('P-value: %.3f, t-Statistic: %.3f' % (p, t)) # interpret the result if p <= 0.05: print('Difference between mean performance is probably real') else: print('Algorithms probably have the same performance')

Running the example reports the mean accuracy for both algorithms and the results of the statistical test.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the difference in the mean performance for the two algorithms is even larger, 89.4 percent vs. 89.0 percent in favor of logistic regression instead of LDA as we saw with 3×10 CV.

LogisticRegression Mean Accuracy: 0.894 (0.012) LinearDiscriminantAnalysis Mean Accuracy: 0.890 (0.013) P-value: 0.328, t-Statistic: 1.085 Algorithms probably have the same performance

This section provides more resources on the topic if you are looking to go deeper.

- 17 Statistical Hypothesis Tests in Python (Cheat Sheet)
- How to Calculate Parametric Statistical Hypothesis Tests in Python
- Statistical Significance Tests for Comparing Machine Learning Algorithms

In this tutorial, you discovered how to use statistical hypothesis tests for comparing machine learning algorithms.

Specifically, you learned:

- Performing model selection based on the mean model performance can be misleading.
- The five repeats of two-fold cross-validation with a modified Student’s t-Test is a good practice for comparing machine learning algorithms.
- How to use the MLxtend machine learning to compare algorithms using a statistical hypothesis test.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Hypothesis Test for Comparing Machine Learning Algorithms appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Degrees of Freedom in Machine Learning appeared first on Machine Learning Mastery.

]]>It is often employed to summarize the number of values used in the calculation of a statistic, such as a sample statistic or in a statistical hypothesis test.

In machine learning, the degrees of freedom may refer to the number of parameters in the model, such as the number of coefficients in a linear regression model or the number of weights in a deep learning neural network.

The concern is that if there are more degrees of freedom (model parameters) in machine learning, then the model is expected to overfit the training dataset. This is the common understanding from statistics. This expectation can be overcome through the use of regularization techniques, such as regularization linear regression and the suite of regularization methods available for deep learning neural network models.

In this post, you will discover degrees of freedom in statistics and machine learning.

After reading this post, you will know:

- Degrees of freedom generally represents the number of points of control of a system.
- In statistics, degrees of freedom is the number of observations used to calculate a statistic.
- In machine learning, degrees of freedom is the number of parameters of a model.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Degrees of Freedom
- Degrees of Freedom in Statistics
- Degrees of Freedom in Machine Learning
- Degrees of Freedom for a Linear Regression Model
- Degrees of Freedom for Linear Regression Error
- Total Degrees of Freedom for Linear Regression
- Negative Degrees of Freedom
- Degrees of Freedom and Overfitting

Degrees of freedom represent the number of points of control of a system, model, or calculation.

Each independent parameter that can change is a separate dimension in a d-dimensional space that defines the scope of values that may influence the system, where the specific observed or specified values are a single point in that space.

Mathematically, the degrees of freedom is often represented using the Greek letter nu, which looks like a lower-case “v”.

It may also be abbreviated as “d.o.f,” “dof,” “d.f.,” or simply “df.”

Degrees of freedom is a term from statistics and engineering and may be used in machine learning.

In statistics, the degrees of freedom is the number of values used in the calculation of a statistic that can change.

Degrees of freedom: Roughly, the minimum amount of data needed to calculate a statistic. More practically, it is a number, or numbers, used to approximate the number of observations in the data set for the purpose of determining statistical significance.

— Page 60, Statistics in Plain English, 3rd Edition, 2010.

It is calculated as the number of independent values used in the calculation of the statistic minus the number of statistics calculated.

- degrees of freedom = number of independent values – number of statistics

For example, we may have 50 independent samples and we wish to calculate a statistic of the sample, like the mean. All 50 samples are used in the calculation and there is one statistic, so the number of degrees of freedom for the mean, in this case, is calculated as:

- degrees of freedom = number of independent values – number of statistics
- degrees of freedom = 50 – 1
- degrees of freedom = 49

Degrees of freedom is often an important consideration in data distributions and statistical hypothesis tests. For example, it used to be common to have tables of statistical test critical values calculated for different common degrees of freedom (before calculating the statistic directly was easy and common).

So far, so good, but what about a model fit from data, such as in machine learning?

In predictive modeling, the degrees of freedom often refers to the number of parameters in the model that are estimated from data.

This can also include both the coefficients of the model and the data used in the calculation of the error of the model.

The best case for understanding this is with a linear regression model.

Consider a linear regression model for a dataset that has two input variables.

We will require one coefficient in the model for each of the input variables, e.g. the model will have two parameters.

This model looks as follows, where *x1* and *x2* are the input variables and *beta1* and *beta2* are the model parameters.

- yhat = x1 * beta1 + x2 * beta2

This linear regression model has two degrees of freedom because there are two parameters in the model that must be estimated from a training dataset. Adding one more column to the data (one more input variable) would add one more degree of freedom for the model.

- model degrees of freedom = number of parameters estimated from data

It is common to describe the complexity of a model fit from data based on the number of parameters that were fit.

For example, the complexity of a linear regression model with two parameters is equal to the degrees of freedom, which in this case is 2. We often prefer lower complexity models over higher complexity models. Simpler models generalize better.

The degrees of freedom are an accounting of how many parameters are estimated by the model and, by extension, a measure of complexity for linear regression models.

— Page 71, Applied Predictive Modeling, 2013.

It’s not over yet.

The number of training examples matters and impacts the overall degrees of freedom for the regression model.

Consider that the coefficients of the linear regression model are fit using a training dataset with 100 rows or examples.

The model is fit by minimizing the error between the model predictions and the expected output values. The total error of the model has one degree of freedom for each example in the training dataset minus the number of parameters estimated from the data.

In this case, the model error has 100 minus 2 parameters from the model, or 98 degrees of freedom.

- model error degrees of freedom = number of observations – number of parameters
- model error degrees of freedom = 100 – 2
- model error degrees of freedom = 98

It is often good practice to report the error of a linear model, like linear regression, including the degrees of freedom of the error.

At the very least, the number of observations in the training data can be included so that the model error degrees of freedom can be determined.

The total degrees of freedom for the linear regression model is taken as the sum of the model degrees of freedom plus the model error degrees of freedom.

- linear regression degrees of freedom = model degrees of freedom + model error degrees of freedom
- linear regression degrees of freedom = 2 + 98
- linear regression degrees of freedom = 100

Generally, the degrees of freedom is equal to the number of rows of training data used to fit the model.

Consider a dataset with 100 rows of data as before, but now we have 70 input variables.

This means that the model has 70 coefficients or parameters fit from the data. The model error would therefore be 100 – 70, or 30 degrees of freedom.

The total degrees of freedom for the model is still equal to the number of rows, or 70 + 30.

What happens when we have more columns than rows of data?

For example, we may have 100 rows of data and 10,000 variables, such as gene markers for 100 patients.

A linear regression model would therefore have 10,000 parameters, meaning the model would have 10,000 degrees of freedom.

We can calculate the model error degrees of freedom as follows:

- model error degrees of freedom = number of observations – number of parameters
- model error degrees of freedom = 100 – 10,000
- model error degrees of freedom = -9,900

**Uh oh.**

And we can calculate the total degrees of freedom as follows:

- linear regression degrees of freedom = model degrees of freedom + model error degrees of freedom
- linear regression degrees of freedom = 10,000 + -9,900
- linear regression degrees of freedom = 100

The model has 100 total degrees of freedom, but the model error has a negative degrees of freedom.

A negative degree of freedom is valid.

It suggests that we have more statistics than we have values that can change. In this case, we have more parameters in the model than we have rows of data or observations to train the model.

This is a so-called *p >> n* or having many more predictors *p* than we do samples *n*.

The problem is that when we have more parameters than observations, there is a risk of overfitting the training dataset.

This is intuitive if we think of each coefficient in the model as a point of control. If we have more points of control in the model than we have observations, we can, in theory, configure the model to predict the training dataset correctly and exactly. Learning the details of the training dataset at the expense of performing well on new data is the definition of overfitting.

This is the general concern that statisticians have about deep learning neural network models.

That is, deep learning models often have many more parameters (model weights) than samples (e.g. billions of weights), and using our understanding of linear models, are expected to overfit.

Nevertheless, through careful selection of model architectures and regularization techniques, they can be prevented from overfitting and maintain low generalization error.

Further, in deep models, the effective degrees of freedom may be decoupled from the number of parameters in the model.

We showed that for simple classification models, degrees of freedom is equal to the number of parameters in the model. In deep networks, the degrees of freedom is generally much less than the number of parameters in the model, and deeper networks tend to have less degrees of freedom.

— Degrees of Freedom in Deep Neural Networks, 2016.

As such, there is a growing trend by statisticians and machine learning practitioners to move away from degrees of freedom for both a proxy for model complexity and as an expectation for overfitting.

To most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. […] We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly.

— Effective Degrees Of Freedom: A Flawed Metaphor, 2013.

This section provides more resources on the topic if you are looking to go deeper.

- Degrees of Freedom in Deep Neural Networks, 2016.
- Effective Degrees Of Freedom: A Flawed Metaphor, 2013.

- Statistics in Plain English, 3rd Edition, 2010.
- Applied Predictive Modeling, 2013.

In this post, you discovered degrees of freedom in statistics and machine learning.

Specifically, you learned:

- Degrees of freedom generally represents the number of points of control of a system.
- In statistics, degrees of freedom is the number of observations used to calculate a statistic.
- In machine learning, degrees of freedom is the number of parameters of a model.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Degrees of Freedom in Machine Learning appeared first on Machine Learning Mastery.

]]>The post Arithmetic, Geometric, and Harmonic Means for Machine Learning appeared first on Machine Learning Mastery.

]]>It is an operation you may use every day either directly, such as when summarizing data, or indirectly, such as a smaller step in a larger procedure when fitting a model.

The average is a synonym for the mean, a number that represents the most likely value from a probability distribution. As such, there are multiple different ways to calculate the mean based on the type of data that you’re working with.

This can trip you up if you use the wrong mean for your data. You may also enter some of these more exotic calculations of mean values when using performance metrics to evaluate your model, such as the G-mean or the F-Measure.

In this tutorial, you will discover the difference between the arithmetic mean, the geometric mean, and the harmonic mean.

After completing this tutorial, you will know:

- The central tendency summarizes the most likely value for a variable, and the average is the common name for the calculation of the mean.
- The arithmetic mean is appropriate if the values have the same units, whereas the geometric mean is appropriate if the values have differing units.
- The harmonic mean is appropriate if the data values are ratios of two variables with different measures, called rates.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is the Average?
- Arithmetic Mean
- Geometric Mean
- Harmonic Mean
- How to Choose the Correct Mean?

The central tendency is a single number that represents the most common value for a list of numbers.

More technically, it is the value that has the highest probability from the probability distribution that describes all possible values that a variable may have.

There are many ways to calculate the central tendency for a data sample, such as the **mean** which is calculated from the values, the **mode**, which is the most common value in the data distribution, or the **median**, which is the middle value if all values in the data sample were ordered.

The average is the common term for the mean. They can be used interchangeably.

The mean is different from the median and the mode in that it is a measure of the central tendency that is calculated from the data. As such, there are different ways to calculate the mean based on the type of data.

Three common types of mean calculations that you may encounter are the **arithmetic mean**, the **geometric mean**, and the **harmonic mean**. There are other means, and many more central tendency measures, but these three means are perhaps the most common (e.g. the so-called Pythagorean means).

Let’s take a closer look at each calculation of the mean in turn.

The arithmetic mean is calculated as the sum of the values divided by the total number of values, referred to as N.

- Arithmetic Mean = (x1 + x2 + … + xN) / N

A more convenient way to calculate the arithmetic mean is to calculate the sum of the values and to multiply it by the reciprocal of the number of values (1 over N); for example:

- Arithmetic Mean = (1/N) * (x1 + x2 + … + xN)

The arithmetic mean is appropriate when all values in the data sample have the same units of measure, e.g. all numbers are heights, or dollars, or miles, etc.

When calculating the arithmetic mean, the values can be positive, negative, or zero.

The arithmetic mean can be easily distorted if the sample of observations contains outliers (a few values far away in feature space from all other values), or for data that has a non-Gaussian distribution (e.g. multiple peaks, a so-called multi-modal probability distribution).

The arithmetic mean is useful in machine learning when summarizing a variable, e.g. reporting the most likely value. This is more meaningful when a variable has a Gaussian or Gaussian-like data distribution.

The arithmetic mean can be calculated using the mean() NumPy function.

The example below demonstrates how to calculate the arithmetic mean for a list of 10 numbers.

# example of calculating the arithmetic mean from numpy import mean # define the dataset data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # calculate the mean result = mean(data) print('Arithmetic Mean: %.3f' % result)

Running the example calculates the arithmetic mean and reports the result.

Arithmetic Mean: 4.500

The geometric mean is calculated as the N-th root of the product of all values, where N is the number of values.

- Geometric Mean = N-root(x1 * x2 * … * xN)

For example, if the data contains only two values, the square root of the product of the two values is the geometric mean. For three values, the cube-root is used, and so on.

The geometric mean is appropriate when the data contains values with different units of measure, e.g. some measure are height, some are dollars, some are miles, etc.

The geometric mean does not accept negative or zero values, e.g. all values must be positive.

One common example of the geometric mean in machine learning is in the calculation of the so-called G-Mean (geometric mean) metric that is a model evaluation metric that is calculated as the geometric mean of the sensitivity and specificity metrics.

The geometric mean can be calculated using the gmean() SciPy function.

The example below demonstrates how to calculate the geometric mean for a list of 10 numbers.

# example of calculating the geometric mean from scipy.stats import gmean # define the dataset data = [1, 2, 3, 40, 50, 60, 0.7, 0.88, 0.9, 1000] # calculate the mean result = gmean(data) print('Geometric Mean: %.3f' % result)

Running the example calculates the geometric mean and reports the result.

Geometric Mean: 7.246

The harmonic mean is calculated as the number of values *N* divided by the sum of the reciprocal of the values (1 over each value).

- Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)

If there are just two values (x1 and x2), a simplified calculation of the harmonic mean can be calculated as:

- Harmonic Mean = (2 * x1 * x2) / (x1 + x2)

The harmonic mean is the appropriate mean if the data is comprised of rates.

Recall that a rate is the ratio between two quantities with different measures, e.g. speed, acceleration, frequency, etc.

In machine learning, we have rates when evaluating models, such as the true positive rate or the false positive rate in predictions.

The harmonic mean does not take rates with a negative or zero value, e.g. all rates must be positive.

One common example of the use of the harmonic mean in machine learning is in the calculation of the F-Measure (also the F1-Measure or the Fbeta-Measure); that is a model evaluation metric that is calculated as the harmonic mean of the precision and recall metrics.

The harmonic mean can be calculated using the hmean() SciPy function.

The example below demonstrates how to calculate the harmonic mean for a list of nine numbers.

# example of calculating the harmonic mean from scipy.stats import hmean # define the dataset data = [0.11, 0.22, 0.33, 0.44, 0.55, 0.66, 0.77, 0.88, 0.99] # calculate the mean result = hmean(data) print('Harmonic Mean: %.3f' % result)

Running the example calculates the harmonic mean and reports the result.

Harmonic Mean: 0.350

We have reviewed three different ways of calculating the average or mean of a variable or dataset.

The arithmetic mean is the most commonly used mean, although it may not be appropriate in some cases.

Each mean is appropriate for different types of data; for example:

**If values have the same units**: Use the arithmetic mean.**If values have differing units**: Use the geometric mean.**If values are rates**: Use the harmonic mean.

The exceptions are if the data contains negative or zero values, then the geometric and harmonic means cannot be used directly.

This section provides more resources on the topic if you are looking to go deeper.

- Average, Wikipedia.
- Central tendency, Wikipedia.
- Arithmetic mean, Wikipedia.
- Geometric mean, Wikipedia.
- Harmonic mean, Wikipedia.

In this tutorial, you discovered the difference between the arithmetic mean, the geometric mean, and the harmonic mean.

Specifically, you learned:

- The central tendency summarizes the most likely value for a variable, and the average is the common name for the calculation of the mean.
- The arithmetic mean is appropriate if the values have the same units, whereas the geometric mean is appropriate if the values have differing units.
- The harmonic mean is appropriate if the data values are ratios of two variables with different measures, called rates.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Arithmetic, Geometric, and Harmonic Means for Machine Learning appeared first on Machine Learning Mastery.

]]>The post 17 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

- The name of the test.
- What the test is checking.
- The key assumptions of the test.
- How the test result is interpreted.
- Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Nov/2018**: Added a better overview of the tests covered.**Update Nov/2019**: Added complete working examples of each test. Add time series tests.

This tutorial is divided into 5 parts; they are:

**Normality Tests**- Shapiro-Wilk Test
- D’Agostino’s K^2 Test
- Anderson-Darling Test

**Correlation Tests**- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Chi-Squared Test

**Stationary Tests**- Augmented Dickey-Fuller
- Kwiatkowski-Phillips-Schmidt-Shin

**Parametric Statistical Hypothesis Tests**- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test

**Nonparametric Statistical Hypothesis Tests**- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

# Example of the Shapiro-Wilk Normality Test from scipy.stats import shapiro data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] stat, p = shapiro(data) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably Gaussian') else: print('Probably not Gaussian')

More Information

- A Gentle Introduction to Normality Tests in Python
- scipy.stats.shapiro
- Shapiro-Wilk test on Wikipedia

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

# Example of the D'Agostino's K^2 Normality Test from scipy.stats import normaltest data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] stat, p = normaltest(data) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably Gaussian') else: print('Probably not Gaussian')

More Information

- A Gentle Introduction to Normality Tests in Python
- scipy.stats.normaltest
- D’Agostino’s K-squared test on Wikipedia

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

# Example of the Anderson-Darling Normality Test from scipy.stats import anderson data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] result = anderson(data) print('stat=%.3f' % (result.statistic)) for i in range(len(result.critical_values)): sl, cv = result.significance_level[i], result.critical_values[i] if result.statistic < cv: print('Probably Gaussian at the %.1f%% level' % (sl)) else: print('Probably not Gaussian at the %.1f%% level' % (sl))

More Information

- A Gentle Introduction to Normality Tests in Python
- scipy.stats.anderson
- Anderson-Darling test on Wikipedia

This section lists statistical tests that you can use to check if two samples are related.

Tests whether two samples have a linear relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

# Example of the Pearson's Correlation test from scipy.stats import pearsonr data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579] stat, p = pearsonr(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably independent') else: print('Probably dependent')

More Information

- How to Calculate Correlation Between Variables in Python
- scipy.stats.pearsonr
- Pearson’s correlation coefficient on Wikipedia

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

# Example of the Spearman's Rank Correlation Test from scipy.stats import spearmanr data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579] stat, p = spearmanr(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably independent') else: print('Probably dependent')

More Information

- How to Calculate Nonparametric Rank Correlation in Python
- scipy.stats.spearmanr
- Spearman’s rank correlation coefficient on Wikipedia

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

# Example of the Kendall's Rank Correlation Test from scipy.stats import kendalltau data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579] stat, p = kendalltau(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably independent') else: print('Probably dependent')

More Information

- How to Calculate Nonparametric Rank Correlation in Python
- scipy.stats.kendalltau
- Kendall rank correlation coefficient on Wikipedia

Tests whether two categorical variables are related or independent.

Assumptions

- Observations used in the calculation of the contingency table are independent.
- 25 or more examples in each cell of the contingency table.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

# Example of the Chi-Squared Test from scipy.stats import chi2_contingency table = [[10, 20, 30],[6, 9, 17]] stat, p, dof, expected = chi2_contingency(table) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably independent') else: print('Probably dependent')

More Information

- A Gentle Introduction to the Chi-Squared Test for Machine Learning
- scipy.stats.chi2_contingency
- Chi-Squared test on Wikipedia

This section lists statistical tests that you can use to check if a time series is stationary or not.

Tests whether a time series has a unit root, e.g. has a trend or more generally is autoregressive.

Assumptions

- Observations in are temporally ordered.

Interpretation

- H0: a unit root is present (series is non-stationary).
- H1: a unit root is not present (series is stationary).

Python Code

# Example of the Augmented Dickey-Fuller unit root test from statsmodels.tsa.stattools import adfuller data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] stat, p, lags, obs, crit, t = adfuller(data) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably not Stationary') else: print('Probably Stationary')

More Information

- How to Check if Time Series Data is Stationary with Python
- statsmodels.tsa.stattools.adfuller API.
- Augmented Dickey–Fuller test, Wikipedia.

Tests whether a time series is trend stationary or not.

Assumptions

- Observations in are temporally ordered.

Interpretation

- H0: the time series is not trend-stationary.
- H1: the time series is trend-stationary.

Python Code

# Example of the Kwiatkowski-Phillips-Schmidt-Shin test from statsmodels.tsa.stattools import kpss data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] stat, p, lags, crit = kpss(data) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably not Stationary') else: print('Probably Stationary')

More Information

This section lists statistical tests that you can use to compare data samples.

Tests whether the means of two independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

# Example of the Student's t-test from scipy.stats import ttest_ind data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] stat, p = ttest_ind(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Parametric Statistical Hypothesis Tests in Python
- scipy.stats.ttest_ind
- Student’s t-test on Wikipedia

Tests whether the means of two paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

# Example of the Paired Student's t-test from scipy.stats import ttest_rel data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] stat, p = ttest_rel(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Parametric Statistical Hypothesis Tests in Python
- scipy.stats.ttest_rel
- Student’s t-test on Wikipedia

Tests whether the means of two or more independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

# Example of the Analysis of Variance Test from scipy.stats import f_oneway data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204] stat, p = f_oneway(data1, data2, data3) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Parametric Statistical Hypothesis Tests in Python
- scipy.stats.f_oneway
- Analysis of variance on Wikipedia

Tests whether the means of two or more paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

Currently not supported in Python.

More Information

- How to Calculate Parametric Statistical Hypothesis Tests in Python
- Analysis of variance on Wikipedia

Tests whether the distributions of two independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

# Example of the Mann-Whitney U Test from scipy.stats import mannwhitneyu data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] stat, p = mannwhitneyu(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Nonparametric Statistical Hypothesis Tests in Python
- scipy.stats.mannwhitneyu
- Mann-Whitney U test on Wikipedia

Tests whether the distributions of two paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

# Example of the Wilcoxon Signed-Rank Test from scipy.stats import wilcoxon data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] stat, p = wilcoxon(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Nonparametric Statistical Hypothesis Tests in Python
- scipy.stats.wilcoxon
- Wilcoxon signed-rank test on Wikipedia

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

# Example of the Kruskal-Wallis H Test from scipy.stats import kruskal data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] stat, p = kruskal(data1, data2) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Nonparametric Statistical Hypothesis Tests in Python
- scipy.stats.kruskal
- Kruskal-Wallis one-way analysis of variance on Wikipedia

Tests whether the distributions of two or more paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

# Example of the Friedman Test from scipy.stats import friedmanchisquare data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169] data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204] stat, p = friedmanchisquare(data1, data2, data3) print('stat=%.3f, p=%.3f' % (stat, p)) if p > 0.05: print('Probably the same distribution') else: print('Probably different distributions')

More Information

- How to Calculate Nonparametric Statistical Hypothesis Tests in Python
- scipy.stats.friedmanchisquare
- Friedman test on Wikipedia

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Normality Tests in Python
- How to Use Correlation to Understand the Relationship Between Variables
- How to Use Parametric Statistical Significance Tests in Python
- A Gentle Introduction to Statistical Hypothesis Tests

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

- The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
- The key assumptions for each test and how to interpret the test result.
- How to implement the test using the Python API.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Did I miss an important statistical test or key assumption for one of the listed tests?

Let me know in the comments below.

The post 17 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>Statistics is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although statistics is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what statistics is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement statistical methods used in machine learning with Python in seven days.

This is a big and important post. You might want to bookmark it.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You want to learn statistics to deepen your understanding and application of machine learning.

You do NOT need to know:

- You do not need to be a math wiz!
- You do not need to be a machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of statistical methods.

Note: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with statistics for machine learning in Python:

**Lesson 01**: Statistics and Machine Learning**Lesson 02**: Introduction to Statistics**Lesson 03**: Gaussian Distribution and Descriptive Stats**Lesson 04**: Correlation Between Variables**Lesson 05**: Statistical Hypothesis Tests**Lesson 06**: Estimation Statistics**Lesson 07**: Nonparametric Statistics

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the statistical methods and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Statistical Methods for Machine Learning.”

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this lesson, you will discover the five reasons why a machine learning practitioner should deepen their understanding of statistics.

Statistical methods are required in the preparation of train and test data for your machine learning model.

This includes techniques for:

- Outlier detection.
- Missing value imputation.
- Data sampling.
- Data scaling.
- Variable encoding.

And much more.

A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.

Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

This includes techniques for:

- Data sampling.
- Data resampling.
- Experimental design.

Resampling techniques such as k-fold cross-validation are often well understood by machine learning practitioners, but the rationale for why this method is required is not.

Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

These include techniques for:

- Checking for a significant difference between results.
- Quantifying the size of the difference between results.

This might include the use of statistical hypothesis tests.

Statistical methods are required when presenting the skill of a final model to stakeholders.

This includes techniques for:

- Summarizing the expected skill of the model on average.
- Quantifying the expected variability of the skill of the model in practice.

This might include estimation statistics such as confidence intervals.

Statistical methods are required when making a prediction with a finalized model on new data.

This includes techniques for:

- Quantifying the expected variability for the prediction.

This might include estimation statistics such as prediction intervals.

For this lesson, you must list three reasons why you personally want to learn statistics.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of statistics.

In this lesson, you will discover a concise definition of statistics.

Statistics is a required prerequisite for most books and courses on applied machine learning. But what exactly is statistics?

Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

It is because the field is comprised of a grab bag of methods for working with data that it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study.

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

**Descriptive Statistics**: Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.**Inferential Statistics**: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

For this lesson, you must list three methods that can be used for each descriptive and inferential statistics.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover the Gaussian distribution and how to calculate summary statistics.

In this lesson, you will discover the Gaussian distribution for data and how to calculate simple descriptive statistics.

A sample of data is a snapshot from a broader population of all possible observations that could be taken from a domain or generated by a process.

Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar with.

A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data.

Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution, can be summarized with just two parameters:

**Mean**. The central tendency or most likely value in the distribution (the top of the bell).**Variance**. The average difference that observations have from the mean value in the distribution (the spread).

The units of the mean are the same as the units of the distribution, although the units of the variance are squared, and therefore harder to interpret. A popular alternative to the variance parameter is the **standard deviation**, which is simply the square root of the variance, returning the units to be the same as those of the distribution.

The mean, variance, and standard deviation can be calculated directly on data samples in NumPy.

The example below generates a sample of 100 random numbers drawn from a Gaussian distribution with a known mean of 50 and a standard deviation of 5 and calculates the summary statistics.

# calculate summary stats from numpy.random import seed from numpy.random import randn from numpy import mean from numpy import var from numpy import std # seed the random number generator seed(1) # generate univariate observations data = 5 * randn(10000) + 50 # calculate statistics print('Mean: %.3f' % mean(data)) print('Variance: %.3f' % var(data)) print('Standard Deviation: %.3f' % std(data))

Run the example and compare the estimated mean and standard deviation from the expected values.

For this lesson, you must implement the calculation of one descriptive statistic from scratch in Python, such as the calculation of a sample mean.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to quantify the relationship between two variables.

In this lesson, you will discover how to calculate a correlation coefficient to quantify the relationship between two variables.

Variables in a dataset may be related for lots of reasons.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variables’ values decrease.

**Positive Correlation**: Both variables change in the same direction.**Neutral Correlation**: No relationship in the change of the variables.**Negative Correlation**: Variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We can quantify the relationship between samples of two variables using a statistical method called Pearson’s correlation coefficient, named for the developer of the method, Karl Pearson.

The *pearsonr()* NumPy function can be used to calculate the Pearson’s correlation coefficient for samples of two variables.

The complete example is listed below showing the calculation where one variable is dependent upon the second.

# calculate correlation coefficient from numpy.random import seed from numpy.random import randn from scipy.stats import pearsonr # seed random number generator seed(1) # prepare data data1 = 20 * randn(1000) + 100 data2 = data1 + (10 * randn(1000) + 50) # calculate Pearson's correlation corr, p = pearsonr(data1, data2) # display the correlation print('Pearsons correlation: %.3f' % corr)

Run the example and review the calculated correlation coefficient.

For this lesson, you must load a standard machine learning dataset and calculate the correlation between each pair of numerical variables.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover statistical hypothesis tests.

In this lesson, you will discover statistical hypothesis tests and how to compare two samples.

Data must be interpreted in order to add meaning. We can interpret data by assuming a specific structure our outcome and use statistical methods to confirm or reject the assumption.

The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

The assumption of a statistical test is called the null hypothesis, or hypothesis zero (H0 for short). It is often called the default assumption, or the assumption that nothing has changed. A violation of the test’s assumption is often called the first hypothesis, hypothesis one, or H1 for short.

**Hypothesis 0 (H0)**: Assumption of the test holds and is failed to be rejected.**Hypothesis 1 (H1)**: Assumption of the test does not hold and is rejected at some level of significance.

We can interpret the result of a statistical hypothesis test using a p-value.

The p-value is the probability of observing the data, given the null hypothesis is true.

A large probability means that the H0 or default assumption is likely. A small value, such as below 5% (o.05) suggests that it is not likely and that we can reject H0 in favor of H1, or that something is likely to be different (e.g. a significant result).

A widely used statistical hypothesis test is the Student’s t-test for comparing the mean values from two independent samples.

The default assumption is that there is no difference between the samples, whereas a rejection of this assumption suggests some significant difference. The tests assumes that both samples were drawn from a Gaussian distribution and have the same variance.

The Student’s t-test can be implemented in Python via the *ttest_ind()* SciPy function.

Below is an example of calculating and interpreting the Student’s t-test for two data samples that are known to be different.

# student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distributions (fail to reject H0)') else: print('Different distributions (reject H0)')

Run the code and review the calculated statistic and interpretation of the p-value.

For this lesson, you must list three other statistical hypothesis tests that can be used to check for differences between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover estimation statistics as an alternative to statistical hypothesis testing.

In this lesson, you will discover estimation statistics that may be used as an alternative to statistical hypothesis tests.

Statistical hypothesis tests can be used to indicate whether the difference between two samples is due to random chance, but cannot comment on the size of the difference.

A group of methods referred to as “*new statistics*” are seeing increased use instead of or in addition to p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated values. This group of statistical methods is referred to as estimation statistics.

Estimation statistics is a term to describe three main classes of methods. The three main

classes of methods include:

**Effect Size**. Methods for quantifying the size of an effect given a treatment or intervention.**Interval Estimation**. Methods for quantifying the amount of uncertainty in a value.**Meta-Analysis**. Methods for quantifying the findings across multiple similar studies.

Of the three, perhaps the most useful methods in applied machine learning are interval estimation.

There are three main types of intervals. They are:

**Tolerance Interval**: The bounds or coverage of a proportion of a distribution with a specific level of confidence.**Confidence Interval**: The bounds on the estimate of a population parameter.**Prediction Interval**: The bounds on a single observation.

A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.

This can be implemented in Python using the *confint()* Statsmodels function.

The function takes the count of successes (or failures), the total number of trials, and the significance level as arguments and returns the lower and upper bound of the confidence interval.

The example below demonstrates this function in a hypothetical case where a model made 88 correct predictions out of a dataset with 100 instances and we are interested in the 95% confidence interval (provided to the function as a significance of 0.05).

# calculate the confidence interval from statsmodels.stats.proportion import proportion_confint # calculate the interval lower, upper = proportion_confint(88, 100, 0.05) print('lower=%.3f, upper=%.3f' % (lower, upper))

Run the example and review the confidence interval on the estimated accuracy.

For this lesson, you must list two methods for calculating the effect size in applied machine learning and when they might be useful.

As a hint, consider one for the relationship between variables and one for the difference between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover nonparametric statistical methods.

In this lesson, you will discover statistical methods that may be used when your data does not come from a Gaussian distribution.

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Data in which the distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

Before a nonparametric statistical method can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its name suggests.

The procedure is as follows:

- Sort all data in the sample in ascending order.
- Assign an integer rank from 1 to N for each unique value in the data sample.

A widely used nonparametric statistical hypothesis test for checking for a difference between two independent samples is the Mann-Whitney U test, named for Henry Mann and Donald Whitney.

It is the nonparametric equivalent of the Student’s t-test but does not assume that the data is drawn from a Gaussian distribution.

The test can be implemented in Python via the *mannwhitneyu()* SciPy function.

The example below demonstrates the test on two data samples drawn from a uniform distribution known to be different.

# example of the mann-whitney u test from numpy.random import seed from numpy.random import rand from scipy.stats import mannwhitneyu # seed the random number generator seed(1) # generate two independent samples data1 = 50 + (rand(100) * 10) data2 = 51 + (rand(100) * 10) # compare samples stat, p = mannwhitneyu(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distribution (fail to reject H0)') else: print('Different distribution (reject H0)')

Run the example and review the calculated statistics and interpretation of the p-value.

For this lesson, you must list three additional nonparametric statistical methods.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of statistics in applied machine learning.
- A concise definition of statistics and a division of methods into two main types.
- The Gaussian distribution and how to describe data with this distribution using statistics.
- How to quantify the relationship between the samples of two variables.
- How to check for the difference between two samples using statistical hypothesis tests.
- An alternative to statistical hypothesis tests called estimation statistics.
- Nonparametric methods that can be used when data is not drawn from the Gaussian distribution.

This is just the beginning of your journey with statistics for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Statistical Methods for Machine Learning.

How did you do with the mini-course?

Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>Because you may use this test yourself someday, it is important to have a deep understanding of how the test works. As a developer, this understanding is best achieved by implementing the hypothesis test yourself from scratch.

In this tutorial, you will discover how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

After completing this tutorial, you will know:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Student’s t-Test
- Student’s t-Test for Independent Samples
- Student’s t-Test for Dependent Samples

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Student’s t-Test is a statistical hypothesis test for testing whether two samples are expected to have been drawn from the same population.

It is named for the pseudonym “*Student*” used by William Gosset, who developed the test.

The test works by checking the means from two samples to see if they are significantly different from each other. It does this by calculating the standard error in the difference between means, which can be interpreted to see how likely the difference is, if the two samples have the same mean (the null hypothesis).

The t statistic calculated by the test can be interpreted by comparing it to critical values from the t-distribution. The critical value can be calculated using the degrees of freedom and a significance level with the percent point function (PPF).

We can interpret the statistic value in a two-tailed test, meaning that if we reject the null hypothesis, it could be because the first mean is smaller or greater than the second mean. To do this, we can calculate the absolute value of the test statistic and compare it to the positive (right tailed) critical value, as follows:

**If abs(t-statistic) <= critical value**: Accept null hypothesis that the means are equal.**If abs(t-statistic) > critical value**: Reject the null hypothesis that the means are equal.

We can also retrieve the cumulative probability of observing the absolute value of the t-statistic using the cumulative distribution function (CDF) of the t-distribution in order to calculate a p-value. The p-value can then be compared to a chosen significance level (alpha) such as 0.05 to determine if the null hypothesis can be rejected:

**If p > alpha**: Accept null hypothesis that the means are equal.**If p <= alpha**: Reject null hypothesis that the means are equal.

In working with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. The test also assumes that the samples have the same variance, and the same size, although there are corrections to the test if these assumptions do not hold. For example, see Welch’s t-test.

There are two main versions of Student’s t-test:

**Independent Samples**. The case where the two samples are unrelated.**Dependent Samples**. The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

Both the independent and the dependent Student’s t-tests are available in Python via the ttest_ind() and ttest_rel() SciPy functions respectively.

**Note**: I recommend using these SciPy functions to calculate the Student’s t-test for your applications, if they are suitable. The library implementations will be faster and less prone to bugs. I would only recommend implementing the test yourself for learning purposes or in the case where you require a modified version of the test.

We will use the SciPy functions to confirm the results from our own version of the tests.

Note, for reference, all calculations presented in this tutorial are taken directly from Chapter 9 “*t Tests*” in “Statistics in Plain English“, Third Edition, 2010. I mention this because you may see the equations with different forms, depending on the reference text that you use.

We’ll start with the most common form of the Student’s t-test: the case where we are comparing the means of two independent samples.

The calculation of the t-statistic for two independent samples is as follows:

t = observed difference between sample means / standard error of the difference between the means

or

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

The standard error of the difference between the means can be calculated as follows:

sed = sqrt(se1^2 + se2^2)

Where *se1* and *se2* are the standard errors for the first and second datasets.

The standard error of a sample can be calculated as:

se = std / sqrt(n)

Where *se* is the standard error of the sample, *std* is the sample standard deviation, and *n* is the number of observations in the sample.

These calculations make the following assumptions:

- The samples are drawn from a Gaussian distribution.
- The size of each sample is approximately equal.
- The samples have the same variance.

We can implement these equations easily using functions from the Python standard library, NumPy and SciPy.

Let’s assume that our two data samples are stored in the variables *data1* and *data2*.

We can start off by calculating the mean for these samples as follows:

# calculate means mean1, mean2 = mean(data1), mean(data2)

We’re halfway there.

Now we need to calculate the standard error.

We can do this manually, first by calculating the sample standard deviations:

# calculate sample standard deviations std1, std2 = std(data1, ddof=1), std(data2, ddof=1)

And then the standard errors:

# calculate standard errors n1, n2 = len(data1), len(data2) se1, se2 = std1/sqrt(n1), std2/sqrt(n2)

Alternately, we can use the *sem()* SciPy function to calculate the standard error directly.

# calculate standard errors se1, se2 = sem(data1), sem(data2)

We can use the standard errors of the samples to calculate the “*standard error on the difference between the samples*“:

# standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0)

We can now calculate the t statistic:

# calculate the t statistic t_stat = (mean1 - mean2) / sed

We can also calculate some other values to help interpret and present the statistic.

The number of degrees of freedom for the test is calculated as the sum of the observations in both samples, minus two.

# degrees of freedom df = n1 + n2 - 2

The critical value can be calculated using the percent point function (PPF) for a given significance level, such as 0.05 (95% confidence).

This function is available for the t distribution in SciPy, as follows:

# calculate the critical value alpha = 0.05 cv = t.ppf(1.0 - alpha, df)

The p-value can be calculated using the cumulative distribution function on the t-distribution, again in SciPy.

# calculate the p-value p = (1 - t.cdf(abs(t_stat), df)) * 2

Here, we assume a two-tailed distribution, where the rejection of the null hypothesis could be interpreted as the first mean is either smaller or larger than the second mean.

We can tie all of these pieces together into a simple function for calculating the t-test for two independent samples:

# function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section we will calculate the t-test on some synthetic data samples.

First, let’s generate two samples of 100 Gaussian random numbers with the same variance of 5 and differing means of 50 and 51 respectively. We will expect the test to reject the null hypothesis and find a significant difference between the samples:

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

We can calculate the t-test on these samples using the built in SciPy function *ttest_ind()*. This will give us a t-statistic value and a p-value to compare to, to ensure that we have implemented the test correctly.

The complete example is listed below.

# Student's t-test for independent samples from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('t=%.3f, p=%.3f' % (stat, p))

Running the example, we can see a t-statistic value and p value.

We will use these as our expected values for the test on these data.

t=-2.262, p=0.025

We can now apply our own implementation on the same data, using the function defined in the previous section.

The function will return a t-statistic value and a critical value. We can use the critical value to interpret the t statistic to see if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

The function also returns a p-value. We can interpret the p-value using an alpha, such as 0.05 to determine if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

We expect that both interpretations will always match.

The complete example is listed below.

# t-test for independent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import sem from scipy.stats import t # function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = independent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example first calculates the test.

The results of the test are printed, including the t-statistic, the degrees of freedom, the critical value, and the p-value.

We can see that both the t-statistic and p-value match the outputs of the SciPy function. The test appears to be implemented correctly.

The t-statistic and the p-value are then used to interpret the results of the test. We find that as we expect, there is sufficient evidence to reject the null hypothesis, finding that the sample means are likely different.

t=-2.262, df=198, cv=1.653, p=0.025 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

We can now look at the case of calculating the Student’s t-test for dependent samples.

This is the case where we collect some observations on a sample from the population, then apply some treatment, and then collect observations from the same sample.

The result is two samples of the same size where the observations in each sample are related or paired.

The t-test for dependent samples is referred to as the paired Student’s t-test.

The calculation of the paired Student’s t-test is similar to the case with independent samples.

The main difference is in the calculation of the denominator.

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

Here, *sed* is calculated as:

sed = sd / sqrt(n)

Where *sd* is the standard deviation of the difference between the dependent sample means and *n* is the total number of paired observations (e.g. the size of each sample).

The calculation of *sd* first requires the calculation of the sum of the squared differences between the samples:

d1 = sum (X1[i] - X2[i])^2 for i in n

It also requires the sum of the (non squared) differences between the samples:

d2 = sum (X1[i] - X2[i]) for i in n

We can then calculate sd as:

sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

That’s it.

We can implement the calculation of the paired Student’s t-test directly in Python.

The first step is to calculate the means of each sample.

# calculate means mean1, mean2 = mean(data1), mean(data2)

Next, we will require the number of pairs (*n*). We will use this in a few different calculations.

# number of paired samples n = len(data1)

Next, we must calculate the sum of the squared differences between the samples, as well as the sum differences.

# sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)])

We can now calculate the standard deviation of the difference between means.

# standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

This is then used to calculate the standard error of the difference between the means.

# standard error of the difference between the means sed = sd / sqrt(n)

Finally, we have everything we need to calculate the t statistic.

# calculate the t statistic t_stat = (mean1 - mean2) / sed

The only other key difference between this implementation and the implementation for independent samples is the calculation of the number of degrees of freedom.

# degrees of freedom df = n - 1

As before, we can tie all of this together into a reusable function. The function will take two paired samples and a significance level (alpha) and calculate the t-statistic, number of degrees of freedom, critical value, and p-value.

The complete function is listed below.

# function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section, we will use the same dataset in the worked example as we did for the independent Student’s t-test.

The data samples are not paired, but we will pretend they are. We expect the test to reject the null hypothesis and find a significant difference between the samples.

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

As before, we can evaluate the test problem with the SciPy function for calculating a paired t-test. In this case, the *ttest_rel()* function.

The complete example is listed below.

# Paired Student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_rel # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_rel(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p))

Running the example calculates and prints the t-statistic and the p-value.

We will use these values to validate the calculation of our own paired t-test function.

Statistics=-2.372, p=0.020

We can now test our own implementation of the paired Student’s t-test.

The complete example, including the developed function and interpretation of the results of the function, is listed below.

# t-test for dependent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import t # function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples (pretend they are dependent) data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = dependent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example calculates the paired t-test on the sample problem.

The calculated t-statistic and p-value match what we expect from the SciPy library implementation. This suggests that the implementation is correct.

The interpretation of the t-test statistic with the critical value, and the p-value with the significance level both find a significant result, rejecting the null hypothesis that the means are equal.

t=-2.372, df=99, cv=1.660, p=0.020 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Apply each test to your own contrived sample problem.
- Update the independent test and add the correction for samples with different variances and sample sizes.
- Perform a code review of one of the tests implemented in the SciPy library and summarize the differences in the implementation details.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Statistics in Plain English, Third Edition, 2010.

In this tutorial, you discovered how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

Specifically, you learned:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers appeared first on Machine Learning Mastery.

]]>In his widely cited 1998 paper, Thomas Dietterich recommended the McNemar’s test in those cases where it is expensive or impractical to train multiple copies of classifier models.

This describes the current situation with deep learning models that are both very large and are trained and evaluated on large datasets, often requiring days or weeks to train a single model.

In this tutorial, you will discover how to use the McNemar’s statistical hypothesis test to compare machine learning classifier models on a single test dataset.

After completing this tutorial, you will know:

- The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
- How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
- How to calculate the McNemar’s test in Python and interpret and report the result.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Statistical Hypothesis Tests for Deep Learning
- Contingency Table
- McNemar’s Test Statistic
- Interpret the McNemar’s Test for Classifiers
- McNemar’s Test in Python

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In his important and widely cited 1998 paper on the use of statistical hypothesis tests to compare classifiers titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms“, Thomas Dietterich recommends the use of the McNemar’s test.

Specifically, the test is recommended in those cases where the algorithms that are being compared can only be evaluated once, e.g. on one test set, as opposed to repeated evaluations via a resampling technique, such as k-fold cross-validation.

For algorithms that can be executed only once, McNemar’s test is the only test with acceptable Type I error.

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

Specifically, Dietterich’s study was concerned with the evaluation of different statistical hypothesis tests, some operating upon the results from resampling methods. The concern of the study was low Type I error, that is, the statistical test reporting an effect when in fact no effect was present (false positive).

Statistical tests that can compare models based on a single test set is an important consideration for modern machine learning, specifically in the field of deep learning.

Deep learning models are often large and operate on very large datasets. Together, these factors can mean that the training of a model can take days or even weeks on fast modern hardware.

This precludes the practical use of resampling methods to compare models and suggests the need to use a test that can operate on the results of evaluating trained models on a single test dataset.

The McNemar’s test may be a suitable test for evaluating these large and slow-to-train deep learning models.

The McNemar’s test operates upon a contingency table.

Before we dive into the test, let’s take a moment to understand how the contingency table for two classifiers is calculated.

A contingency table is a tabulation or count of two categorical variables. In the case of the McNemar’s test, we are interested in binary variables correct/incorrect or yes/no for a control and a treatment or two cases. This is called a 2×2 contingency table.

The contingency table may not be intuitive at first glance. Let’s make it concrete with a worked example.

Consider that we have two trained classifiers. Each classifier makes binary class prediction for each of the 10 examples in a test dataset. The predictions are evaluated and determined to be correct or incorrect.

We can then summarize these results in a table, as follows:

Instance, Classifier1 Correct, Classifier2 Correct 1 Yes No 2 No No 3 No Yes 4 No No 5 Yes Yes 6 Yes Yes 7 Yes Yes 8 No No 9 Yes No 10 Yes Yes

We can see that Classifier1 got 6 correct, or an accuracy of 60%, and Classifier2 got 5 correct, or 50% accuracy on the test set.

The table can now be reduced to a contingency table.

The contingency table relies on the fact that both classifiers were trained on exactly the same training data and evaluated on exactly the same test data instances.

The contingency table has the following structure:

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct ?? ?? Classifier1 Incorrect ?? ??

In the case of the first cell in the table, we must sum the total number of test instances that Classifier1 got correct and Classifier2 got correct. For example, the first instance that both classifiers predicted correctly was instance number 5. The total number of instances that both classifiers predicted correctly was 4.

Another more programmatic way to think about this is to sum each combination of Yes/No in the results table above.

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct Yes/Yes Yes/No Classifier1 Incorrect No/Yes No/No

The results organized into a contingency table are as follows:

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct 4 2 Classifier1 Incorrect 1 3

McNemar’s test is a paired nonparametric or distribution-free statistical hypothesis test.

It is also less intuitive than some other statistical hypothesis tests.

The McNemar’s test is checking if the disagreements between two cases match. Technically, this is referred to as the homogeneity of the contingency table (specifically the marginal homogeneity). Therefore, the McNemar’s test is a type of homogeneity test for contingency tables.

The test is widely used in medicine to compare the effect of a treatment against a control.

In terms of comparing two binary classification algorithms, the test is commenting on whether the two models disagree in the same way (or not). It is not commenting on whether one model is more or less accurate or error prone than another. This is clear when we look at how the statistic is calculated.

The McNemar’s test statistic is calculated as:

statistic = (Yes/No - No/Yes)^2 / (Yes/No + No/Yes)

Where Yes/No is the count of test instances that Classifier1 got correct and Classifier2 got incorrect, and No/Yes is the count of test instances that Classifier1 got incorrect and Classifier2 got correct.

This calculation of the test statistic assumes that each cell in the contingency table used in the calculation has a count of at least 25. The test statistic has a Chi-Squared distribution with 1 degree of freedom.

We can see that only two elements of the contingency table are used, specifically that the Yes/Yes and No/No elements are not used in the calculation of the test statistic. As such, we can see that the statistic is reporting on the different correct or incorrect predictions between the two models, not the accuracy or error rates. This is important to understand when making claims about the finding of the statistic.

The default assumption, or null hypothesis, of the test is that the two cases disagree to the same amount. If the null hypothesis is rejected, it suggests that there is evidence to suggest that the cases disagree in different ways, that the disagreements are skewed.

Given the selection of a significance level, the p-value calculated by the test can be interpreted as follows:

**p > alpha**: fail to reject H0, no difference in the disagreement (e.g. treatment had no effect).**p <= alpha**: reject H0, significant difference in the disagreement (e.g. treatment had an effect).

It is important to take a moment to clearly understand how to interpret the result of the test in the context of two machine learning classifier models.

The two terms used in the calculation of the McNemar’s Test capture the errors made by both models. Specifically, the No/Yes and Yes/No cells in the contingency table. The test checks if there is a significant difference between the counts in these two cells. That is all.

If these cells have counts that are similar, it shows us that both models make errors in much the same proportion, just on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected.

Under the null hypothesis, the two algorithms should have the same error rate …

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

If these cells have counts that are not similar, it shows that both models not only make different errors, but in fact have a different relative proportion of errors on the test set. In this case, the result of the test would be significant and we would reject the null hypothesis.

So we may reject the null hypothesis in favor of the hypothesis that the two algorithms have different performance when trained on the particular training

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

We can summarize this as follows:

**Fail to Reject Null Hypothesis**: Classifiers have a similar proportion of errors on the test set.**Reject Null Hypothesis**: Classifiers have a different proportion of errors on the test set.

After performing the test and finding a significant result, it may be useful to report an effect statistical measure in order to quantify the finding. For example, a natural choice would be to report the odds ratios, or the contingency table itself, although both of these assume a sophisticated reader.

It may be useful to report the difference in error between the two classifiers on the test set. In this case, be careful with your claims as the significant test does not report on the difference in error between the models, only the relative difference in the proportion of error between the models.

Finally, in using the McNemar’s test, Dietterich highlights two important limitations that must be considered. They are:

Generally, model behavior varies based on the specific training data used to fit the model.

This is due to both the interaction of the model with specific training instances and the use of randomness during learning. Fitting the model on multiple different training datasets and evaluating the skill, as is done with resampling methods, provides a way to measure the variance of the model.

The test is appropriate if the sources of variability are small.

Hence, McNemar’s test should only be applied if we believe these sources of variability are small.

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

The two classifiers are evaluated on a single test set, and the test set is expected to be smaller than the training set.

This is different from hypothesis tests that make use of resampling methods as more, if not all, of the dataset is made available as a test set during evaluation (which introduces its own problems from a statistical perspective).

This provides less of an opportunity to compare the performance of the models. It requires that the test set is an appropriately representative of the domain, often meaning that the test dataset is large.

The McNemar’s test can be implemented in Python using the mcnemar() Statsmodels function.

The function takes the contingency table as an argument and returns the calculated test statistic and p-value.

There are two ways to use the statistic depending on the amount of data.

If there is a cell in the table that is used in the calculation of the test statistic that has a count of less than 25, then a modified version of the test is used that calculates an exact p-value using a binomial distribution. This is the default usage of the test:

stat, p = mcnemar(table, exact=True)

Alternately, if all cells used in the calculation of the test statistic in the contingency table have a value of 25 or more, then the standard calculation of the test can be used.

stat, p = mcnemar(table, exact=False, correction=True)

We can calculate the McNemar’s on the example contingency table described above. This contingency table has a small count in both the disagreement cells and as such the exact method must be used.

The complete example is listed below.

# Example of calculating the mcnemar test from statsmodels.stats.contingency_tables import mcnemar # define contingency table table = [[4, 2], [1, 3]] # calculate mcnemar test result = mcnemar(table, exact=True) # summarize the finding print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue)) # interpret the p-value alpha = 0.05 if result.pvalue > alpha: print('Same proportions of errors (fail to reject H0)') else: print('Different proportions of errors (reject H0)')

Running the example calculates the statistic and p-value on the contingency table and prints the results.

We can see that the test strongly confirms that there is very little difference in the disagreements between the two cases. The null hypothesis not rejected.

As we are using the test to compare classifiers, we state that there is no statistically significant difference in the disagreements between the two models.

statistic=1.000, p-value=1.000 Same proportions of errors (fail to reject H0)

This section lists some ideas for extending the tutorial that you may wish to explore.

- Find a research paper in machine learning that makes use of the McNemar’s statistical hypothesis test.
- Update the code example such that the contingency table shows a significant difference in disagreement between the two cases.
- Implement a function that will use the correct version of the McNemar’s test based on the provided contingency table.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Note on the sampling error of the difference between correlated proportions or percentages, 1947.
- Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998.

In this tutorial, you discovered how to use the McNemar’s test statistical hypothesis test to compare machine learning classifier models on a single test dataset.

Specifically, you learned:

- The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
- How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
- How to calculate the McNemar’s test in Python and interpret and report the result.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers appeared first on Machine Learning Mastery.

]]>The post The Role of Randomization to Address Confounding Variables in Machine Learning appeared first on Machine Learning Mastery.

]]>A challenge is that there are aspects of the problem and the algorithm called confounding variables that cannot be controlled (held constant) and must be controlled-for. An example is the use of randomness in a learning algorithm, such as random initialization or random choices during learning.

The solution is to use randomness in a way that has become a standard in applied machine learning. We can learn more about the rationale for using randomness in controlled experiments by looking briefly at why randomness is used to manage confounding variables in medicine through the use of randomized clinical trials.

In this post, you will discover confounding variables and how we can address them using the tool of randomization.

After reading this post, you will know:

- Confounding variables correlated with the independent and dependent variable confuse the effects and impact the results of experiments.
- Applied machine learning is concerned with controlled experiments that do suffer known confounding variables.
- Randomization of experiments is the key to controlling for confounding variables in machine learning experiments.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This post is divided into four parts;l they are:

- Confounding Variables
- Confounding in Machine Learning
- Randomization of Experiments
- Randomization in Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In an experiment, we are often interested in the effect of an independent variable on a dependent variable.

A confounding variable is a variable that confuses the relationship between the independent and the dependent variable.

Confounding, sometimes referred to as confounding bias, is mostly described as a ‘mixing’ or ‘blurring’ of effects.

— Confounding: What it is and how to deal with it, 2008.

A confounding variable can influence the outcome of an experiment in many ways, such as:

- Invalid correlations.
- Increasing variance.
- Introducing a bias.

A confounding variable may be known or unknown.

They are often characterized as having an association or correlation with both the independent and dependent variables.

Another characterization is that the confounding variable affects groups or observations differently.

Confounding variables or confounders are often defined as the variables correlate (positively or negatively) with both the dependent variable and the independent variable. A Confounder is an extraneous variable whose presence affects the variables being studied so that the results do not reflect the actual relationship between the variables under study.

— How to control confounding effects by statistical analysis, 2012.

The deeper difficulty of confounding variables is that it may not be obvious that they exist and are impacting results.

The effects of confounding variables are often not obvious or even identifiable unless they are specifically addressed in the design of the experiment or data collection method.

Confounding variables are traditionally a concern in applied statistics.

This is because in statistics we are often concerned with the effect of independent variables on dependent variables in data. Statistical methods are designed to discover and describe these relationships and confounding variables can essentially corrupt or invalidate discoveries.

Machine learning practitioners are typically interested in the skill of a predictive model and less concerned with the statistical correctness or interpretability of the model. As such, confounding variables are an important topic when it comes to data selection and preparation, but less important than they may be when developing descriptive statistical models.

Nevertheless, confounding variables are critically important in applied machine learning.

The evaluation of a machine learning model is an experiment with independent and dependent variables. As such, it is subject to confounding variables.

What may be surprising is that you already know this and that the gold-standard practices in applied machine learning address this. Therefore, being intimately aware of the confounding variables in machine learning experiments is required to understand the choice and interpretation of machine learning model evaluation.

Consider, what impacts the evaluation of a machine learning model, what are the independent variables?

Some examples include:

- The choice of data preparation schemes.
- The choice of the samples in the training dataset.
- The choice of the samples in the test dataset.
- The choice of learning algorithm.
- The choice of the initialization of the learning algorithm.
- The choice of the configuration of the learning algorithm.

Each of these choices will impact the dependent variable in a machine learning experiment, which is the chosen metric used to estimate the skill of the model when making predictions.

The evaluation of a machine learning model involves the design and execution of controlled experiments. A controlled experiment holds all elements constant except one element under study. The two most common types of controlled experiments in machine learning are:

- Controlled experiments to vary and evaluate learning algorithms.
- Controlled experiments to vary and evaluate learning algorithm configurations.

Nevertheless, there are confounding variables that the controlled experiments cannot hold constant. Specifically, there are sources of randomness, that if they were held constant would result in an invalid evaluation of the model. Three examples include:

- Randomness in the data sample.
- Randomness in model initialization.
- Randomness in the learning algorithm.

For example, weights in a neural network are initialized to random values. Stochastic gradient descent randomizes the order of samples in an epoch to vary the types of updates performed. Random subsets of features are selected for each possible cut point in random forest. And many more examples.

Randomization in machine learning algorithms is not a bug; it is a feature intended to improve the performance of the model on average over classical deterministic methods.

Randomness can be present in ML at many different levels, usually enhancing performance or alleviating problems and difficulties of classical methods.

— Randomized Machine Learning Approaches: Recent Developments and Challenges, 2017.

These are confounding variables that we cannot hold constant. If they are held constant, the evaluation of the model will no longer provide insight into the generalizability of the result. We will know how well the model performs on a specific data sample or initialization of sequence of decisions during learning, but little idea on how the model will perform in general.

The way that we can handle confounding variables that we cannot control is by using randomization.

Randomization is a technique used in experimental design to give control over confounding variables that cannot (should not) be held constant.

For example, randomization is used in clinical experiments to control-for the biological differences between individual human beings when evaluating a treatment. It is the reason why a treatment must be evaluated on multiple individuals rather than on a single individual before the findings can be generalized.

In randomization the random assignment of study subjects to exposure categories to breaking any links between exposure and confounders. This reduces potential for confounding by generating groups that are fairly comparable with respect to known and unknown confounding variables.

— How to control confounding effects by statistical analysis, 2012.

Randomization is a simple tool in experimental design that allows the confounding variables to have their effect across a sample. It shifts the experiment from looking at an individual case to a collection of observations, where statistical tools are used to interpret the finding.

In medicine, randomization is the gold standard for evaluating a treatment and is called the randomized clinical trial. It is designed to remove not only the confounding effects of biological differences, but also the bias, such as the effect of the experimenter choosing the members of the treatment and non-treatment groups. You can imagine that a treatment would look very successful if the least-sick members of a cohort were chosen to be administered.

An [Randomized clinical trial] is a special kind of cohort study, with the characteristic that patients are randomly assigned to the experimental group (with exposure) and the control group (without exposure). […] Therefore, randomization helps to prevent selection by the clinician, and helps to establish groups that are equal with respect to relevant prognostic factors.

— The randomized clinical trial: An unbeatable standard in clinical research?, 2007.

There are still confounding variables when using a randomized clinical trial. An example is the case where the experimenters know what treatment participants of the study are receiving. This can impact the way the experimenters interact with the participants, which in turn can impact the results of the experiment.

The answer is to use blinding where participants or experimenters do not know the treatment. Ideally, a double-blind experiment is adopted, ensuring that both participates and experimenters are unaware of their treatment.

When feasible, it is strongly recommended that also after randomization, patients and clinicians do not know who receives the intervention and who does not. Studies may be single blind (either the patient or the clinician does not know who receives the treatment and who does not) or double blind (both the patient and the clinician do not know who receives the treatment).

— The randomized clinical trial: An unbeatable standard in clinical research?, 2007.

Note, before we move on to look at the use of randomization in machine learning, consider that there are other approaches to managing the effect of confounding variables. Wikipedia has a good list here.

Randomization is used in the evaluation of machine learning models to manage the uncontrollable confounding variables.

It is key to the standard ways described for evaluating machine learning models and the rationale for using methods such as data resampling and repeating experiments.

- Resampling methods are used to randomize the training and test datasets to help estimate the skill of training and evaluating models on random samples of data from the domain, rather than on a specific sample of data.
- Evaluation experiments are repeated to help estimate the skill of the model with different random initialization and learning decisions, rather than on a single set of initial conditions and sequence of learning decisions.

Randomization allows the machine learning practitioner to generalize a finding, to make it useful and applicable. It’s the reason why careful design of the test harness and resampling method is important. It is the reason why we repeat the evaluation of a model and the reason we don’t fix the seed on the pseudorandom number generator.

I talk more about these topics in the posts:

When we take a closer look at why we use randomization, to control for confounding variables, it raises questions about the other confounders that we may not be controlling for.

For example, the machine learning practitioner knowing the skill of models prior to giving each model a chance to do its best via data preparation and hyperparameter tuning. Perhaps practitioners should blind themselves to remove the possibility of biasing the choice of final model.

The risk is that the practitioner that really likes artificial neural networks will “*discover*” a neural network configuration that outperforms other models.

At best it is a statistical fluke or violation of Occam’s Razor for a parsimonious solution to a predictive modeling project; at worst, it is scientific fraud. The reason that clinicians aggressively removed this bias is people’s lives were at risk. We may get to that point with machine learning algorithms, e.g. in cars.

In practice, today, I think this is good motivation for front-loading an experiment with a large and careful design and automating the execution and statistical interpretation of the results.

This section provides more resources on the topic if you are looking to go deeper.

- Confounding on Wikipedia
- Controlling for a variable on Wikipedia
- Randomized controlled trial on Wikipedia
- The randomized clinical trial: An unbeatable standard in clinical research?, 2007.
- Confounding: What it is and how to deal with it, 2008.
- Confounding variables in machine learning predictions? on Cross Validated
- How to control confounding effects by statistical analysis, 2012.
- Randomized Machine Learning Approaches: Recent Developments and Challenges, 2017.

In this post, you discovered confounding variables and how we can address them using the tool of randomization.

Specifically, you learned:

- Confounding variables correlated with the independent and dependent variable and confuse the effects and impact the results of experiments.
- Applied machine learning is concerned with controlled experiments that do suffer known confounding variables.
- Randomization of experiments is the key to controlling for confounding variables in machine learning experiments.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Role of Randomization to Address Confounding Variables in Machine Learning appeared first on Machine Learning Mastery.

]]>The post All of Statistics for Machine Learning appeared first on Machine Learning Mastery.

]]>The book “All of Statistics” was written specifically to provide a foundation in probability and statistics for computer science undergraduates that may have an interest in data mining and machine learning. As such, it is often recommended as a book to machine learning practitioners interested in expanding their understanding of statistics.

In this post, you will discover the book “All of Statistics”, the topics it covers, and a reading list intended for machine learning practitioners.

After reading this post, you will know:

- Larry Wasserman wrote “
*All of Statistics*” to quickly bring computer science students up to speed with probability and statistics. - The book provides a broad coverage of the field of statistics with a focus on the mathematical presentation of the topics covered.
- The book covers much more than is required by machine learning practitioners, but a select reading of topics will be helpful for those that prefer a mathematical treatment.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

The book “All of Statistics: A Concise Course in Statistical Inference” was written by Larry Wasserman and released in 2004.

Wasserman is a professor of statistics and data science at Carnegie Mellon University.

The book is ambitious.

It seeks to quickly bring computer science students up-to-speed with probability and statistics. As such, the topics covered by the book are very broad, perhaps broader than the average introductory textbooks.

Taken literally, the title “All of Statistics” is an exaggeration. But in spirit, the title is apt, as the book does cover a much broader range of topics than a typical introductory book on mathematical statistics. This book is for people who want to learn probability and statistics quickly.

— Page vii, All of Statistics: A Concise Course in Statistical Inference, 2004.

The book is not for the average practitioner; it is intended for computer science undergraduate students. It does assume some prior knowledge in calculus and linear algebra. If you don’t like equations or mathematical notation, this book is not for you.

Interestingly, Wasserman wrote the book in response to the rise of data mining and machine learning in computer science occurring outside of classical statistics. He asserts in the preface the importance of having a grounding in statistics in order to be effective in machine learning.

Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid.

— Pages vii-viii, All of Statistics: A Concise Course in Statistical Inference, 2004.

The material is presented in a very clear and concise manner. A systematic approach is taken with brief descriptions of a method, equations describing its implementation, and worked examples to motivate the use of the method with sample code in R.

In fact, the material is so compact that it often reads like a series of encyclopedia examples. This is great if you want to know how to implement a method, but very challenging if you are new to the methods and seeking intuitions.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The choice of topics covered by the book is very broad, as mentioned in the previous section.

This is great on the one hand as the reader is given exposure to advanced subjects early on. The downside of this aggressive scope is that topics are touched on briefly with very little hand holding. You are left to re-read sections until you get it.

Let’s look at the topics covered by the book.

This is helpful to both get an idea of the presented scope of the field and the context for the topics that may interest you as a machine learning practitioner.

The book is divided into three parts; they are:

- I Probability
- II Statistical Inference
- III Statistical Models and Methods

The first part of the book focuses on probability theory and formal language for describing uncertainty. The second part is focused on statistical inference. The third part focuses on specific methods and problems raised in the second part.

The book does have a reference or encyclopedia feeling. As such, there are a lot of chapters, but each chapter is reasonably standalone. The book is divided into 24 chapters; they are:

- Chapter 1: Probability
- Chapter 2: Random Variables
- Chapter 3: Expectation
- Chapter 4: Inequalities
- Chapter 5: Convergence of Random Variables
- Chapter 6: Models, Statistical Inference and Learning
- Chapter 7: Estimating the CDF and Statistical Functions
- Chapter 8: The Bootstrap
- Chapter 9: Parametric Inference
- Chapter 10: Hypothesis Testing and p-values
- Chapter 11: Bayesian Inference
- Chapter 12: Statistical Decision Theory
- Chapter 13: Linear and Logistic Regression
- Chapter 14: Multivariate Models
- Chapter 15: Inference About Independence
- Chapter 16: Causal Inference
- Chapter 17: Directed Graphs and Conditional Independence
- Chapter 18: Undirected Graphs
- Chapter 19: Log-Linear Models
- Chapter 20: Nonparametric Curve Estimation
- Chapter 21: Smoothing Using Orthogonal Functions
- Chapter 22: Classification
- Chapter 23: Probability Redux: Stochastic Processes

Chapter 24: Simulation Methods

The preface for the book provides a useful glossary of terms mapping them from statistics to computer science. This “Statistics/Data Mining Dictionary” is reproduced below.

All of the R code and datasets used in the worked examples in the book are available from Wasserman’s homepage. This is very helpful as you can focus on experimenting with the examples rather than typing in the code and hoping that you got the syntax correct.

I would not recommend this book to developers who have not touched statistics before. It’s too challenging.

I would recommend this book to computer science students who are in math-learning-mode. I would also recommend it to machine learning practitioners with some previous background in statistics or a strong mathematical foundation.

If you are comfortable with mathematical notation and you know what you’re looking for, this book is an excellent reference. You can flip to the topic or the method and get a crisp presentation.

The problem is, for a machine learning practitioner, you do need to know about many of these topics, just not at the level of detail presented. Perhaps a shade lighter, at the intuition level. If you are up to it, it would be worth reading (or skimming) the following chapters in order to build a solid foundation in probability for statistics:

- Chapter 1: Probability
- Chapter 2: Random Variables
- Chapter 3: Expectation
- Chapter 5: Convergence of Random Variables

Again, these are important topics, but you require a concept-level understanding only.

For coverage of statistical hypothesis tests that you may use to interpret data and compare the skill of models, the following chapters are recommended reading:

- Chapter 6: Models, Statistical Inference and Learning
- Chapter 9: Parametric Inference
- Chapter 10: Hypothesis Testing and p-values

I would also recommend the chapter on the Bootstrap. It’s just a great method to have in your head, but with a focus for either better understanding bagging and random forest or as a procedure for estimating confidence intervals of model skill.

- Chapter 8: The Bootstrap

Finally, a statistical approach is used to present machine learning algorithms. I would recommend these chapters if you prefer a more mathematical treatment of regression and classification algorithms:

- Chapter 12: Statistical Decision Theory
- Chapter 13: Linear and Logistic Regression
- Chapter 22: Classification

I can read the mathematical presentation of statistics, but I prefer intuitions and working code. I am less likely to pick up this book from my bookcase, in favor of gentler treatments such as “Statistics in Plain English” or application focused treatments such as “Empirical Methods for Artificial Intelligence“.

Do you agree with this reading list?

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- All of Statistics: A Concise Course in Statistical Inference, 2004.
- All of Statistics Errata
- Larry Wasserman Homepage

In this post, you discovered the book “*All of Statistics*” that provides a broad and concise introduction to statistics.

Specifically, you learned:

- Larry Wasserman wrote “
*All of Statistics*” to quickly bring computer science students up to speed with probability and statistics. - The book provides a broad coverage of the field of statistics with a focus on the mathematical presentation of the topics covered.
- The book covers much more than is required by machine learning practitioners, but a select reading of topics will be helpful for those that prefer a mathematical treatment.

Have you read this book?

What did you think of it? Let me know in the comments below.

Are you thinking of picking up a copy of this book?

Let me know in the comments.

The post All of Statistics for Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Statistical Power and Power Analysis in Python appeared first on Machine Learning Mastery.

]]>Power can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required in order to detect an effect in an experiment.

In this tutorial, you will discover the importance of the statistical power of a hypothesis test and now to calculate power analyses and power curves as part of experimental design.

After completing this tutorial, you will know:

- Statistical power is the probability of a hypothesis test of finding an effect if there is an effect to be found.
- A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.
- How to calculate and plot power analysis for the Student’s t test in Python in order to effectively design an experiment.

**Kick-start your project** with my new book Statistics for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Statistical Hypothesis Testing
- What Is Statistical Power?
- Power Analysis
- Student’s t Test Power Analysis

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

A statistical hypothesis test makes an assumption about the outcome, called the null hypothesis.

For example, the null hypothesis for the Pearson’s correlation test is that there is no relationship between two variables. The null hypothesis for the Student’s t test is that there is no difference between the means of two populations.

The test is often interpreted using a p-value, which is the probability of observing the result given that the null hypothesis is true, not the reverse, as is often the case with misinterpretations.

**p-value (p)**: Probability of obtaining a result equal to or more extreme than was observed in the data.

In interpreting the p-value of a significance test, you must specify a significance level, often referred to as the Greek lower case letter alpha (a). A common value for the significance level is 5% written as 0.05.

The p-value is interested in the context of the chosen significance level. A result of a significance test is claimed to be “*statistically significant*” if the p-value is less than the significance level. This means that the null hypothesis (that there is no result) is rejected.

**p <= alpha**: reject H0, different distribution.**p > alpha**: fail to reject H0, same distribution.

Where:

**Significance level (alpha)**: Boundary for specifying a statistically significant finding when interpreting the p-value.

We can see that the p-value is just a probability and that in actuality the result may be different. The test could be wrong. Given the p-value, we could make an error in our interpretation.

There are two types of errors; they are:

**Type I Error**. Reject the null hypothesis when there is in fact no significant effect (false positive). The p-value is optimistically small.**Type II Error**. Not reject the null hypothesis when there is a significant effect (false negative). The p-value is pessimistically large.

In this context, we can think of the significance level as the probability of rejecting the null hypothesis if it were true. That is the probability of making a Type I Error or a false positive.

Statistical power, or the power of a hypothesis test is the probability that the test correctly rejects the null hypothesis.

That is, the probability of a true positive result. It is only useful when the null hypothesis is rejected.

… statistical power is the probability that a test will correctly reject a false null hypothesis. Statistical power has relevance only when the null is false.

— Page 60, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The higher the statistical power for a given experiment, the lower the probability of making a Type II (false negative) error. That is the higher the probability of detecting an effect when there is an effect. In fact, the power is precisely the inverse of the probability of a Type II error.

Power = 1 - Type II Error Pr(True Positive) = 1 - Pr(False Negative)

More intuitively, the statistical power can be thought of as the probability of accepting an alternative hypothesis, when the alternative hypothesis is true.

When interpreting statistical power, we seek experiential setups that have high statistical power.

**Low Statistical Power**: Large risk of committing Type II errors, e.g. a false negative.**High Statistical Power**: Small risk of committing Type II errors.

Experimental results with too low statistical power will lead to invalid conclusions about the meaning of the results. Therefore a minimum level of statistical power must be sought.

It is common to design experiments with a statistical power of 80% or better, e.g. 0.80. This means a 20% probability of encountering a Type II area. This different to the 5% likelihood of encountering a Type I error for the standard value for the significance level.

Statistical power is one piece in a puzzle that has four related parts; they are:

**Effect Size**. The quantified magnitude of a result present in the population. Effect size is calculated using a specific statistical measure, such as Pearson’s correlation coefficient for the relationship between variables or Cohen’s d for the difference between groups.**Sample Size**. The number of observations in the sample.**Significance**. The significance level used in the statistical test, e.g. alpha. Often set to 5% or 0.05.**Statistical Power**. The probability of accepting the alternative hypothesis if it is true.

All four variables are related. For example, a larger sample size can make an effect easier to detect, and the statistical power can be increased in a test by increasing the significance level.

A power analysis involves estimating one of these four parameters given values for three other parameters. This is a powerful tool in both the design and in the analysis of experiments that we wish to interpret using statistical hypothesis tests.

For example, the statistical power can be estimated given an effect size, sample size and significance level. Alternately, the sample size can be estimated given different desired levels of significance.

Power analysis answers questions like “how much statistical power does my study have?” and “how big a sample size do I need?”.

— Page 56, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

Perhaps the most common use of a power analysis is in the estimation of the minimum sample size required for an experiment.

Power analyses are normally run before a study is conducted. A prospective or a priori power analysis can be used to estimate any one of the four power parameters but is most often used to estimate required sample sizes.

— Page 57, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

As a practitioner, we can start with sensible defaults for some parameters, such as a significance level of 0.05 and a power level of 0.80. We can then estimate a desirable minimum effect size, specific to the experiment being performed. A power analysis can then be used to estimate the minimum sample size required.

In addition, multiple power analyses can be performed to provide a curve of one parameter against another, such as the change in the size of an effect in an experiment given changes to the sample size. More elaborate plots can be created varying three of the parameters. This is a useful tool for experimental design.

We can make the idea of statistical power and power analysis concrete with a worked example.

In this section, we will look at the Student’s t test, which is a statistical hypothesis test for comparing the means from two samples of Gaussian variables. The assumption, or null hypothesis, of the test is that the sample populations have the same mean, e.g. that there is no difference between the samples or that the samples are drawn from the same underlying population.

The test will calculate a p-value that can be interpreted as to whether the samples are the same (fail to reject the null hypothesis), or there is a statistically significant difference between the samples (reject the null hypothesis). A common significance level for interpreting the p-value is 5% or 0.05.

**Significance level (alpha)**: 5% or 0.05.

The size of the effect of comparing two groups can be quantified with an effect size measure. A common measure for comparing the difference in the mean from two groups is the Cohen’s d measure. It calculates a standard score that describes the difference in terms of the number of standard deviations that the means are different. A large effect size for Cohen’s d is 0.80 or higher, as is commonly accepted when using the measure.

**Effect Size**: Cohen’s d of at least 0.80.

We can use the default and assume a minimum statistical power of 80% or 0.8.

**Statistical Power**: 80% or 0.80.

For a given experiment with these defaults, we may be interested in estimating a suitable sample size. That is, how many observations are required from each sample in order to at least detect an effect of 0.80 with an 80% chance of detecting the effect if it is true (20% of a Type II error) and a 5% chance of detecting an effect if there is no such effect (Type I error).

We can solve this using a power analysis.

The statsmodels library provides the TTestIndPower class for calculating a power analysis for the Student’s t test with independent samples. Of note is the TTestPower class that can perform the same analysis for the paired Student’s t test.

The function solve_power() can be used to calculate one of the four parameters in a power analysis. In our case, we are interested in calculating the sample size. We can use the function by providing the three pieces of information we know (*alpha*, *effect*, and *power*) and setting the size of argument we wish to calculate the answer of (*nobs1*) to “*None*“. This tells the function what to calculate.

A note on sample size: the function has an argument called ratio that is the ratio of the number of samples in one sample to the other. If both samples are expected to have the same number of observations, then the ratio is 1.0. If, for example, the second sample is expected to have half as many observations, then the ratio would be 0.5.

The TTestIndPower instance must be created, then we can call the *solve_power()* with our arguments to estimate the sample size for the experiment.

# perform power analysis analysis = TTestIndPower() result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

The complete example is listed below.

# estimate sample size via power analysis from statsmodels.stats.power import TTestIndPower # parameters for power analysis effect = 0.8 alpha = 0.05 power = 0.8 # perform power analysis analysis = TTestIndPower() result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha) print('Sample Size: %.3f' % result)

Running the example calculates and prints the estimated number of samples for the experiment as 25. This would be a suggested minimum number of samples required to see an effect of the desired size.

Sample Size: 25.525

We can go one step further and calculate power curves.

Power curves are line plots that show how the change in variables, such as effect size and sample size, impact the power of the statistical test.

The plot_power() function can be used to create power curves. The dependent variable (x-axis) must be specified by name in the ‘*dep_var*‘ argument. Arrays of values can then be specified for the sample size (*nobs*), effect size (*effect_size*), and significance (*alpha*) parameters. One or multiple curves will then be plotted showing the impact on statistical power.

For example, we can assume a significance of 0.05 (the default for the function) and explore the change in sample size between 5 and 100 with low, medium, and high effect sizes.

# calculate power curves from multiple power analyses analysis = TTestIndPower() analysis.plot_power(dep_var='nobs', nobs=arange(5, 100), effect_size=array([0.2, 0.5, 0.8]))

The complete example is listed below.

# calculate power curves for varying sample and effect size from numpy import array from matplotlib import pyplot from statsmodels.stats.power import TTestIndPower # parameters for power analysis effect_sizes = array([0.2, 0.5, 0.8]) sample_sizes = array(range(5, 100)) # calculate power curves from multiple power analyses analysis = TTestIndPower() analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes) pyplot.show()

Running the example creates the plot showing the impact on statistical power (y-axis) for three different effect sizes (*es*) as the sample size (x-axis) is increased.

We can see that if we are interested in a large effect that a point of diminishing returns in terms of statistical power occurs at around 40-to-50 observations.

Usefully, statsmodels has classes to perform a power analysis with other statistical tests, such as the F-test, Z-test, and the Chi-Squared test.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Plot the power curves of different standard significance levels against the sample size.
- Find an example of a study that reports the statistical power of the experiment.
- Prepare examples of a performance analysis for other statistical tests provided by statsmodels.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.
- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.
- Statistical Power Analysis for the Behavioral Sciences, 1988.
- Applied Power Analysis for the Behavioral Sciences, 2010.

- Statsmodels Power and Sample Size Calculations
- statsmodels.stats.power.TTestPower API
- statsmodels.stats.power.TTestIndPower
- statsmodels.stats.power.TTestIndPower.solve_power() API

statsmodels.stats.power.TTestIndPower.plot_power() API - Statistical Power in Statsmodels, 2013.
- Power Plots in statsmodels, 2013.

- Statistical power on Wikipedia
- Statistical hypothesis testing on Wikipedia
- Statistical significance on Wikipedia
- Sample size determination on Wikipedia
- Effect size on Wikipedia
- Type I and type II errors on Wikipedia

In this tutorial, you discovered the statistical power of a hypothesis test and how to calculate power analyses and power curves as part of experimental design.

Specifically, you learned:

- Statistical power is the probability of a hypothesis test of finding an effect if there is an effect to be found.
- A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.
- How to calculate and plot power analysis for the Student’s t test in Python in order to effectively design an experiment.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Statistical Power and Power Analysis in Python appeared first on Machine Learning Mastery.

]]>