The post 15 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

- The name of the test.
- What the test is checking.
- The key assumptions of the test.
- How the test result is interpreted.
- Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.

Let’s get started.

**Update Nov/2018**: Added a better overview of the tests covered.

This tutorial is divided into four parts; they are:

**Normality Tests**- Shapiro-Wilk Test
- D’Agostino’s K^2 Test
- Anderson-Darling Test

**Correlation Tests**- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Chi-Squared Test

**Parametric Statistical Hypothesis Tests**- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test

**Nonparametric Statistical Hypothesis Tests**- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import shapiro data = .... stat, p = shapiro(data)

More Information

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import normaltest data = .... stat, p = normaltest(data)

More Information

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

from scipy.stats import anderson data = .... result = anderson(data)

More Information

This section lists statistical tests that you can use to check if two samples are related.

Tests whether two samples have a linear relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import pearsonr data1, data2 = ... corr, p = pearsonr(data1, data2)

More Information

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import spearmanr data1, data2 = ... corr, p = spearmanr(data1, data2)

More Information

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import kendalltau data1, data2 = ... corr, p = kendalltau(data1, data2)

More Information

Tests whether two categorical variables are related or independent.

Assumptions

- Observations used in the calculation of the contingency table are independent.
- 25 or more examples in each cell of the contingency table.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

from scipy.stats import chi2_contingency table = ... stat, p, dof, expected = chi2_contingency(table)

More Information

This section lists statistical tests that you can use to compare data samples.

Tests whether the means of two independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

from scipy.stats import ttest_ind data1, data2 = ... stat, p = ttest_ind(data1, data2)

More Information

Tests whether the means of two paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

from scipy.stats import ttest_rel data1, data2 = ... stat, p = ttest_rel(data1, data2)

More Information

Tests whether the means of two or more independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

from scipy.stats import f_oneway data1, data2, ... = ... stat, p = f_oneway(data1, data2, ...)

More Information

Tests whether the means of two or more paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

Currently not supported in Python.

More Information

Tests whether the distributions of two independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

from scipy.stats import mannwhitneyu data1, data2 = ... stat, p = mannwhitneyu(data1, data2)

More Information

Tests whether the distributions of two paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

from scipy.stats import wilcoxon data1, data2 = ... stat, p = wilcoxon(data1, data2)

More Information

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

from scipy.stats import kruskal data1, data2, ... = ... stat, p = kruskal(data1, data2, ...)

More Information

Tests whether the distributions of two or more paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

from scipy.stats import friedmanchisquare data1, data2, ... = ... stat, p = friedmanchisquare(data1, data2, ...)

More Information

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Normality Tests in Python
- How to Use Correlation to Understand the Relationship Between Variables
- How to Use Parametric Statistical Significance Tests in Python
- A Gentle Introduction to Statistical Hypothesis Tests

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

- The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
- The key assumptions for each test and how to interpret the test result.
- How to implement the test using the Python API.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Did I miss an important statistical test or key assumption for one of the listed tests?

Let me know in the comments below.

The post 15 Statistical Hypothesis Tests in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

]]>The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>Statistics is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although statistics is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what statistics is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement statistical methods used in machine learning with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You want to learn statistics to deepen your understanding and application of machine learning.

You do NOT need to know:

- You do not need to be a math wiz!
- You do not need to be a machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of statistical methods.

Note: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with statistics for machine learning in Python:

**Lesson 01**: Statistics and Machine Learning**Lesson 02**: Introduction to Statistics**Lesson 03**: Gaussian Distribution and Descriptive Stats**Lesson 04**: Correlation Between Variables**Lesson 05**: Statistical Hypothesis Tests**Lesson 06**: Estimation Statistics**Lesson 07**: Nonparametric Statistics

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the statistical methods and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Statistical Methods for Machine Learning.”

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this lesson, you will discover the five reasons why a machine learning practitioner should deepen their understanding of statistics.

Statistical methods are required in the preparation of train and test data for your machine learning model.

This includes techniques for:

- Outlier detection.
- Missing value imputation.
- Data sampling.
- Data scaling.
- Variable encoding.

And much more.

A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.

Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

This includes techniques for:

- Data sampling.
- Data resampling.
- Experimental design.

Resampling techniques such as k-fold cross-validation are often well understood by machine learning practitioners, but the rationale for why this method is required is not.

Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

These include techniques for:

- Checking for a significant difference between results.
- Quantifying the size of the difference between results.

This might include the use of statistical hypothesis tests.

Statistical methods are required when presenting the skill of a final model to stakeholders.

This includes techniques for:

- Summarizing the expected skill of the model on average.
- Quantifying the expected variability of the skill of the model in practice.

This might include estimation statistics such as confidence intervals.

Statistical methods are required when making a prediction with a finalized model on new data.

This includes techniques for:

- Quantifying the expected variability for the prediction.

This might include estimation statistics such as prediction intervals.

For this lesson, you must list three reasons why you personally want to learn statistics.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of statistics.

In this lesson, you will discover a concise definition of statistics.

Statistics is a required prerequisite for most books and courses on applied machine learning. But what exactly is statistics?

Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

It is because the field is comprised of a grab bag of methods for working with data that it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study.

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

**Descriptive Statistics**: Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.**Inferential Statistics**: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.

For this lesson, you must list three methods that can be used for each descriptive and inferential statistics.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover the Gaussian distribution and how to calculate summary statistics.

In this lesson, you will discover the Gaussian distribution for data and how to calculate simple descriptive statistics.

A sample of data is a snapshot from a broader population of all possible observations that could be taken from a domain or generated by a process.

Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar with.

A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data.

Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution, can be summarized with just two parameters:

**Mean**. The central tendency or most likely value in the distribution (the top of the bell).**Variance**. The average difference that observations have from the mean value in the distribution (the spread).

The units of the mean are the same as the units of the distribution, although the units of the variance are squared, and therefore harder to interpret. A popular alternative to the variance parameter is the **standard deviation**, which is simply the square root of the variance, returning the units to be the same as those of the distribution.

The mean, variance, and standard deviation can be calculated directly on data samples in NumPy.

The example below generates a sample of 100 random numbers drawn from a Gaussian distribution with a known mean of 50 and a standard deviation of 5 and calculates the summary statistics.

# calculate summary stats from numpy.random import seed from numpy.random import randn from numpy import mean from numpy import var from numpy import std # seed the random number generator seed(1) # generate univariate observations data = 5 * randn(10000) + 50 # calculate statistics print('Mean: %.3f' % mean(data)) print('Variance: %.3f' % var(data)) print('Standard Deviation: %.3f' % std(data))

Run the example and compare the estimated mean and standard deviation from the expected values.

For this lesson, you must implement the calculation of one descriptive statistic from scratch in Python, such as the calculation of a sample mean.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to quantify the relationship between two variables.

In this lesson, you will discover how to calculate a correlation coefficient to quantify the relationship between two variables.

Variables in a dataset may be related for lots of reasons.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variables’ values decrease.

**Positive Correlation**: Both variables change in the same direction.**Neutral Correlation**: No relationship in the change of the variables.**Negative Correlation**: Variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We can quantify the relationship between samples of two variables using a statistical method called Pearson’s correlation coefficient, named for the developer of the method, Karl Pearson.

The *pearsonr()* NumPy function can be used to calculate the Pearson’s correlation coefficient for samples of two variables.

The complete example is listed below showing the calculation where one variable is dependent upon the second.

# calculate correlation coefficient from numpy.random import seed from numpy.random import randn from scipy.stats import pearsonr # seed random number generator seed(1) # prepare data data1 = 20 * randn(1000) + 100 data2 = data1 + (10 * randn(1000) + 50) # calculate Pearson's correlation corr, p = pearsonr(data1, data2) # display the correlation print('Pearsons correlation: %.3f' % corr)

Run the example and review the calculated correlation coefficient.

For this lesson, you must load a standard machine learning dataset and calculate the correlation between each pair of numerical variables.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover statistical hypothesis tests.

In this lesson, you will discover statistical hypothesis tests and how to compare two samples.

Data must be interpreted in order to add meaning. We can interpret data by assuming a specific structure our outcome and use statistical methods to confirm or reject the assumption.

The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

The assumption of a statistical test is called the null hypothesis, or hypothesis zero (H0 for short). It is often called the default assumption, or the assumption that nothing has changed. A violation of the test’s assumption is often called the first hypothesis, hypothesis one, or H1 for short.

**Hypothesis 0 (H0)**: Assumption of the test holds and is failed to be rejected.**Hypothesis 1 (H1)**: Assumption of the test does not hold and is rejected at some level of significance.

We can interpret the result of a statistical hypothesis test using a p-value.

The p-value is the probability of observing the data, given the null hypothesis is true.

A large probability means that the H0 or default assumption is likely. A small value, such as below 5% (o.05) suggests that it is not likely and that we can reject H0 in favor of H1, or that something is likely to be different (e.g. a significant result).

A widely used statistical hypothesis test is the Student’s t-test for comparing the mean values from two independent samples.

The default assumption is that there is no difference between the samples, whereas a rejection of this assumption suggests some significant difference. The tests assumes that both samples were drawn from a Gaussian distribution and have the same variance.

The Student’s t-test can be implemented in Python via the *ttest_ind()* SciPy function.

Below is an example of calculating and interpreting the Student’s t-test for two data samples that are known to be different.

# student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distributions (fail to reject H0)') else: print('Different distributions (reject H0)')

Run the code and review the calculated statistic and interpretation of the p-value.

For this lesson, you must list three other statistical hypothesis tests that can be used to check for differences between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover estimation statistics as an alternative to statistical hypothesis testing.

In this lesson, you will discover estimation statistics that may be used as an alternative to statistical hypothesis tests.

Statistical hypothesis tests can be used to indicate whether the difference between two samples is due to random chance, but cannot comment on the size of the difference.

A group of methods referred to as “*new statistics*” are seeing increased use instead of or in addition to p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated values. This group of statistical methods is referred to as estimation statistics.

Estimation statistics is a term to describe three main classes of methods. The three main

classes of methods include:

**Effect Size**. Methods for quantifying the size of an effect given a treatment or intervention.**Interval Estimation**. Methods for quantifying the amount of uncertainty in a value.**Meta-Analysis**. Methods for quantifying the findings across multiple similar studies.

Of the three, perhaps the most useful methods in applied machine learning are interval estimation.

There are three main types of intervals. They are:

**Tolerance Interval**: The bounds or coverage of a proportion of a distribution with a specific level of confidence.**Confidence Interval**: The bounds on the estimate of a population parameter.**Prediction Interval**: The bounds on a single observation.

A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.

This can be implemented in Python using the *confint()* Statsmodels function.

The function takes the count of successes (or failures), the total number of trials, and the significance level as arguments and returns the lower and upper bound of the confidence interval.

The example below demonstrates this function in a hypothetical case where a model made 88 correct predictions out of a dataset with 100 instances and we are interested in the 95% confidence interval (provided to the function as a significance of 0.05).

# calculate the confidence interval from statsmodels.stats.proportion import proportion_confint # calculate the interval lower, upper = proportion_confint(88, 100, 0.05) print('lower=%.3f, upper=%.3f' % (lower, upper))

Run the example and review the confidence interval on the estimated accuracy.

For this lesson, you must list two methods for calculating the effect size in applied machine learning and when they might be useful.

As a hint, consider one for the relationship between variables and one for the difference between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover nonparametric statistical methods.

In this lesson, you will discover statistical methods that may be used when your data does not come from a Gaussian distribution.

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Data in which the distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

Before a nonparametric statistical method can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its name suggests.

The procedure is as follows:

- Sort all data in the sample in ascending order.
- Assign an integer rank from 1 to N for each unique value in the data sample.

A widely used nonparametric statistical hypothesis test for checking for a difference between two independent samples is the Mann-Whitney U test, named for Henry Mann and Donald Whitney.

It is the nonparametric equivalent of the Student’s t-test but does not assume that the data is drawn from a Gaussian distribution.

The test can be implemented in Python via the *mannwhitneyu()* SciPy function.

The example below demonstrates the test on two data samples drawn from a uniform distribution known to be different.

# example of the mann-whitney u test from numpy.random import seed from numpy.random import rand from scipy.stats import mannwhitneyu # seed the random number generator seed(1) # generate two independent samples data1 = 50 + (rand(100) * 10) data2 = 51 + (rand(100) * 10) # compare samples stat, p = mannwhitneyu(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p)) # interpret alpha = 0.05 if p > alpha: print('Same distribution (fail to reject H0)') else: print('Different distribution (reject H0)')

Run the example and review the calculated statistics and interpretation of the p-value.

For this lesson, you must list three additional nonparametric statistical methods.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of statistics in applied machine learning.
- A concise definition of statistics and a division of methods into two main types.
- The Gaussian distribution and how to describe data with this distribution using statistics.
- How to quantify the relationship between the samples of two variables.
- How to check for the difference between two samples using statistical hypothesis tests.
- An alternative to statistical hypothesis tests called estimation statistics.
- Nonparametric methods that can be used when data is not drawn from the Gaussian distribution.

This is just the beginning of your journey with statistics for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Statistical Methods for Machine Learning.

How did you do with the mini-course?

Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

The post Statistics for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>Because you may use this test yourself someday, it is important to have a deep understanding of how the test works. As a developer, this understanding is best achieved by implementing the hypothesis test yourself from scratch.

In this tutorial, you will discover how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

After completing this tutorial, you will know:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Student’s t-Test
- Student’s t-Test for Independent Samples
- Student’s t-Test for Dependent Samples

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Student’s t-Test is a statistical hypothesis test for testing whether two samples are expected to have been drawn from the same population.

It is named for the pseudonym “*Student*” used by William Gosset, who developed the test.

The test works by checking the means from two samples to see if they are significantly different from each other. It does this by calculating the standard error in the difference between means, which can be interpreted to see how likely the difference is, if the two samples have the same mean (the null hypothesis).

The t statistic calculated by the test can be interpreted by comparing it to critical values from the t-distribution. The critical value can be calculated using the degrees of freedom and a significance level with the percent point function (PPF).

We can interpret the statistic value in a two-tailed test, meaning that if we reject the null hypothesis, it could be because the first mean is smaller or greater than the second mean. To do this, we can calculate the absolute value of the test statistic and compare it to the positive (right tailed) critical value, as follows:

**If abs(t-statistic) <= critical value**: Accept null hypothesis that the means are equal.**If abs(t-statistic) > critical value**: Reject the null hypothesis that the means are equal.

We can also retrieve the cumulative probability of observing the absolute value of the t-statistic using the cumulative distribution function (CDF) of the t-distribution in order to calculate a p-value. The p-value can then be compared to a chosen significance level (alpha) such as 0.05 to determine if the null hypothesis can be rejected:

**If p > alpha**: Accept null hypothesis that the means are equal.**If p <= alpha**: Reject null hypothesis that the means are equal.

In working with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. The test also assumes that the samples have the same variance, and the same size, although there are corrections to the test if these assumptions do not hold. For example, see Welch’s t-test.

There are two main versions of Student’s t-test:

**Independent Samples**. The case where the two samples are unrelated.**Dependent Samples**. The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

Both the independent and the dependent Student’s t-tests are available in Python via the ttest_ind() and ttest_rel() SciPy functions respectively.

**Note**: I recommend using these SciPy functions to calculate the Student’s t-test for your applications, if they are suitable. The library implementations will be faster and less prone to bugs. I would only recommend implementing the test yourself for learning purposes or in the case where you require a modified version of the test.

We will use the SciPy functions to confirm the results from our own version of the tests.

Note, for reference, all calculations presented in this tutorial are taken directly from Chapter 9 “*t Tests*” in “Statistics in Plain English“, Third Edition, 2010. I mention this because you may see the equations with different forms, depending on the reference text that you use.

We’ll start with the most common form of the Student’s t-test: the case where we are comparing the means of two independent samples.

The calculation of the t-statistic for two independent samples is as follows:

t = observed difference between sample means / standard error of the difference between the means

or

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

The standard error of the difference between the means can be calculated as follows:

sed = sqrt(se1^2 + se2^2)

Where *se1* and *se2* are the standard errors for the first and second datasets.

The standard error of a sample can be calculated as:

se = std / sqrt(n)

Where *se* is the standard error of the sample, *std* is the sample standard deviation, and *n* is the number of observations in the sample.

These calculations make the following assumptions:

- The samples are drawn from a Gaussian distribution.
- The size of each sample is approximately equal.
- The samples have the same variance.

We can implement these equations easily using functions from the Python standard library, NumPy and SciPy.

Let’s assume that our two data samples are stored in the variables *data1* and *data2*.

We can start off by calculating the mean for these samples as follows:

# calculate means mean1, mean2 = mean(data1), mean(data2)

We’re halfway there.

Now we need to calculate the standard error.

We can do this manually, first by calculating the sample standard deviations:

# calculate sample standard deviations std1, std2 = std(data1, ddof=1), std(data2, ddof=1)

And then the standard errors:

# calculate standard errors n1, n2 = len(data1), len(data2) se1, se2 = std1/sqrt(n1), std2/sqrt(n2)

Alternately, we can use the *sem()* SciPy function to calculate the standard error directly.

# calculate standard errors se1, se2 = sem(data1), sem(data2)

We can use the standard errors of the samples to calculate the “*standard error on the difference between the samples*“:

# standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0)

We can now calculate the t statistic:

# calculate the t statistic t_stat = (mean1 - mean2) / sed

We can also calculate some other values to help interpret and present the statistic.

The number of degrees of freedom for the test is calculated as the sum of the observations in both samples, minus two.

# degrees of freedom df = n1 + n2 - 2

The critical value can be calculated using the percent point function (PPF) for a given significance level, such as 0.05 (95% confidence).

This function is available for the t distribution in SciPy, as follows:

# calculate the critical value alpha = 0.05 cv = t.ppf(1.0 - alpha, df)

The p-value can be calculated using the cumulative distribution function on the t-distribution, again in SciPy.

# calculate the p-value p = (1 - t.cdf(abs(t_stat), df)) * 2

Here, we assume a two-tailed distribution, where the rejection of the null hypothesis could be interpreted as the first mean is either smaller or larger than the second mean.

We can tie all of these pieces together into a simple function for calculating the t-test for two independent samples:

# function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section we will calculate the t-test on some synthetic data samples.

First, let’s generate two samples of 100 Gaussian random numbers with the same variance of 5 and differing means of 50 and 51 respectively. We will expect the test to reject the null hypothesis and find a significant difference between the samples:

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

We can calculate the t-test on these samples using the built in SciPy function *ttest_ind()*. This will give us a t-statistic value and a p-value to compare to, to ensure that we have implemented the test correctly.

The complete example is listed below.

# Student's t-test for independent samples from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_ind # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_ind(data1, data2) print('t=%.3f, p=%.3f' % (stat, p))

Running the example, we can see a t-statistic value and p value.

We will use these as our expected values for the test on these data.

t=-2.262, p=0.025

We can now apply our own implementation on the same data, using the function defined in the previous section.

The function will return a t-statistic value and a critical value. We can use the critical value to interpret the t statistic to see if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

The function also returns a p-value. We can interpret the p-value using an alpha, such as 0.05 to determine if the finding of the test is significant and that indeed the means are different as we expected.

# interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

We expect that both interpretations will always match.

The complete example is listed below.

# t-test for independent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import sem from scipy.stats import t # function for calculating the t-test for two independent samples def independent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # calculate standard errors se1, se2 = sem(data1), sem(data2) # standard error on the difference between the samples sed = sqrt(se1**2.0 + se2**2.0) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = len(data1) + len(data2) - 2 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = independent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example first calculates the test.

The results of the test are printed, including the t-statistic, the degrees of freedom, the critical value, and the p-value.

We can see that both the t-statistic and p-value match the outputs of the SciPy function. The test appears to be implemented correctly.

The t-statistic and the p-value are then used to interpret the results of the test. We find that as we expect, there is sufficient evidence to reject the null hypothesis, finding that the sample means are likely different.

t=-2.262, df=198, cv=1.653, p=0.025 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

We can now look at the case of calculating the Student’s t-test for dependent samples.

This is the case where we collect some observations on a sample from the population, then apply some treatment, and then collect observations from the same sample.

The result is two samples of the same size where the observations in each sample are related or paired.

The t-test for dependent samples is referred to as the paired Student’s t-test.

The calculation of the paired Student’s t-test is similar to the case with independent samples.

The main difference is in the calculation of the denominator.

t = (mean(X1) - mean(X2)) / sed

Where *X1* and *X2* are the first and second data samples and *sed* is the standard error of the difference between the means.

Here, *sed* is calculated as:

sed = sd / sqrt(n)

Where *sd* is the standard deviation of the difference between the dependent sample means and *n* is the total number of paired observations (e.g. the size of each sample).

The calculation of *sd* first requires the calculation of the sum of the squared differences between the samples:

d1 = sum (X1[i] - X2[i])^2 for i in n

It also requires the sum of the (non squared) differences between the samples:

d2 = sum (X1[i] - X2[i]) for i in n

We can then calculate sd as:

sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

That’s it.

We can implement the calculation of the paired Student’s t-test directly in Python.

The first step is to calculate the means of each sample.

# calculate means mean1, mean2 = mean(data1), mean(data2)

Next, we will require the number of pairs (*n*). We will use this in a few different calculations.

# number of paired samples n = len(data1)

Next, we must calculate the sum of the squared differences between the samples, as well as the sum differences.

# sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)])

We can now calculate the standard deviation of the difference between means.

# standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1))

This is then used to calculate the standard error of the difference between the means.

# standard error of the difference between the means sed = sd / sqrt(n)

Finally, we have everything we need to calculate the t statistic.

# calculate the t statistic t_stat = (mean1 - mean2) / sed

The only other key difference between this implementation and the implementation for independent samples is the calculation of the number of degrees of freedom.

# degrees of freedom df = n - 1

As before, we can tie all of this together into a reusable function. The function will take two paired samples and a significance level (alpha) and calculate the t-statistic, number of degrees of freedom, critical value, and p-value.

The complete function is listed below.

# function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p

In this section, we will use the same dataset in the worked example as we did for the independent Student’s t-test.

The data samples are not paired, but we will pretend they are. We expect the test to reject the null hypothesis and find a significant difference between the samples.

# seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51

As before, we can evaluate the test problem with the SciPy function for calculating a paired t-test. In this case, the *ttest_rel()* function.

The complete example is listed below.

# Paired Student's t-test from numpy.random import seed from numpy.random import randn from scipy.stats import ttest_rel # seed the random number generator seed(1) # generate two independent samples data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # compare samples stat, p = ttest_rel(data1, data2) print('Statistics=%.3f, p=%.3f' % (stat, p))

Running the example calculates and prints the t-statistic and the p-value.

We will use these values to validate the calculation of our own paired t-test function.

Statistics=-2.372, p=0.020

We can now test our own implementation of the paired Student’s t-test.

The complete example, including the developed function and interpretation of the results of the function, is listed below.

# t-test for dependent samples from math import sqrt from numpy.random import seed from numpy.random import randn from numpy import mean from scipy.stats import t # function for calculating the t-test for two dependent samples def dependent_ttest(data1, data2, alpha): # calculate means mean1, mean2 = mean(data1), mean(data2) # number of paired samples n = len(data1) # sum squared difference between observations d1 = sum([(data1[i]-data2[i])**2 for i in range(n)]) # sum difference between observations d2 = sum([data1[i]-data2[i] for i in range(n)]) # standard deviation of the difference between means sd = sqrt((d1 - (d2**2 / n)) / (n - 1)) # standard error of the difference between the means sed = sd / sqrt(n) # calculate the t statistic t_stat = (mean1 - mean2) / sed # degrees of freedom df = n - 1 # calculate the critical value cv = t.ppf(1.0 - alpha, df) # calculate the p-value p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0 # return everything return t_stat, df, cv, p # seed the random number generator seed(1) # generate two independent samples (pretend they are dependent) data1 = 5 * randn(100) + 50 data2 = 5 * randn(100) + 51 # calculate the t test alpha = 0.05 t_stat, df, cv, p = dependent_ttest(data1, data2, alpha) print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p)) # interpret via critical value if abs(t_stat) <= cv: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.') # interpret via p-value if p > alpha: print('Accept null hypothesis that the means are equal.') else: print('Reject the null hypothesis that the means are equal.')

Running the example calculates the paired t-test on the sample problem.

The calculated t-statistic and p-value match what we expect from the SciPy library implementation. This suggests that the implementation is correct.

The interpretation of the t-test statistic with the critical value, and the p-value with the significance level both find a significant result, rejecting the null hypothesis that the means are equal.

t=-2.372, df=99, cv=1.660, p=0.020 Reject the null hypothesis that the means are equal. Reject the null hypothesis that the means are equal.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Apply each test to your own contrived sample problem.
- Update the independent test and add the correction for samples with different variances and sample sizes.
- Perform a code review of one of the tests implemented in the SciPy library and summarize the differences in the implementation details.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Statistics in Plain English, Third Edition, 2010.

In this tutorial, you discovered how to implement the Student’s t-test statistical hypothesis test from scratch in Python.

Specifically, you learned:

- The Student’s t-test will comment on whether it is likely to observe two samples given that the samples were drawn from the same population.
- How to implement the Student’s t-test from scratch for two independent samples.
- How to implement the paired Student’s t-test from scratch for two dependent samples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Code the Student’s t-Test from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers appeared first on Machine Learning Mastery.

]]>In his widely cited 1998 paper, Thomas Dietterich recommended the McNemar’s test in those cases where it is expensive or impractical to train multiple copies of classifier models.

This describes the current situation with deep learning models that are both very large and are trained and evaluated on large datasets, often requiring days or weeks to train a single model.

In this tutorial, you will discover how to use the McNemar’s statistical hypothesis test to compare machine learning classifier models on a single test dataset.

After completing this tutorial, you will know:

- The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
- How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
- How to calculate the McNemar’s test in Python and interpret and report the result.

Let’s get started.

This tutorial is divided into five parts; they are:

- Statistical Hypothesis Tests for Deep Learning
- Contingency Table
- McNemar’s Test Statistic
- Interpret the McNemar’s Test for Classifiers
- McNemar’s Test in Python

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In his important and widely cited 1998 paper on the use of statistical hypothesis tests to compare classifiers titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms“, Thomas Dietterich recommends the use of the McNemar’s test.

Specifically, the test is recommended in those cases where the algorithms that are being compared can only be evaluated once, e.g. on one test set, as opposed to repeated evaluations via a resampling technique, such as k-fold cross-validation.

For algorithms that can be executed only once, McNemar’s test is the only test with acceptable Type I error.

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

Specifically, Dietterich’s study was concerned with the evaluation of different statistical hypothesis tests, some operating upon the results from resampling methods. The concern of the study was low Type I error, that is, the statistical test reporting an effect when in fact no effect was present (false positive).

Statistical tests that can compare models based on a single test set is an important consideration for modern machine learning, specifically in the field of deep learning.

Deep learning models are often large and operate on very large datasets. Together, these factors can mean that the training of a model can take days or even weeks on fast modern hardware.

This precludes the practical use of resampling methods to compare models and suggests the need to use a test that can operate on the results of evaluating trained models on a single test dataset.

The McNemar’s test may be a suitable test for evaluating these large and slow-to-train deep learning models.

The McNemar’s test operates upon a contingency table.

Before we dive into the test, let’s take a moment to understand how the contingency table for two classifiers is calculated.

A contingency table is a tabulation or count of two categorical variables. In the case of the McNemar’s test, we are interested in binary variables correct/incorrect or yes/no for a control and a treatment or two cases. This is called a 2×2 contingency table.

The contingency table may not be intuitive at first glance. Let’s make it concrete with a worked example.

Consider that we have two trained classifiers. Each classifier makes binary class prediction for each of the 10 examples in a test dataset. The predictions are evaluated and determined to be correct or incorrect.

We can then summarize these results in a table, as follows:

Instance, Classifier1 Correct, Classifier2 Correct 1 Yes No 2 No No 3 No Yes 4 No No 5 Yes Yes 6 Yes Yes 7 Yes Yes 8 No No 9 Yes No 10 Yes Yes

We can see that Classifier1 got 6 correct, or an accuracy of 60%, and Classifier2 got 5 correct, or 50% accuracy on the test set.

The table can now be reduced to a contingency table.

The contingency table relies on the fact that both classifiers were trained on exactly the same training data and evaluated on exactly the same test data instances.

The contingency table has the following structure:

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct ?? ?? Classifier1 Incorrect ?? ??

In the case of the first cell in the table, we must sum the total number of test instances that Classifier1 got correct and Classifier2 got correct. For example, the first instance that both classifiers predicted correctly was instance number 5. The total number of instances that both classifiers predicted correctly was 4.

Another more programmatic way to think about this is to sum each combination of Yes/No in the results table above.

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct Yes/Yes Yes/No Classifier1 Incorrect No/Yes No/No

The results organized into a contingency table are as follows:

Classifier2 Correct, Classifier2 Incorrect Classifier1 Correct 4 2 Classifier1 Incorrect 1 3

McNemar’s test is a paired nonparametric or distribution-free statistical hypothesis test.

It is also less intuitive than some other statistical hypothesis tests.

The McNemar’s test is checking if the disagreements between two cases match. Technically, this is referred to as the homogeneity of the contingency table (specifically the marginal homogeneity). Therefore, the McNemar’s test is a type of homogeneity test for contingency tables.

The test is widely used in medicine to compare the effect of a treatment against a control.

In terms of comparing two binary classification algorithms, the test is commenting on whether the two models disagree in the same way (or not). It is not commenting on whether one model is more or less accurate or error prone than another. This is clear when we look at how the statistic is calculated.

The McNemar’s test statistic is calculated as:

statistic = (Yes/No - No/Yes)^2 / (Yes/No + No/Yes)

Where Yes/No is the count of test instances that Classifier1 got correct and Classifier2 got incorrect, and No/Yes is the count of test instances that Classifier1 got incorrect and Classifier2 got correct.

This calculation of the test statistic assumes that each cell in the contingency table used in the calculation has a count of at least 25. The test statistic has a Chi-Squared distribution with 1 degree of freedom.

We can see that only two elements of the contingency table are used, specifically that the Yes/Yes and No/No elements are not used in the calculation of the test statistic. As such, we can see that the statistic is reporting on the different correct or incorrect predictions between the two models, not the accuracy or error rates. This is important to understand when making claims about the finding of the statistic.

The default assumption, or null hypothesis, of the test is that the two cases disagree to the same amount. If the null hypothesis is rejected, it suggests that there is evidence to suggest that the cases disagree in different ways, that the disagreements are skewed.

Given the selection of a significance level, the p-value calculated by the test can be interpreted as follows:

**p > alpha**: fail to reject H0, no difference in the disagreement (e.g. treatment had no effect).**p <= alpha**: reject H0, significant difference in the disagreement (e.g. treatment had an effect).

It is important to take a moment to clearly understand how to interpret the result of the test in the context of two machine learning classifier models.

The two terms used in the calculation of the McNemar’s Test capture the errors made by both models. Specifically, the No/Yes and Yes/No cells in the contingency table. The test checks if there is a significant difference between the counts in these two cells. That is all.

If these cells have counts that are similar, it shows us that both models make errors in much the same proportion, just on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected.

Under the null hypothesis, the two algorithms should have the same error rate …

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

If these cells have counts that are not similar, it shows that both models not only make different errors, but in fact have a different relative proportion of errors on the test set. In this case, the result of the test would be significant and we would reject the null hypothesis.

So we may reject the null hypothesis in favor of the hypothesis that the two algorithms have different performance when trained on the particular training

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

We can summarize this as follows:

**Fail to Reject Null Hypothesis**: Classifiers have a similar proportion of errors on the test set.**Reject Null Hypothesis**: Classifiers have a different proportion of errors on the test set.

After performing the test and finding a significant result, it may be useful to report an effect statistical measure in order to quantify the finding. For example, a natural choice would be to report the odds ratios, or the contingency table itself, although both of these assume a sophisticated reader.

It may be useful to report the difference in error between the two classifiers on the test set. In this case, be careful with your claims as the significant test does not report on the difference in error between the models, only the relative difference in the proportion of error between the models.

Finally, in using the McNemar’s test, Dietterich highlights two important limitations that must be considered. They are:

Generally, model behavior varies based on the specific training data used to fit the model.

This is due to both the interaction of the model with specific training instances and the use of randomness during learning. Fitting the model on multiple different training datasets and evaluating the skill, as is done with resampling methods, provides a way to measure the variance of the model.

The test is appropriate if the sources of variability are small.

Hence, McNemar’s test should only be applied if we believe these sources of variability are small.

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithm, 1998.

The two classifiers are evaluated on a single test set, and the test set is expected to be smaller than the training set.

This is different from hypothesis tests that make use of resampling methods as more, if not all, of the dataset is made available as a test set during evaluation (which introduces its own problems from a statistical perspective).

This provides less of an opportunity to compare the performance of the models. It requires that the test set is an appropriately representative of the domain, often meaning that the test dataset is large.

The McNemar’s test can be implemented in Python using the mcnemar() Statsmodels function.

The function takes the contingency table as an argument and returns the calculated test statistic and p-value.

There are two ways to use the statistic depending on the amount of data.

If there is a cell in the table that is used in the calculation of the test statistic that has a count of less than 25, then a modified version of the test is used that calculates an exact p-value using a binomial distribution. This is the default usage of the test:

stat, p = mcnemar(table, exact=True)

Alternately, if all cells used in the calculation of the test statistic in the contingency table have a value of 25 or more, then the standard calculation of the test can be used.

stat, p = mcnemar(table, exact=False, correction=True)

We can calculate the McNemar’s on the example contingency table described above. This contingency table has a small count in both the disagreement cells and as such the exact method must be used.

The complete example is listed below.

# Example of calculating the mcnemar test from statsmodels.stats.contingency_tables import mcnemar # define contingency table table = [[4, 2], [1, 3]] # calculate mcnemar test result = mcnemar(table, exact=True) # summarize the finding print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue)) # interpret the p-value alpha = 0.05 if result.pvalue > alpha: print('Same proportions of errors (fail to reject H0)') else: print('Different proportions of errors (reject H0)')

Running the example calculates the statistic and p-value on the contingency table and prints the results.

We can see that the test strongly confirms that there is very little difference in the disagreements between the two cases. The null hypothesis not rejected.

As we are using the test to compare classifiers, we state that there is no statistically significant difference in the disagreements between the two models.

statistic=1.000, p-value=1.000 Same proportions of errors (fail to reject H0)

This section lists some ideas for extending the tutorial that you may wish to explore.

- Find a research paper in machine learning that makes use of the McNemar’s statistical hypothesis test.
- Update the code example such that the contingency table shows a significant difference in disagreement between the two cases.
- Implement a function that will use the correct version of the McNemar’s test based on the provided contingency table.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Note on the sampling error of the difference between correlated proportions or percentages, 1947.
- Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998.

In this tutorial, you discovered how to use the McNemar’s test statistical hypothesis test to compare machine learning classifier models on a single test dataset.

Specifically, you learned:

- The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
- How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
- How to calculate the McNemar’s test in Python and interpret and report the result.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers appeared first on Machine Learning Mastery.

]]>The post The Role of Randomization to Address Confounding Variables in Machine Learning appeared first on Machine Learning Mastery.

]]>A challenge is that there are aspects of the problem and the algorithm called confounding variables that cannot be controlled (held constant) and must be controlled-for. An example is the use of randomness in a learning algorithm, such as random initialization or random choices during learning.

The solution is to use randomness in a way that has become a standard in applied machine learning. We can learn more about the rationale for using randomness in controlled experiments by looking briefly at why randomness is used to manage confounding variables in medicine through the use of randomized clinical trials.

In this post, you will discover confounding variables and how we can address them using the tool of randomization.

After reading this post, you will know:

- Confounding variables correlated with the independent and dependent variable confuse the effects and impact the results of experiments.
- Applied machine learning is concerned with controlled experiments that do suffer known confounding variables.
- Randomization of experiments is the key to controlling for confounding variables in machine learning experiments.

Let’s get started.

This post is divided into four parts;l they are:

- Confounding Variables
- Confounding in Machine Learning
- Randomization of Experiments
- Randomization in Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In an experiment, we are often interested in the effect of an independent variable on a dependent variable.

A confounding variable is a variable that confuses the relationship between the independent and the dependent variable.

Confounding, sometimes referred to as confounding bias, is mostly described as a ‘mixing’ or ‘blurring’ of effects.

— Confounding: What it is and how to deal with it, 2008.

A confounding variable can influence the outcome of an experiment in many ways, such as:

- Invalid correlations.
- Increasing variance.
- Introducing a bias.

A confounding variable may be known or unknown.

They are often characterized as having an association or correlation with both the independent and dependent variables.

Another characterization is that the confounding variable affects groups or observations differently.

Confounding variables or confounders are often defined as the variables correlate (positively or negatively) with both the dependent variable and the independent variable. A Confounder is an extraneous variable whose presence affects the variables being studied so that the results do not reflect the actual relationship between the variables under study.

— How to control confounding effects by statistical analysis, 2012.

The deeper difficulty of confounding variables is that it may not be obvious that they exist and are impacting results.

The effects of confounding variables are often not obvious or even identifiable unless they are specifically addressed in the design of the experiment or data collection method.

Confounding variables are traditionally a concern in applied statistics.

This is because in statistics we are often concerned with the effect of independent variables on dependent variables in data. Statistical methods are designed to discover and describe these relationships and confounding variables can essentially corrupt or invalidate discoveries.

Machine learning practitioners are typically interested in the skill of a predictive model and less concerned with the statistical correctness or interpretability of the model. As such, confounding variables are an important topic when it comes to data selection and preparation, but less important than they may be when developing descriptive statistical models.

Nevertheless, confounding variables are critically important in applied machine learning.

The evaluation of a machine learning model is an experiment with independent and dependent variables. As such, it is subject to confounding variables.

What may be surprising is that you already know this and that the gold-standard practices in applied machine learning address this. Therefore, being intimately aware of the confounding variables in machine learning experiments is required to understand the choice and interpretation of machine learning model evaluation.

Consider, what impacts the evaluation of a machine learning model, what are the independent variables?

Some examples include:

- The choice of data preparation schemes.
- The choice of the samples in the training dataset.
- The choice of the samples in the test dataset.
- The choice of learning algorithm.
- The choice of the initialization of the learning algorithm.
- The choice of the configuration of the learning algorithm.

Each of these choices will impact the dependent variable in a machine learning experiment, which is the chosen metric used to estimate the skill of the model when making predictions.

The evaluation of a machine learning model involves the design and execution of controlled experiments. A controlled experiment holds all elements constant except one element under study. The two most common types of controlled experiments in machine learning are:

- Controlled experiments to vary and evaluate learning algorithms.
- Controlled experiments to vary and evaluate learning algorithm configurations.

Nevertheless, there are confounding variables that the controlled experiments cannot hold constant. Specifically, there are sources of randomness, that if they were held constant would result in an invalid evaluation of the model. Three examples include:

- Randomness in the data sample.
- Randomness in model initialization.
- Randomness in the learning algorithm.

For example, weights in a neural network are initialized to random values. Stochastic gradient descent randomizes the order of samples in an epoch to vary the types of updates performed. Random subsets of features are selected for each possible cut point in random forest. And many more examples.

Randomization in machine learning algorithms is not a bug; it is a feature intended to improve the performance of the model on average over classical deterministic methods.

Randomness can be present in ML at many different levels, usually enhancing performance or alleviating problems and difficulties of classical methods.

— Randomized Machine Learning Approaches: Recent Developments and Challenges, 2017.

These are confounding variables that we cannot hold constant. If they are held constant, the evaluation of the model will no longer provide insight into the generalizability of the result. We will know how well the model performs on a specific data sample or initialization of sequence of decisions during learning, but little idea on how the model will perform in general.

The way that we can handle confounding variables that we cannot control is by using randomization.

Randomization is a technique used in experimental design to give control over confounding variables that cannot (should not) be held constant.

For example, randomization is used in clinical experiments to control-for the biological differences between individual human beings when evaluating a treatment. It is the reason why a treatment must be evaluated on multiple individuals rather than on a single individual before the findings can be generalized.

In randomization the random assignment of study subjects to exposure categories to breaking any links between exposure and confounders. This reduces potential for confounding by generating groups that are fairly comparable with respect to known and unknown confounding variables.

— How to control confounding effects by statistical analysis, 2012.

Randomization is a simple tool in experimental design that allows the confounding variables to have their effect across a sample. It shifts the experiment from looking at an individual case to a collection of observations, where statistical tools are used to interpret the finding.

In medicine, randomization is the gold standard for evaluating a treatment and is called the randomized clinical trial. It is designed to remove not only the confounding effects of biological differences, but also the bias, such as the effect of the experimenter choosing the members of the treatment and non-treatment groups. You can imagine that a treatment would look very successful if the least-sick members of a cohort were chosen to be administered.

An [Randomized clinical trial] is a special kind of cohort study, with the characteristic that patients are randomly assigned to the experimental group (with exposure) and the control group (without exposure). […] Therefore, randomization helps to prevent selection by the clinician, and helps to establish groups that are equal with respect to relevant prognostic factors.

— The randomized clinical trial: An unbeatable standard in clinical research?, 2007.

There are still confounding variables when using a randomized clinical trial. An example is the case where the experimenters know what treatment participants of the study are receiving. This can impact the way the experimenters interact with the participants, which in turn can impact the results of the experiment.

The answer is to use blinding where participants or experimenters do not know the treatment. Ideally, a double-blind experiment is adopted, ensuring that both participates and experimenters are unaware of their treatment.

When feasible, it is strongly recommended that also after randomization, patients and clinicians do not know who receives the intervention and who does not. Studies may be single blind (either the patient or the clinician does not know who receives the treatment and who does not) or double blind (both the patient and the clinician do not know who receives the treatment).

— The randomized clinical trial: An unbeatable standard in clinical research?, 2007.

Note, before we move on to look at the use of randomization in machine learning, consider that there are other approaches to managing the effect of confounding variables. Wikipedia has a good list here.

Randomization is used in the evaluation of machine learning models to manage the uncontrollable confounding variables.

It is key to the standard ways described for evaluating machine learning models and the rationale for using methods such as data resampling and repeating experiments.

- Resampling methods are used to randomize the training and test datasets to help estimate the skill of training and evaluating models on random samples of data from the domain, rather than on a specific sample of data.
- Evaluation experiments are repeated to help estimate the skill of the model with different random initialization and learning decisions, rather than on a single set of initial conditions and sequence of learning decisions.

Randomization allows the machine learning practitioner to generalize a finding, to make it useful and applicable. It’s the reason why careful design of the test harness and resampling method is important. It is the reason why we repeat the evaluation of a model and the reason we don’t fix the seed on the pseudorandom number generator.

I talk more about these topics in the posts:

When we take a closer look at why we use randomization, to control for confounding variables, it raises questions about the other confounders that we may not be controlling for.

For example, the machine learning practitioner knowing the skill of models prior to giving each model a chance to do its best via data preparation and hyperparameter tuning. Perhaps practitioners should blind themselves to remove the possibility of biasing the choice of final model.

The risk is that the practitioner that really likes artificial neural networks will “*discover*” a neural network configuration that outperforms other models.

At best it is a statistical fluke or violation of Occam’s Razor for a parsimonious solution to a predictive modeling project; at worst, it is scientific fraud. The reason that clinicians aggressively removed this bias is people’s lives were at risk. We may get to that point with machine learning algorithms, e.g. in cars.

In practice, today, I think this is good motivation for front-loading an experiment with a large and careful design and automating the execution and statistical interpretation of the results.

This section provides more resources on the topic if you are looking to go deeper.

- Confounding on Wikipedia
- Controlling for a variable on Wikipedia
- Randomized controlled trial on Wikipedia
- The randomized clinical trial: An unbeatable standard in clinical research?, 2007.
- Confounding: What it is and how to deal with it, 2008.
- Confounding variables in machine learning predictions? on Cross Validated
- How to control confounding effects by statistical analysis, 2012.
- Randomized Machine Learning Approaches: Recent Developments and Challenges, 2017.

In this post, you discovered confounding variables and how we can address them using the tool of randomization.

Specifically, you learned:

- Confounding variables correlated with the independent and dependent variable and confuse the effects and impact the results of experiments.
- Applied machine learning is concerned with controlled experiments that do suffer known confounding variables.
- Randomization of experiments is the key to controlling for confounding variables in machine learning experiments.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Role of Randomization to Address Confounding Variables in Machine Learning appeared first on Machine Learning Mastery.

]]>The post All of Statistics for Machine Learning appeared first on Machine Learning Mastery.

]]>The book “All of Statistics” was written specifically to provide a foundation in probability and statistics for computer science undergraduates that may have an interest in data mining and machine learning. As such, it is often recommended as a book to machine learning practitioners interested in expanding their understanding of statistics.

In this post, you will discover the book “All of Statistics”, the topics it covers, and a reading list intended for machine learning practitioners.

After reading this post, you will know:

- Larry Wasserman wrote “
*All of Statistics*” to quickly bring computer science students up to speed with probability and statistics. - The book provides a broad coverage of the field of statistics with a focus on the mathematical presentation of the topics covered.
- The book covers much more than is required by machine learning practitioners, but a select reading of topics will be helpful for those that prefer a mathematical treatment.

Let’s get started.

The book “All of Statistics: A Concise Course in Statistical Inference” was written by Larry Wasserman and released in 2004.

Wasserman is a professor of statistics and data science at Carnegie Mellon University.

The book is ambitious.

It seeks to quickly bring computer science students up-to-speed with probability and statistics. As such, the topics covered by the book are very broad, perhaps broader than the average introductory textbooks.

Taken literally, the title “All of Statistics” is an exaggeration. But in spirit, the title is apt, as the book does cover a much broader range of topics than a typical introductory book on mathematical statistics. This book is for people who want to learn probability and statistics quickly.

— Page vii, All of Statistics: A Concise Course in Statistical Inference, 2004.

The book is not for the average practitioner; it is intended for computer science undergraduate students. It does assume some prior knowledge in calculus and linear algebra. If you don’t like equations or mathematical notation, this book is not for you.

Interestingly, Wasserman wrote the book in response to the rise of data mining and machine learning in computer science occurring outside of classical statistics. He asserts in the preface the importance of having a grounding in statistics in order to be effective in machine learning.

Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid.

— Pages vii-viii, All of Statistics: A Concise Course in Statistical Inference, 2004.

The material is presented in a very clear and concise manner. A systematic approach is taken with brief descriptions of a method, equations describing its implementation, and worked examples to motivate the use of the method with sample code in R.

In fact, the material is so compact that it often reads like a series of encyclopedia examples. This is great if you want to know how to implement a method, but very challenging if you are new to the methods and seeking intuitions.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The choice of topics covered by the book is very broad, as mentioned in the previous section.

This is great on the one hand as the reader is given exposure to advanced subjects early on. The downside of this aggressive scope is that topics are touched on briefly with very little hand holding. You are left to re-read sections until you get it.

Let’s look at the topics covered by the book.

This is helpful to both get an idea of the presented scope of the field and the context for the topics that may interest you as a machine learning practitioner.

The book is divided into three parts; they are:

- I Probability
- II Statistical Inference
- III Statistical Models and Methods

The first part of the book focuses on probability theory and formal language for describing uncertainty. The second part is focused on statistical inference. The third part focuses on specific methods and problems raised in the second part.

The book does have a reference or encyclopedia feeling. As such, there are a lot of chapters, but each chapter is reasonably standalone. The book is divided into 24 chapters; they are:

- Chapter 1: Probability
- Chapter 2: Random Variables
- Chapter 3: Expectation
- Chapter 4: Inequalities
- Chapter 5: Convergence of Random Variables
- Chapter 6: Models, Statistical Inference and Learning
- Chapter 7: Estimating the CDF and Statistical Functions
- Chapter 8: The Bootstrap
- Chapter 9: Parametric Inference
- Chapter 10: Hypothesis Testing and p-values
- Chapter 11: Bayesian Inference
- Chapter 12: Statistical Decision Theory
- Chapter 13: Linear and Logistic Regression
- Chapter 14: Multivariate Models
- Chapter 15: Inference About Independence
- Chapter 16: Causal Inference
- Chapter 17: Directed Graphs and Conditional Independence
- Chapter 18: Undirected Graphs
- Chapter 19: Log-Linear Models
- Chapter 20: Nonparametric Curve Estimation
- Chapter 21: Smoothing Using Orthogonal Functions
- Chapter 22: Classification
- Chapter 23: Probability Redux: Stochastic Processes

Chapter 24: Simulation Methods

The preface for the book provides a useful glossary of terms mapping them from statistics to computer science. This “Statistics/Data Mining Dictionary” is reproduced below.

All of the R code and datasets used in the worked examples in the book are available from Wasserman’s homepage. This is very helpful as you can focus on experimenting with the examples rather than typing in the code and hoping that you got the syntax correct.

I would not recommend this book to developers who have not touched statistics before. It’s too challenging.

I would recommend this book to computer science students who are in math-learning-mode. I would also recommend it to machine learning practitioners with some previous background in statistics or a strong mathematical foundation.

If you are comfortable with mathematical notation and you know what you’re looking for, this book is an excellent reference. You can flip to the topic or the method and get a crisp presentation.

The problem is, for a machine learning practitioner, you do need to know about many of these topics, just not at the level of detail presented. Perhaps a shade lighter, at the intuition level. If you are up to it, it would be worth reading (or skimming) the following chapters in order to build a solid foundation in probability for statistics:

- Chapter 1: Probability
- Chapter 2: Random Variables
- Chapter 3: Expectation
- Chapter 5: Convergence of Random Variables

Again, these are important topics, but you require a concept-level understanding only.

For coverage of statistical hypothesis tests that you may use to interpret data and compare the skill of models, the following chapters are recommended reading:

- Chapter 6: Models, Statistical Inference and Learning
- Chapter 9: Parametric Inference
- Chapter 10: Hypothesis Testing and p-values

I would also recommend the chapter on the Bootstrap. It’s just a great method to have in your head, but with a focus for either better understanding bagging and random forest or as a procedure for estimating confidence intervals of model skill.

- Chapter 8: The Bootstrap

Finally, a statistical approach is used to present machine learning algorithms. I would recommend these chapters if you prefer a more mathematical treatment of regression and classification algorithms:

- Chapter 12: Statistical Decision Theory
- Chapter 13: Linear and Logistic Regression
- Chapter 22: Classification

I can read the mathematical presentation of statistics, but I prefer intuitions and working code. I am less likely to pick up this book from my bookcase, in favor of gentler treatments such as “Statistics in Plain English” or application focused treatments such as “Empirical Methods for Artificial Intelligence“.

Do you agree with this reading list?

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- All of Statistics: A Concise Course in Statistical Inference, 2004.
- All of Statistics Errata
- Larry Wasserman Homepage

In this post, you discovered the book “*All of Statistics*” that provides a broad and concise introduction to statistics.

Specifically, you learned:

- Larry Wasserman wrote “
*All of Statistics*” to quickly bring computer science students up to speed with probability and statistics. - The book provides a broad coverage of the field of statistics with a focus on the mathematical presentation of the topics covered.
- The book covers much more than is required by machine learning practitioners, but a select reading of topics will be helpful for those that prefer a mathematical treatment.

Have you read this book?

What did you think of it? Let me know in the comments below.

Are you thinking of picking up a copy of this book?

Let me know in the comments.

The post All of Statistics for Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Statistical Power and Power Analysis in Python appeared first on Machine Learning Mastery.

]]>Power can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required in order to detect an effect in an experiment.

In this tutorial, you will discover the importance of the statistical power of a hypothesis test and now to calculate power analyses and power curves as part of experimental design.

After completing this tutorial, you will know:

- Statistical power is the probability of a hypothesis test of finding an effect if there is an effect to be found.
- A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.
- How to calculate and plot power analysis for the Student’s t test in Python in order to effectively design an experiment.

Let’s get started.

This tutorial is divided into four parts; they are:

- Statistical Hypothesis Testing
- What Is Statistical Power?
- Power Analysis
- Student’s t Test Power Analysis

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

A statistical hypothesis test makes an assumption about the outcome, called the null hypothesis.

For example, the null hypothesis for the Pearson’s correlation test is that there is no relationship between two variables. The null hypothesis for the Student’s t test is that there is no difference between the means of two populations.

The test is often interpreted using a p-value, which is the probability of observing the result given that the null hypothesis is true, not the reverse, as is often the case with misinterpretations.

**p-value (p)**: Probability of obtaining a result equal to or more extreme than was observed in the data.

In interpreting the p-value of a significance test, you must specify a significance level, often referred to as the Greek lower case letter alpha (a). A common value for the significance level is 5% written as 0.05.

The p-value is interested in the context of the chosen significance level. A result of a significance test is claimed to be “*statistically significant*” if the p-value is less than the significance level. This means that the null hypothesis (that there is no result) is rejected.

**p <= alpha**: reject H0, different distribution.**p > alpha**: fail to reject H0, same distribution.

Where:

**Significance level (alpha)**: Boundary for specifying a statistically significant finding when interpreting the p-value.

We can see that the p-value is just a probability and that in actuality the result may be different. The test could be wrong. Given the p-value, we could make an error in our interpretation.

There are two types of errors; they are:

**Type I Error**. Reject the null hypothesis when there is in fact no significant effect (false positive). The p-value is optimistically small.**Type II Error**. Not reject the null hypothesis when there is a significant effect (false negative). The p-value is pessimistically large.

In this context, we can think of the significance level as the probability of rejecting the null hypothesis if it were true. That is the probability of making a Type I Error or a false positive.

Statistical power, or the power of a hypothesis test is the probability that the test correctly rejects the null hypothesis.

That is, the probability of a true positive result. It is only useful when the null hypothesis is rejected.

… statistical power is the probability that a test will correctly reject a false null hypothesis. Statistical power has relevance only when the null is false.

— Page 60, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The higher the statistical power for a given experiment, the lower the probability of making a Type II (false negative) error. That is the higher the probability of detecting an effect when there is an effect. In fact, the power is precisely the inverse of the probability of a Type II error.

Power = 1 - Type II Error Pr(True Positive) = 1 - Pr(False Negative)

More intuitively, the statistical power can be thought of as the probability of accepting an alternative hypothesis, when the alternative hypothesis is true.

When interpreting statistical power, we seek experiential setups that have high statistical power.

**Low Statistical Power**: Large risk of committing Type II errors, e.g. a false negative.**High Statistical Power**: Small risk of committing Type II errors.

Experimental results with too low statistical power will lead to invalid conclusions about the meaning of the results. Therefore a minimum level of statistical power must be sought.

It is common to design experiments with a statistical power of 80% or better, e.g. 0.80. This means a 20% probability of encountering a Type II area. This different to the 5% likelihood of encountering a Type I error for the standard value for the significance level.

Statistical power is one piece in a puzzle that has four related parts; they are:

**Effect Size**. The quantified magnitude of a result present in the population. Effect size is calculated using a specific statistical measure, such as Pearson’s correlation coefficient for the relationship between variables or Cohen’s d for the difference between groups.**Sample Size**. The number of observations in the sample.**Significance**. The significance level used in the statistical test, e.g. alpha. Often set to 5% or 0.05.**Statistical Power**. The probability of accepting the alternative hypothesis if it is true.

All four variables are related. For example, a larger sample size can make an effect easier to detect, and the statistical power can be increased in a test by decreasing the significance level.

A power analysis involves estimating one of these four parameters given values for three other parameters. This is a powerful tool in both the design and in the analysis of experiments that we wish to interpret using statistical hypothesis tests.

For example, the statistical power can be estimated given an effect size, sample size and significance level. Alternately, the sample size can be estimated given different desired levels of significance.

Power analysis answers questions like “how much statistical power does my study have?” and “how big a sample size do I need?”.

— Page 56, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

Perhaps the most common use of a power analysis is in the estimation of the minimum sample size required for an experiment.

Power analyses are normally run before a study is conducted. A prospective or a priori power analysis can be used to estimate any one of the four power parameters but is most often used to estimate required sample sizes.

— Page 57, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

As a practitioner, we can start with sensible defaults for some parameters, such as a significance level of 0.05 and a power level of 0.80. We can then estimate a desirable minimum effect size, specific to the experiment being performed. A power analysis can then be used to estimate the minimum sample size required.

In addition, multiple power analyses can be performed to provide a curve of one parameter against another, such as the change in the size of an effect in an experiment given changes to the sample size. More elaborate plots can be created varying three of the parameters. This is a useful tool for experimental design.

We can make the idea of statistical power and power analysis concrete with a worked example.

In this section, we will look at the Student’s t test, which is a statistical hypothesis test for comparing the means from two samples of Gaussian variables. The assumption, or null hypothesis, of the test is that the sample populations have the same mean, e.g. that there is no difference between the samples or that the samples are drawn from the same underlying population.

The test will calculate a p-value that can be interpreted as to whether the samples are the same (fail to reject the null hypothesis), or there is a statistically significant difference between the samples (reject the null hypothesis). A common significance level for interpreting the p-value is 5% or 0.05.

**Significance level (alpha)**: 5% or 0.05.

The size of the effect of comparing two groups can be quantified with an effect size measure. A common measure for comparing the difference in the mean from two groups is the Cohen’s d measure. It calculates a standard score that describes the difference in terms of the number of standard deviations that the means are different. A large effect size for Cohen’s d is 0.80 or higher, as is commonly accepted when using the measure.

**Effect Size**: Cohen’s d of at least 0.80.

We can use the default and assume a minimum statistical power of 80% or 0.8.

**Statistical Power**: 80% or 0.80.

For a given experiment with these defaults, we may be interested in estimating a suitable sample size. That is, how many observations are required from each sample in order to at least detect an effect of 0.80 with an 80% chance of detecting the effect if it is true (20% of a Type II error) and a 5% chance of detecting an effect if there is no such effect (Type I error).

We can solve this using a power analysis.

The statsmodels library provides the TTestIndPower class for calculating a power analysis for the Student’s t test with independent samples. Of note is the TTestPower class that can perform the same analysis for the paired Student’s t test.

The function solve_power() can be used to calculate one of the four parameters in a power analysis. In our case, we are interested in calculating the sample size. We can use the function by providing the three pieces of information we know (*alpha*, *effect*, and *power*) and setting the size of argument we wish to calculate the answer of (*nobs1*) to “*None*“. This tells the function what to calculate.

A note on sample size: the function has an argument called ratio that is the ratio of the number of samples in one sample to the other. If both samples are expected to have the same number of observations, then the ratio is 1.0. If, for example, the second sample is expected to have half as many observations, then the ratio would be 0.5.

The TTestIndPower instance must be created, then we can call the *solve_power()* with our arguments to estimate the sample size for the experiment.

# perform power analysis analysis = TTestIndPower() result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

The complete example is listed below.

# estimate sample size via power analysis from statsmodels.stats.power import TTestIndPower # parameters for power analysis effect = 0.8 alpha = 0.05 power = 0.8 # perform power analysis analysis = TTestIndPower() result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha) print('Sample Size: %.3f' % result)

Running the example calculates and prints the estimated number of samples for the experiment as 25. This would be a suggested minimum number of samples required to see an effect of the desired size.

Sample Size: 25.525

We can go one step further and calculate power curves.

Power curves are line plots that show how the change in variables, such as effect size and sample size, impact the power of the statistical test.

The plot_power() function can be used to create power curves. The dependent variable (x-axis) must be specified by name in the ‘*dep_var*‘ argument. Arrays of values can then be specified for the sample size (*nobs*), effect size (*effect_size*), and significance (*alpha*) parameters. One or multiple curves will then be plotted showing the impact on statistical power.

For example, we can assume a significance of 0.05 (the default for the function) and explore the change in sample size between 5 and 100 with low, medium, and high effect sizes.

# calculate power curves from multiple power analyses analysis = TTestIndPower() analysis.plot_power(dep_var='nobs', nobs=arange(5, 100), effect_size=array([0.2, 0.5, 0.8]))

The complete example is listed below.

# calculate power curves for varying sample and effect size from numpy import array from matplotlib import pyplot from statsmodels.stats.power import TTestIndPower # parameters for power analysis effect_sizes = array([0.2, 0.5, 0.8]) sample_sizes = array(range(5, 100)) # calculate power curves from multiple power analyses analysis = TTestIndPower() analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes) pyplot.show()

Running the example creates the plot showing the impact on statistical power (y-axis) for three different effect sizes (*es*) as the sample size (x-axis) is increased.

We can see that if we are interested in a large effect that a point of diminishing returns in terms of statistical power occurs at around 40-to-50 observations.

Usefully, statsmodels has classes to perform a power analysis with other statistical tests, such as the F-test, Z-test, and the Chi-Squared test.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Plot the power curves of different standard significance levels against the sample size.
- Find an example of a study that reports the statistical power of the experiment.
- Prepare examples of a performance analysis for other statistical tests provided by statsmodels.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.
- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.
- Statistical Power Analysis for the Behavioral Sciences, 1988.
- Applied Power Analysis for the Behavioral Sciences, 2010.

- Statsmodels Power and Sample Size Calculations
- statsmodels.stats.power.TTestPower API
- statsmodels.stats.power.TTestIndPower
- statsmodels.stats.power.TTestIndPower.solve_power() API

statsmodels.stats.power.TTestIndPower.plot_power() API - Statistical Power in Statsmodels, 2013.
- Power Plots in statsmodels, 2013.

- Statistical power on Wikipedia
- Statistical hypothesis testing on Wikipedia
- Statistical significance on Wikipedia
- Sample size determination on Wikipedia
- Effect size on Wikipedia
- Type I and type II errors on Wikipedia

In this tutorial, you discovered the statistical power of a hypothesis test and how to calculate power analyses and power curves as part of experimental design.

Specifically, you learned:

- Statistical power is the probability of a hypothesis test of finding an effect if there is an effect to be found.
- A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.
- How to calculate and plot power analysis for the Student’s t test in Python in order to effectively design an experiment.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Statistical Power and Power Analysis in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Effect Size Measures in Python appeared first on Machine Learning Mastery.

]]>Hypothesis tests do not comment on the size of the effect if the association or difference is statistically significant. This highlights the need for standard ways of calculating and reporting a result.

Effect size methods refer to a suite of statistical tools from the the field of estimation statistics for quantifying an the size of an effect in the results of experiments that can be used to complement the results from statistical hypothesis tests.

In this tutorial, you will discover effect size and effect size measures for quantifying the magnitude of a result.

After completing this tutorial, you will know:

- The importance of calculating and reporting effect size in the results of experiments.
- Effect size measures for quantifying the association between variables, such as Pearson’s correlation coefficient.
- Effect size measures for quantifying the difference between groups, such as Cohen’s d measure.

Let’s get started.

This tutorial is divided into three parts; they are:

- The Need to Report Effect Size
- What Is Effect Size?
- How to Calculate Effect Size

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Once practitioners become versed in statistical methods, it is common to become focused on quantifying the likelihood of a result.

This is often seen with the calculation and presentation of the results from statistical hypothesis tests in terms of p-value and the significance level.

One aspect that is often neglected in the presentation of results is to actually quantify the difference or relationship, called the effect. It can be easy to forget that the intention of an experiment is to quantify an effect.

The primary product of a research inquiry is one or more measures of effect size, not P values.

— Things I have learned (so far), 1990.

The statistical test can only comment on the likelihood that there is an effect. It does not comment on the size of the effect. The results of an experiment could be significant, but the effect so small that it has little consequence.

It is possible, and unfortunately quite common, for a result to be statistically significant and trivial. It is also possible for a result to be statistically nonsignificant and important.

— Page 4, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The problem with neglecting the presentation of the effect is that it may be calculated using ad hoc measures or even ignored completely and left to the reader to interpret. This is a big problem as quantifying the size of the effect is essential to interpreting results.

An effect size refers to the size or magnitude of an effect or result as it would be expected to occur in a population.

The effect size is estimated from samples of data.

Effect size methods refers to a collection of statistical tools used to calculate the effect size. Often the field of effect size measures is referred to as simply “*effect size*“, to note the general concern of the field.

It is common to organize effect size statistical methods into groups, based on the type of effect that is to be quantified. Two main groups of methods for calculating effect size are:

**Association**. Statistical methods for quantifying an association between variables (e.g. correlation).**Difference**. Statistical methods for quantifying the difference between variables (e.g. difference between means).

An effect can be the result of a treatment revealed in a comparison between groups (e.g. treated and untreated groups) or it can describe the degree of association between two related variables (e.g. treatment dosage and health).

— Page 5, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The result of an effect size calculation must be interpreted, and it depends on the specific statistical method used. A measure must be chosen based on the goals of the interpretation. Three types of calculated result include:

**Standardized Result**. The effect size has a standard scale allowing it to be interpreted generally regardless of application (e.g. Cohen’s d calculation).**Original Units Result**. The effect size may use the original units of the variable, which can aid in the interpretation within the domain (e.g. difference between two sample means).**Unit Free Result**. The effect size may not have units such as a count or proportion (e.g. a correlation coefficient).

Thus, effect size can refer to the raw difference between group means, or absolute effect size, as well as standardized measures of effect, which are calculated to transform the effect to an easily understood scale. Absolute effect size is useful when the variables under study have intrinsic meaning (eg, number of hours of sleep).

— Using Effect Size—or Why the P Value Is Not Enough, 2012.

It may be a good idea to report an effect size using multiple measures to aide the different types of readers of your findings.

Sometimes a result is best reported both in original units, for ease of understanding by readers, and in some standardized measure for ease of inclusion in future meta-analyses.

— Page 41, Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.

The effect size does not replace the results of a statistical hypothesis test. Instead, the effect size complements the test. Ideally, the results of both the hypothesis test and the effect size calculation would be presented side-by-side.

**Hypothesis Test**: Quantify the likelihood of observing the data given an assumption (null hypothesis).**Effect Size**: Quantify the size of the effect assuming that the effect is present.

The calculation of an effect size could be the calculation of a mean of a sample or the absolute difference between two means. It could also be a more elaborate statistical calculation.

In this section, we will look at some common effect size calculations for both associations and differences. The examples of methods is not complete; there may be 100s of methods that can be used to calculate an effect size.

The association between variables is often referred to as the “*r family*” of effect size methods.

This name comes from perhaps the most common method for calculating the effect size called Pearson’s correlation coefficient, also called Pearson’s r.

The Pearson’s correlation coefficient measures the degree of linear association between two real-valued variables. It is a unit-free effect size measure, that can be interpreted in a standard way, as follows:

- -1.0: Perfect negative relationship.
- -0.7: Strong negative relationship
- -0.5: Moderate negative relationship
- -0.3: Weak negative relationship
- 0.0: No relationship.
- 0.3: Weak positive relationship
- 0.5: Moderate positive relationship
- 0.7: Strong positive relationship
- 1.0: Perfect positive relationship.

The Pearson’s correlation coefficient can be calculated in Python using the pearsonr() SciPy function.

The example below demonstrates the calculation of the Pearson’s correlation coefficient to quantify the size of the association between two samples of random Gaussian numbers where one sample has a strong relationship with the second.

# calculate the Pearson's correlation between two variables from numpy.random import randn from numpy.random import seed from scipy.stats import pearsonr # seed random number generator seed(1) # prepare data data1 = 10 * randn(10000) + 50 data2 = data1 + (10 * randn(10000) + 50) # calculate Pearson's correlation corr, _ = pearsonr(data1, data2) print('Pearsons correlation: %.3f' % corr)

Running the example calculates and prints the Pearson’s correlation between the two data samples. We can see that the effect shows a strong positive relationship between the samples.

Pearson’s correlation: 0.712

Another very popular method for calculating the association effect size is the r-squared measure, or r^2, also called the coefficient of determination. It summarizes the proportion of variance in one variable explained by the other.

The difference between groups is often referred to as the “*d family*” of effect size methods.

This name comes from perhaps the most common method for calculating the difference between the mean value of groups, called Cohen’s d.

Cohen’s d measures the difference between the mean from two Gaussian-distributed variables. It is a standard score that summarizes the difference in terms of the number of standard deviations. Because the score is standardized, there is a table for the interpretation of the result, summarized as:

**Small Effect Size**: d=0.20**Medium Effect Size**: d=0.50**Large Effect Size**: d=0.80

The Cohen’s d calculation is not provided in Python; we can calculate it manually.

The calculation of the difference between the mean of two samples is as follows:

d = (u1 - u2) / s

Where *d* is the Cohen’s d, *u1* is the mean of the first sample, *u2* is the mean of the second sample, and *s* is the pooled standard deviation of both samples.

The pooled standard deviation for two independent samples can be calculated as follows:

s = sqrt(((n1 - 1) . s1^2 + (n2 - 1) . s2^2) / (n1 + n2 - 2))

Where *s* is the pooled standard deviation, *n1* and *n2* are the size of the first sample and second samples and *s1^2* and *s2^2* is the variance for the first and second samples. The subtractions are the adjustments for the number of degrees of freedom.

The function below will calculate the Cohen’s d measure for two samples of real-valued variables. The NumPy functions mean() and var() are used to calculate the sample mean and variance respectively.

# function to calculate Cohen's d for independent samples def cohend(d1, d2): # calculate the size of samples n1, n2 = len(d1), len(d2) # calculate the variance of the samples s1, s2 = var(d1, ddof=1), var(d2, ddof=1) # calculate the pooled standard deviation s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) # calculate the means of the samples u1, u2 = mean(d1), mean(d2) # calculate the effect size return (u1 - u2) / s

The example below calculates the Cohen’s d measure for two samples of random Gaussian variables with differing means.

The example is contrived such that the means are different by one half standard deviation and both samples have the same standard deviation.

# calculate the Cohen's d between two samples from numpy.random import randn from numpy.random import seed from numpy import mean from numpy import var from math import sqrt # function to calculate Cohen's d for independent samples def cohend(d1, d2): # calculate the size of samples n1, n2 = len(d1), len(d2) # calculate the variance of the samples s1, s2 = var(d1, ddof=1), var(d2, ddof=1) # calculate the pooled standard deviation s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) # calculate the means of the samples u1, u2 = mean(d1), mean(d2) # calculate the effect size return (u1 - u2) / s # seed random number generator seed(1) # prepare data data1 = 10 * randn(10000) + 60 data2 = 10 * randn(10000) + 55 # calculate cohen's d d = cohend(data1, data2) print('Cohens d: %.3f' % d)

Running the example calculates and prints the Cohen’s d effect size.

We can see that as expected, the difference between the means is one half of one standard deviation interpreted as a medium effect size.

Cohen's d: 0.500

Two other popular methods for quantifying the difference effect size are:

**Odds Ratio**. Measures the odds of an outcome occurring from one treatment compared to another.**Relative Risk Ratio**. Measures the probabilities of an outcome occurring from one treatment compared to another.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Find an example where effect size is reported along with the results of statistical significance in a research paper.
- Implement a function to calculate the Cohen’s d for paired samples and demonstrate it on a test dataset.
- Implement and demonstrate another difference effect measure, such as the odds or risk ratios.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.
- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.
- Statistical Power Analysis for the Behavioral Sciences, 1988.

In this tutorial, you discovered effect size and effect size measures for quantifying the magnitude of a result.

Specifically, you learned:

- The importance of calculating and reporting effect size in the results of experiments.
- Effect size measures for quantifying the association between variables, such as Pearson’s correlation coefficient.
- Effect size measures for quantifying the difference between groups, such as Cohen’s d measure.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Effect Size Measures in Python appeared first on Machine Learning Mastery.

]]>The post How to Calculate Nonparametric Rank Correlation in Python appeared first on Machine Learning Mastery.

]]>It is easy to calculate and interpret when both variables have a well understood Gaussian distribution. When we do not know the distribution of the variables, we must use nonparametric rank correlation methods.

In this tutorial, you will discover rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.

After completing this tutorial, you will know:

- How rank correlation methods work and the methods are that are available.
- How to calculate and interpret the Spearman’s rank correlation coefficient in Python.
- How to calculate and interpret the Kendall’s rank correlation coefficient in Python.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Rank Correlation
- Test Dataset
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Correlation refers to the association between the observed values of two variables.

The variables may have a positive association, meaning that as the values for one variable increase, so do the values of the other variable. The association may also be negative, meaning that as the values of one variable increase, the values of the others decrease. Finally, the association may be neutral, meaning that the variables are not associated.

Correlation quantifies this association, often as a measure between the values -1 to 1 for perfectly negatively correlated and perfectly positively correlated. The calculated correlation is referred to as the “*correlation coefficient*.” This correlation coefficient can then be interpreted to describe the measures.

See the table below to help with interpretation the correlation coefficient.

The correlation between two variables that each have a Gaussian distribution can be calculated using standard methods such as the Pearson’s correlation. This procedure cannot be used for data that does not have a Gaussian distribution. Instead, rank correlation methods must be used.

Rank correlation refers to methods that quantify the association between variables using the ordinal relationship between the values rather than the specific values. Ordinal data is data that has label values and has an order or rank relationship; for example: ‘*low*‘, ‘*medium*‘, and ‘*high*‘.

Rank correlation can be calculated for real-valued variables. This is done by first converting the values for each variable into rank data. This is where the values are ordered and assigned an integer rank value. Rank correlation coefficients can then be calculated in order to quantify the association between the two ranked variables.

Because no distribution for the values is assumed, rank correlation methods are referred to as distribution-free correlation or nonparametric correlation. Interestingly, rank correlation measures are often used as the basis for other statistical hypothesis tests, such as determining whether two samples were likely drawn from the same (or different) population distributions.

Rank correlation methods are often named after the researcher or researchers that developed the method. Four examples of rank correlation methods are as follows:

- Spearman’s Rank Correlation.
- Kendall’s Rank Correlation.
- Goodman and Kruskal’s Rank Correlation.
- Somers’ Rank Correlation.

In the following sections, we will take a closer look at two of the more common rank correlation methods: Spearman’s and Kendall’s.

Before we demonstrate rank correlation methods, we must first define a test problem.

In this section, we will define a simple two-variable dataset where each variable is drawn from a uniform distribution (e.g. non-Gaussian) and the values of the second variable depend on the values of the first value.

Specifically, a sample of 1,000 random floating point values are drawn from a uniform distribution and scaled to the range 0 to 20. A second sample of 1,000 random floating point values are drawn from a uniform distribution between 0 and 10 and added to values in the first sample to create an association.

# prepare data data1 = rand(1000) * 20 data2 = data1 + (rand(1000) * 10)

The complete example is listed below.

# generate related variables from numpy.random import rand from numpy.random import seed from matplotlib import pyplot # seed random number generator seed(1) # prepare data data1 = rand(1000) * 20 data2 = data1 + (rand(1000) * 10) # plot pyplot.scatter(data1, data2) pyplot.show()

Running the example generates the data sample and graphs the points on a scatter plot.

We can clearly see that each variable has a uniform distribution and the positive association between the variables is visible by the diagonal grouping of the points from the bottom left to the top right of the plot.

Spearman’s rank correlation is named for Charles Spearman.

It may also be called Spearman’s correlation coefficient and is denoted by the lowercase greek letter rho (p). As such, it may be referred to as Spearman’s rho.

This statistical method quantifies the degree to which ranked variables are associated by a monotonic function, meaning an increasing or decreasing relationship. As a statistical hypothesis test, the method assumes that the samples are uncorrelated (fail to reject H0).

The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement.

— Page 124, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.

The intuition for the Spearman’s rank correlation is that it calculates a Pearson’s correlation (e.g. a parametric measure of correlation) using the rank values instead of the real values. Where the Pearson’s correlation is the calculation of the covariance (or expected difference of observations from the mean) between the two variables normalized by the variance or spread of both variables.

Spearman’s rank correlation can be calculated in Python using the spearmanr() SciPy function.

The function takes two real-valued samples as arguments and returns both the correlation coefficient in the range between -1 and 1 and the p-value for interpreting the significance of the coefficient.

# calculate spearman's correlation coef, p = spearmanr(data1, data2)

We can demonstrate the Spearman’s rank correlation on the test dataset. We know that there is a strong association between the variables in the dataset and we would expect the Spearman’s test to find this association.

The complete example is listed below.

# calculate the spearman's correlation between two variables from numpy.random import rand from numpy.random import seed from scipy.stats import spearmanr # seed random number generator seed(1) # prepare data data1 = rand(1000) * 20 data2 = data1 + (rand(1000) * 10) # calculate spearman's correlation coef, p = spearmanr(data1, data2) print('Spearmans correlation coefficient: %.3f' % coef) # interpret the significance alpha = 0.05 if p > alpha: print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p) else: print('Samples are correlated (reject H0) p=%.3f' % p)

Running the example calculates the Spearman’s correlation coefficient between the two variables in the test dataset.

The statistical test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, which means that the likelihood of observing the data given that the samples are uncorrelated is very unlikely (e.g. 95% confidence) and that we can reject the null hypothesis that the samples are uncorrelated.

Spearmans correlation coefficient: 0.900 Samples are correlated (reject H0) p=0.000

Kendall’s rank correlation is named for Maurice Kendall.

It is also called Kendall’s correlation coefficient, and the coefficient is often referred to by the lowercase Greek letter tau (t). In turn, the test may be called Kendall’s tau.

The intuition for the test is that it calculates a normalized score for the number of matching or concordant rankings between the two samples. As such, the test is also referred to as Kendall’s concordance test.

The Kendall’s rank correlation coefficient can be calculated in Python using the kendalltau() SciPy function. The test takes the two data samples as arguments and returns the correlation coefficient and the p-value. As a statistical hypothesis test, the method assumes (H0) that there is no association between the two samples.

# calculate kendall's correlation coef, p = kendalltau(data1, data2)

We can demonstrate the calculation on the test dataset, where we do expect a significant positive association to be reported.

The complete example is listed below.

# calculate the kendall's correlation between two variables from numpy.random import rand from numpy.random import seed from scipy.stats import kendalltau # seed random number generator seed(1) # prepare data data1 = rand(1000) * 20 data2 = data1 + (rand(1000) * 10) # calculate kendall's correlation coef, p = kendalltau(data1, data2) print('Kendall correlation coefficient: %.3f' % coef) # interpret the significance alpha = 0.05 if p > alpha: print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p) else: print('Samples are correlated (reject H0) p=%.3f' % p)

Running the example calculates the Kendall’s correlation coefficient as 0.7, which is highly correlated.

The p-value is close to zero (and printed as zero), as with the Spearman’s test, meaning that we can confidently reject the null hypothesis that the samples are uncorrelated.

Kendall correlation coefficient: 0.709 Samples are correlated (reject H0) p=0.000

This section lists some ideas for extending the tutorial that you may wish to explore.

- List three examples where calculating a nonparametric correlation coefficient might be useful during a machine learning project.
- Update each example to calculate the correlation between uncorrelated data samples drawn from a non-Gaussian distribution.
- Load a standard machine learning dataset and calculate the pairwise nonparametric correlation between all variables.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.
- Applied Nonparametric Statistical Methods, Fourth Edition, 2007.
- Rank Correlation Methods, 1990.

- Nonparametric statistics on Wikipedia
- Rank correlation on Wikipedia
- Spearman’s rank correlation coefficient on Wikipedia
- Kendall rank correlation coefficient on Wikipedia
- Goodman and Kruskal’s gamma on Wikipedia
- Somers’ D on Wikipedia

In this tutorial, you discovered rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.

Specifically, you learned:

- How rank correlation methods work and the methods are that are available.
- How to calculate and interpret the Spearman’s rank correlation coefficient in Python.
- How to calculate and interpret the Kendall’s rank correlation coefficient in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate Nonparametric Rank Correlation in Python appeared first on Machine Learning Mastery.

]]>The post Statistics in Plain English for Machine Learning appeared first on Machine Learning Mastery.

]]>A big problem in choosing a beginner book on statistics is that a book may suffer one of two common problems.

It may be a mathematical textbook filled with derivations, special cases, and proofs for each statistical method with little idea for the intuition for the method or how to use it. Or it may be a playbook for a proprietary or ancient statistical package with little relevance to the libraries and problems you face.

In this post, you will discover the book “*Statistics in Plain English*” for learning about statistical methods without getting too bogged down in theory nor implementation details.

After reading this post, you will know:

- That the book is intended to provide a clear presentation of statistical methods for practitioners.
- The contents of the book focus on the foundations, Gaussian distribution, and parametric statistical hypothesis tests.
- A careful reading list can be used to learn about the specific methods relevant to machine learning practitioners.

Let’s get started.

- Statistics in Plain English
- Contents of the Book
- Reading list for Machine Learning

Statistics in Plain English provides an introduction to statistics for students that might be taking a statistics class as part of some other degree program in social sciences.

It was written by Timothy Urdan who is a researcher and professor of Psychology. It is a popular book because of the accessibility of the writing and is currently in the fourth edition. I have the third edition, so any quotes and table of contents will reference that version.

It is not a textbook nor an exercise book, but something in between. Tim modestly states the purpose of the book as follows:

The purpose of this book is to make it a little easier to understand statistics.

His intention is for the book to act as a compliment to a more dense textbook on statistics. Again, I think this is modest and mentioned because it does not dive into more mathematical rigor (derivation and proofs) behind the methods and focuses on the application and intuition for the methods (i.e. what you care about as a practitioner).

I do think that the book is more than suitable as a first step into statistics.

Each chapter introduces a statistic (sometimes more than one) using a consistent template with three parts, as follows:

- A short description of the statistic.
- A longer description of the equation and details of the statistics.
- A worked example for using the statistic.

The book is not long at less than 200 pages. It also uses a large form factor 11 x 5.5 inches, meaning that physically holding the book gives a lot of space to the ideas and examples.

If you have the time and are really new to the field of statistics, it is worth reading cover to cover. Seriously. Even if you’re familiar with the topic, it’s a great read.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

I recommend studying the table of contents.

It is useful for two reasons:

- To get an idea of the breadth in topics for introductory statistics.
- To get an idea of what topics might interest you or be relevant to your projects.

The full 15-chapter table of contents from the third edition of the book are as follows:

- Chapter 1: Introduction to Social Science Research Principles and Terminology
- Chapter 2: Measures of Central Tendency
- Chapter 3: Measures of Variability
- Chapter 4: The Normal Distribution
- Chapter 5: Standardization and z Scores
- Chapter 6: Standard Errors
- Chapter 7: Statistical Significance, Effect Size, and Confidence Intervals
- Chapter 8: Correlation
- Chapter 9: t Tests
- Chapter 10: One-Way Analysis of Variance
- Chapter 11: Factorial Analysis of Variance
- Chapter 12: Repeated-Measures Analysis of Variance
- Chapter 13: Regression
- Chapter 14: The Chi-Square Test of Independence
- Chapter 15: Factor Analysis and Reliability Analysis: Data Reduction Techniques

The presentation provides a clear separation of the topics.

It allows you to pick and choose the topics or chapters that interest you the most and dive in, without having to read prior chapters.

Thee book is organized such that the more basic statistics and statistical concepts are in the earlier chapters whereas the more complex concepts appear later in the book. However, it is not necessary to read one chapter before understanding the next. Rather, each chapter in the book was written to stand on its own.

A review of the table of contents highlights two things:

- The book has a strong focus on the Gaussian distribution, which is reasonable given the importance of this distribution in both probability and statistics.
- The book also has a large focus on statistical hypothesis tests, specifically parametric tests, which aligns with the focus on the Gaussian distribution.

This chosen focus will handle most of the statistical methods required when working with social science experimental data, at least in the beginning. There are a few holes though for the machine learning practitioner. For example:

- The book does not have much on estimation methods, a little on confidence intervals, but nothing on prediction intervals and tolerance intervals.
- The book also does not cover resampling methods (bootstrap, k-fold cross-validation and more).
- The whole area of nonparametric statistical methods are also skipped.

Nevertheless, these topics can be looked up in more targeted books.

It’s a great book and I do recommend it if you are new to statistics and you’re looking for a clear presentation of the foundations that you really do need to know in applied machine learning.

As I mentioned above, it is not a long read and well worth reading cover to cover.

With that being said, not all chapters are relevant or directly useful to you as a machine learning practitioner.

Below is a breakdown or suggested reading list of the book for machine learning practitioners.

I think you need to have some understanding of foundational statistics no matter what. I would recommend reading the first few chapters in order to get this grounding, at least:

- Chapter 1: Introduction to Social Science Research Principles and Terminology
- Chapter 2: Measures of Central Tendency
- Chapter 3: Measures of Variability
- Chapter 4: The Normal Distribution

To beef-up your skills in understanding your training data and in data preparation, I would recommend the following three chapters:

- Chapter 5: Standardization and z Scores
- Chapter 8: Correlation
- Chapter 14: The Chi-Square Test of Independence

For evaluating and comparing machine learning models and model parameters, you can use statistical hypothesis tests. To get started in this area, I would recommend the following two chapters:

- Chapter 7: Statistical Significance, Effect Size, and Confidence Intervals
- Chapter 9: t Tests

You could probably skip the other chapters.

The chapter on linear regression (Chapter 13) might be of interest if you use the method and are interested in a deeper idea of how and why it works.

Do you agree with this reading plan?

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Statistics in Plain English, Fourth Edition, 2016.
- Statistics in Plain English, Third Edition, 2010.
- Timothy Urdan’s Homepage

In this post, you discovered the book “Statistics in Plain English” for learning about statistical methods without getting too bogged down in theory (proofs and derivations) nor implementation details (pages of code and commands for proprietary statistical packages).

Specifically, you learned:

- That the book is intended to provide a clear presentation of statistical methods for practitioners.
- The contents of the book focus on the foundations, Gaussian distribution, and parametric statistical hypothesis tests.
- A careful reading list can be used to learn about the specific methods relevant to machine learning practitioners.

Do you have this book or have you read it?

What do you think of it? Share your thoughts below.

Are you thinking of getting this book?

Why or why not?

The post Statistics in Plain English for Machine Learning appeared first on Machine Learning Mastery.

]]>