Statistical hypothesis tests report on the likelihood of the observed results given an assumption, such as no association between variables or no difference between groups.

Hypothesis tests do not comment on the size of the effect if the association or difference is statistically significant. This highlights the need for standard ways of calculating and reporting a result.

Effect size methods are a suite of statistical tools from the field of estimation statistics for quantifying the size of an effect in the results of experiments. They can be used to complement the results from statistical hypothesis tests.

In this tutorial, you will discover effect size and effect size measures for quantifying the magnitude of a result.

After completing this tutorial, you will know:

- The importance of calculating and reporting effect size in the results of experiments.
- Effect size measures for quantifying the association between variables, such as Pearson’s correlation coefficient.
- Effect size measures for quantifying the difference between groups, such as Cohen’s d measure.

Let’s get started.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- The Need to Report Effect Size
- What Is Effect Size?
- How to Calculate Effect Size

## The Need to Report Effect Size

Once practitioners become versed in statistical methods, it is common to become focused on quantifying the likelihood of a result.

This is often seen with the calculation and presentation of the results from statistical hypothesis tests in terms of p-value and the significance level.

One aspect that is often neglected in the presentation of results is to actually quantify the difference or relationship, called the effect. It can be easy to forget that the intention of an experiment is to quantify an effect.

The primary product of a research inquiry is one or more measures of effect size, not P values.

— Things I have learned (so far), 1990.

A statistical test can only comment on how likely the observed result would be if there were no effect. It does not comment on the size of the effect. The results of an experiment could be significant, but the effect so small that it has little consequence.

It is possible, and unfortunately quite common, for a result to be statistically significant and trivial. It is also possible for a result to be statistically nonsignificant and important.

— Page 4, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The problem with neglecting the presentation of the effect is that it may be calculated using ad hoc measures or even ignored completely and left to the reader to interpret. This is a big problem as quantifying the size of the effect is essential to interpreting results.
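To make the distinction concrete, the sketch below contrives two large samples whose means differ by a very small amount. The sample size, the means, and the use of SciPy's `ttest_ind()` are assumptions chosen for illustration. With this many observations the test would be expected to report a very small p-value, even though the difference amounts to only a small fraction of a standard deviation.

```python
# sketch: a statistically significant result with a trivial effect size
from math import sqrt
from numpy import mean, var
from numpy.random import randn, seed
from scipy.stats import ttest_ind

# seed random number generator
seed(1)
# assumed, contrived data: means differ by 0.5 on a scale with sd 10
group1 = 10 * randn(100000) + 50.5
group2 = 10 * randn(100000) + 50.0

# hypothesis test: is there evidence of any difference in means?
stat, p = ttest_ind(group1, group2)

# effect size: raw difference and difference in standard deviation units
raw_diff = mean(group1) - mean(group2)
# pooled standard deviation for two equally sized samples
pooled_sd = sqrt((var(group1, ddof=1) + var(group2, ddof=1)) / 2)
std_diff = raw_diff / pooled_sd

print('p-value: %.6f' % p)
print('raw difference: %.3f' % raw_diff)
print('standardized difference: %.3f' % std_diff)
```

Reporting only the p-value here would hide the fact that the two groups are almost indistinguishable in practical terms.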

## What Is Effect Size?

An effect size refers to the size or magnitude of an effect or result as it would be expected to occur in a population.

The effect size is estimated from samples of data.

Effect size methods refer to a collection of statistical tools used to calculate the effect size. Often the field of effect size measures is referred to simply as “*effect size*”, to note the general concern of the field.

It is common to organize effect size statistical methods into groups, based on the type of effect that is to be quantified. Two main groups of methods for calculating effect size are:

- **Association**. Statistical methods for quantifying an association between variables (e.g. correlation).
- **Difference**. Statistical methods for quantifying the difference between variables (e.g. difference between means).

An effect can be the result of a treatment revealed in a comparison between groups (e.g. treated and untreated groups) or it can describe the degree of association between two related variables (e.g. treatment dosage and health).

— Page 5, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

The result of an effect size calculation must be interpreted, and it depends on the specific statistical method used. A measure must be chosen based on the goals of the interpretation. Three types of calculated result include:

- **Standardized Result**. The effect size has a standard scale allowing it to be interpreted generally regardless of application (e.g. Cohen’s d calculation).
- **Original Units Result**. The effect size may use the original units of the variable, which can aid in the interpretation within the domain (e.g. difference between two sample means).
- **Unit Free Result**. The effect size may not have units such as a count or proportion (e.g. a correlation coefficient).

Thus, effect size can refer to the raw difference between group means, or absolute effect size, as well as standardized measures of effect, which are calculated to transform the effect to an easily understood scale. Absolute effect size is useful when the variables under study have intrinsic meaning (eg, number of hours of sleep).

— Using Effect Size—or Why the P Value Is Not Enough, 2012.

It may be a good idea to report an effect size using multiple measures to aid the different types of readers of your findings.

Sometimes a result is best reported both in original units, for ease of understanding by readers, and in some standardized measure for ease of inclusion in future meta-analyses.

— Page 41, Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.
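As a rough illustration of why both forms are useful, the sketch below reports the same contrived difference in hours of sleep twice: once in hours and once in minutes. The data and the helper `raw_and_standardized()` are hypothetical. The standardized value is taken here as the raw difference divided by a pooled standard deviation, which is one common choice rather than the only one.

```python
# sketch: the same effect in original units and as a standardized value
from math import sqrt
from numpy import mean, var
from numpy.random import randn, seed

# seed random number generator
seed(1)
# assumed data: hours of sleep for two hypothetical groups
hours1 = 1.0 * randn(1000) + 7.5
hours2 = 1.0 * randn(1000) + 7.0
# the same measurements expressed in minutes
minutes1, minutes2 = hours1 * 60, hours2 * 60

def raw_and_standardized(a, b):
    # difference in the original units of the data
    raw = mean(a) - mean(b)
    # pooled standard deviation for two equally sized samples
    s = sqrt((var(a, ddof=1) + var(b, ddof=1)) / 2)
    # difference expressed in standard deviations
    return raw, raw / s

raw_h, std_h = raw_and_standardized(hours1, hours2)
raw_m, std_m = raw_and_standardized(minutes1, minutes2)
print('hours:   raw=%.3f, standardized=%.3f' % (raw_h, std_h))
print('minutes: raw=%.3f, standardized=%.3f' % (raw_m, std_m))
```

The raw difference changes with the unit of measurement, which is what makes it easy to read within the domain, while the standardized value stays the same, which is what makes it comparable across studies.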

The effect size does not replace the results of a statistical hypothesis test. Instead, the effect size complements the test. Ideally, the results of both the hypothesis test and the effect size calculation would be presented side-by-side.

- **Hypothesis Test**: Quantify the likelihood of observing the data given an assumption (null hypothesis).
- **Effect Size**: Quantify the size of the effect assuming that the effect is present.

## How to Calculate Effect Size

The calculation of an effect size could be the calculation of a mean of a sample or the absolute difference between two means. It could also be a more elaborate statistical calculation.

In this section, we will look at some common effect size calculations for both associations and differences. The list of methods is not complete; there may be hundreds of methods that can be used to calculate an effect size.

### Calculate Association Effect Size

The association between variables is often referred to as the “*r family*” of effect size methods.

This name comes from perhaps the most common method for calculating the effect size called Pearson’s correlation coefficient, also called Pearson’s r.

Pearson’s correlation coefficient measures the degree of linear association between two real-valued variables. It is a unit-free effect size measure that can be interpreted in a standard way, as follows:

- -1.0: Perfect negative relationship.
- -0.7: Strong negative relationship.
- -0.5: Moderate negative relationship.
- -0.3: Weak negative relationship.
- 0.0: No relationship.
- 0.3: Weak positive relationship.
- 0.5: Moderate positive relationship.
- 0.7: Strong positive relationship.
- 1.0: Perfect positive relationship.
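The labels above can be turned into a small helper for reporting. The function `describe_r()` below is hypothetical, and it makes an assumption the list does not state: each listed value is treated as the lower bound of its band, so anything with a magnitude below 0.3 is described as little or no relationship.

```python
# sketch: map a correlation coefficient onto the rough labels listed above
def describe_r(r):
    # assumption: each listed value is the lower bound of its band
    size = abs(r)
    if size >= 1.0:
        strength = 'perfect'
    elif size >= 0.7:
        strength = 'strong'
    elif size >= 0.5:
        strength = 'moderate'
    elif size >= 0.3:
        strength = 'weak'
    else:
        return 'little or no relationship'
    # the sign of r gives the direction of the relationship
    direction = 'positive' if r > 0 else 'negative'
    return '%s %s relationship' % (strength, direction)

for r in [-0.85, -0.4, 0.1, 0.55, 1.0]:
    print('%.2f: %s' % (r, describe_r(r)))
```

Such cut-offs are conventions rather than rules, so the label is best reported alongside the coefficient itself.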

The Pearson’s correlation coefficient can be calculated in Python using the pearsonr() SciPy function.

The example below demonstrates the calculation of the Pearson’s correlation coefficient to quantify the size of the association between two samples of random Gaussian numbers where one sample has a strong relationship with the second.

```python
# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 10 * randn(10000) + 50
data2 = data1 + (10 * randn(10000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)
```

Running the example calculates and prints the Pearson’s correlation between the two data samples. We can see that the effect shows a strong positive relationship between the samples.

```
Pearsons correlation: 0.712
```

Another very popular method for calculating the association effect size is the r-squared measure, or r^2, also called the coefficient of determination. It summarizes the proportion of variance in one variable explained by the other.
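For two variables, r-squared is simply the square of Pearson’s r. The sketch below reuses the contrived data from the correlation example above; the variable name `r_squared` is my own.

```python
# sketch: r-squared as the square of Pearson's correlation coefficient
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data (same contrived setup as the correlation example)
data1 = 10 * randn(10000) + 50
data2 = data1 + (10 * randn(10000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
# square it to get the proportion of variance explained
r_squared = corr ** 2
print('r: %.3f, r-squared: %.3f' % (corr, r_squared))
```

With a correlation of about 0.71, roughly half of the variance in one variable is accounted for by the other, which is a more sober reading than the label "strong" might suggest on its own.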

### Calculate Difference Effect Size

The difference between groups is often referred to as the “*d family*” of effect size methods.

This name comes from perhaps the most common method for calculating the difference between the mean value of groups, called Cohen’s d.

Cohen’s d measures the difference between the means of two Gaussian-distributed variables. It is a standard score that summarizes the difference in terms of the number of standard deviations. Because the score is standardized, there is a table for the interpretation of the result, summarized as:

- **Small Effect Size**: d=0.20
- **Medium Effect Size**: d=0.50
- **Large Effect Size**: d=0.80

A Cohen’s d function is not provided by NumPy or the Python standard library; we can calculate it manually.

The calculation of the difference between the mean of two samples is as follows:

```
d = (u1 - u2) / s
```

Where *d* is the Cohen’s d, *u1* is the mean of the first sample, *u2* is the mean of the second sample, and *s* is the pooled standard deviation of both samples.

The pooled standard deviation for two independent samples can be calculated as follows:

```
s = sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
```

Where *s* is the pooled standard deviation, *n1* and *n2* are the sizes of the first and second samples, and *s1^2* and *s2^2* are the variances of the first and second samples. The subtractions are the adjustments for the number of degrees of freedom.
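Before wrapping this in a function, it may help to work through the arithmetic on a tiny example. The two four-element samples below are assumed toy data, chosen only so that each step can be checked by hand; the standard library `statistics` module is used to keep the sketch dependency-free.

```python
# sketch: pooled standard deviation and Cohen's d worked through by hand
from math import sqrt
from statistics import mean, variance

# assumed toy data, chosen so the arithmetic is easy to follow
d1 = [2, 4, 6, 8]
d2 = [1, 3, 5, 7]
n1, n2 = len(d1), len(d2)
# sample variances (divisor n - 1): both are 20 / 3
v1, v2 = variance(d1), variance(d2)
# pooled standard deviation: sqrt((3 * v1 + 3 * v2) / 6)
s = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
# effect size: (5 - 4) / s
d = (mean(d1) - mean(d2)) / s
print('pooled sd: %.3f' % s)
print('d: %.3f' % d)
```

Because both samples have the same variance, the pooled standard deviation equals the standard deviation of either sample (about 2.582), and the one-unit difference in means works out to a d of about 0.387.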

The function below will calculate the Cohen’s d measure for two samples of real-valued variables. The NumPy functions mean() and var() are used to calculate the sample mean and variance respectively.

```python
# function to calculate Cohen's d for independent samples
def cohend(d1, d2):
    # calculate the size of samples
    n1, n2 = len(d1), len(d2)
    # calculate the variance of the samples
    s1, s2 = var(d1, ddof=1), var(d2, ddof=1)
    # calculate the pooled standard deviation
    s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # calculate the means of the samples
    u1, u2 = mean(d1), mean(d2)
    # calculate the effect size
    return (u1 - u2) / s
```

The example below calculates the Cohen’s d measure for two samples of random Gaussian variables with differing means.

The example is contrived such that the means are different by one half standard deviation and both samples have the same standard deviation.

```python
# calculate the Cohen's d between two samples
from numpy.random import randn
from numpy.random import seed
from numpy import mean
from numpy import var
from math import sqrt

# function to calculate Cohen's d for independent samples
def cohend(d1, d2):
    # calculate the size of samples
    n1, n2 = len(d1), len(d2)
    # calculate the variance of the samples
    s1, s2 = var(d1, ddof=1), var(d2, ddof=1)
    # calculate the pooled standard deviation
    s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # calculate the means of the samples
    u1, u2 = mean(d1), mean(d2)
    # calculate the effect size
    return (u1 - u2) / s

# seed random number generator
seed(1)
# prepare data
data1 = 10 * randn(10000) + 60
data2 = 10 * randn(10000) + 55
# calculate cohen's d
d = cohend(data1, data2)
print('Cohens d: %.3f' % d)
```

Running the example calculates and prints the Cohen’s d effect size.

We can see that, as expected, the difference between the means is one half of one standard deviation, which is interpreted as a medium effect size.

```
Cohens d: 0.500
```

Two other popular methods for quantifying the difference effect size are:

- **Odds Ratio**. Measures the odds of an outcome occurring from one treatment compared to another.
- **Relative Risk Ratio**. Measures the probabilities of an outcome occurring from one treatment compared to another.
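Both measures can be computed directly from a 2x2 table of counts. The counts below are made up for illustration (a treated group and a control group of 100 subjects each), and the variable names are my own.

```python
# sketch: odds ratio and relative risk from a 2x2 table of counts
# assumed counts: outcome occurred / did not occur in each group
treated_events, treated_non_events = 20, 80
control_events, control_non_events = 40, 60

# risk: probability of the outcome within each group
risk_treated = treated_events / (treated_events + treated_non_events)
risk_control = control_events / (control_events + control_non_events)
relative_risk = risk_treated / risk_control

# odds: events relative to non-events within each group
odds_treated = treated_events / treated_non_events
odds_control = control_events / control_non_events
odds_ratio = odds_treated / odds_control

print('Relative risk: %.3f' % relative_risk)
print('Odds ratio: %.3f' % odds_ratio)
```

With these counts the relative risk is 0.500 and the odds ratio is 0.375. The two measures agree on direction but not on magnitude, so it matters which one is being reported.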

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- Find an example where effect size is reported along with the results of statistical significance in a research paper.
- Implement a function to calculate the Cohen’s d for paired samples and demonstrate it on a test dataset.
- Implement and demonstrate another difference effect measure, such as the odds or risk ratios.

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.
- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.
- Statistical Power Analysis for the Behavioral Sciences, 1988.

## Summary

In this tutorial, you discovered effect size and effect size measures for quantifying the magnitude of a result.

Specifically, you learned:

- The importance of calculating and reporting effect size in the results of experiments.
- Effect size measures for quantifying the association between variables, such as Pearson’s correlation coefficient.
- Effect size measures for quantifying the difference between groups, such as Cohen’s d measure.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Great introduction, Jason!

Thanks.

Nice tutorial Jason, for Effect size and particularly Cohen’s d calculation, does the calculation of Cohen’s d for the effect size only work on 2 distributions that are Gaussian or can it work across any 2 distributions?

As an example I’m thinking if you had say 2 datasets that were Gamma distributions or one dataset was Gaussian and one was a Gamma distribution?

I think what I’m trying to ask is does this work for any 2 distributions or are there specific properties of the 2 distributions (ie both normal distributions) that are required for Cohen’s d to provide meaningful results?

I believe the data must be Gaussian.

Perhaps there are modifications you can make for different distributions.

Hello Jason,

How to understand s1, s2 = var(d1, ddof=1), var(d2, ddof=1)? what does ddof mean?

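It stands for “delta degrees of freedom”. NumPy divides by N - ddof when calculating the variance, so ddof=1 gives the sample variance (divisor N - 1) rather than the population variance (divisor N). More on degrees of freedom: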

https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
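A quick way to see the effect of the argument is to compute the variance of a tiny list both ways. This is only a sketch; the four values are arbitrary.

```python
# sketch: effect of the ddof argument on numpy.var
from numpy import var

data = [1.0, 2.0, 3.0, 4.0]
# ddof=0 (the default): sum of squared deviations divided by n
population_var = var(data, ddof=0)
# ddof=1: sum of squared deviations divided by n - 1
sample_var = var(data, ddof=1)
print('ddof=0: %.4f' % population_var)
print('ddof=1: %.4f' % sample_var)
```

The squared deviations from the mean of 2.5 sum to 5, so the two results are 5 / 4 = 1.2500 and 5 / 3 = 1.6667.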

Hi Jason, wonderful post. I am trying to run the TTestPower.solve_power from statsmodels using the

effect_size = np.mean(sample_data) / np.std(sample_data)

nobs = 10

alpha = 0.05

However, I keep getting an error asking me to declare an additional argument called self. I have no idea what this argument self is and I am unable to find any documentation for it in the statsmodels website. I have spent countless hours trying to make this function work. Your help would be immensely helpful.

FYI. The syntax that I am using is given below followed by the error

```
TTestPower.solve_power(effect_size=1,nobs = 10,alpha = 0.05)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 TTestPower.solve_power(effect_size=1,nobs = 10,alpha = 0.05)

TypeError: solve_power() missing 1 required positional argument: 'self'
```

Sorry, I have not seen this problem, perhaps check your libraries are up to date?

If they are, perhaps try posting your code and error to stackoverflow?
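One more thing to check: a message about a missing `self` argument usually means that a method was called on the class itself rather than on an instance of it. The sketch below reproduces this with a hypothetical stand-in class, `PowerSolver`, so that it runs without statsmodels; it does not perform any power calculation.

```python
# sketch: why calling an instance method on the class asks for 'self'
class PowerSolver:
    """Hypothetical stand-in for a class such as TTestPower."""

    def solve_power(self, effect_size=None, nobs=None, alpha=None):
        # a real implementation would solve for the missing quantity
        return {'effect_size': effect_size, 'nobs': nobs, 'alpha': alpha}

# calling the method on the class: no instance is passed as 'self'
try:
    PowerSolver.solve_power(effect_size=1, nobs=10, alpha=0.05)
    error_message = None
except TypeError as e:
    error_message = str(e)
print(error_message)

# calling the method on an instance: note the extra parentheses
result = PowerSolver().solve_power(effect_size=1, nobs=10, alpha=0.05)
print(result)
```

If the same thing is happening in your code, creating an instance first, as in `TTestPower().solve_power(effect_size=1, nobs=10, alpha=0.05)`, would presumably resolve the error.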

Thank you for this wonderful article. Is there any good reference for non-parametric effect sizes (non-normal distributions) and meta-analysis using them?

Perhaps some of the resources in the “further reading” section will help as a first step.

Hey, I have found something that might help:

– https://www.researchgate.net/post/What_is_the_appropriate_effect_size_calculation_for_Wilcoxon_signed_rank_test_related_samples

– file:///C:/Users/eLab/Downloads/Effect_Size_Estimates_Current_Use_Calculations_and.pdf

– I followed the process here to calculate z https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

Thanks for sharing.

Hi Jason, Thanks for another great article,

I have a question: isn’t it right that, as a consequence of the Central Limit Theorem, if we are trying to compare two algorithms on a single domain (paired two-sample t-test) and we have at least 30 samples in the test set for measuring evaluation metrics (e.g. accuracy), then we can accept that the distribution of the difference between the evaluation results of the two algorithms could be normal? And so, in cases where the test set contains more than 30 samples, can we safely use Cohen’s d to calculate the effect size?

Excellent article! I’m wondering: can we use Cohen’s d to quantify the difference between, say, the accuracies of two ML models? I’m considering that we have two distributions around the mean accuracies calculated by bootstrapping, for instance