How to Calculate Correlation Between Variables in Python

There may be complex and unknown relationships between the variables in your dataset.

It is important to discover and quantify the degree to which variables in your dataset are dependent upon each other. This knowledge can help you better prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance will degrade with the presence of these interdependencies.

In this tutorial, you will discover that correlation is the statistical summary of the relationship between variables and how to calculate it for different types of variables and relationships.

After completing this tutorial, you will know:

  • How to calculate a covariance matrix to summarize the linear relationship between two or more variables.
  • How to calculate the Pearson’s correlation coefficient to summarize the linear relationship between two variables.
  • How to calculate the Spearman’s correlation coefficient to summarize the monotonic relationship between two variables.

Discover statistical hypothesis testing, resampling methods, estimation statistics and nonparametric methods in my new book, with 29 step-by-step tutorials and full source code.

Let’s get started.

  • Update May/2018: Updated description of the sign of the covariance (thanks Fulya).
How to Use Correlation to Understand the Relationship Between Variables
Photo by Fraser Mummery, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. What is Correlation?
  2. Test Dataset
  3. Covariance
  4. Pearson’s Correlation
  5. Spearman’s Correlation

What is Correlation?

Variables within a dataset can be related for lots of reasons.

For example:

  • One variable could cause or depend on the values of another variable.
  • One variable could be lightly associated with another variable.
  • Two variables could depend on a third unknown variable.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s values decrease. Correlation can also be neutral or zero, meaning that the variables are unrelated.

  • Positive Correlation: both variables change in the same direction.
  • Neutral Correlation: No relationship in the change of the variables.
  • Negative Correlation: variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We may also be interested in the correlation between input variables and the output variable in order to provide insight into which variables may or may not be relevant as inputs for developing a model.

The structure of the relationship may be known, e.g. it may be linear, or we may have no idea whether a relationship exists between two variables or what structure it may take. Depending on what is known about the relationship and the distribution of the variables, different correlation scores can be calculated.

In this tutorial, we will look at one score for variables that have a Gaussian distribution and a linear relationship and another that does not assume a distribution and will report on any monotonic (increasing or decreasing) relationship.

Test Dataset

Before we look at correlation methods, let’s define a dataset we can use to test the methods.

We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be values from the first variable with Gaussian noise added with a mean of 50 and a standard deviation of 10.

We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range.

The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.
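
Tying this together, a minimal sketch of the data generation, summary statistics, and scatter plot is below (the data1 and data2 lines follow the snippet quoted in the comments below).

# generate two related Gaussian variables and inspect them
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed the random number generator so results are reproducible
seed(1)
# prepare data: data2 is data1 plus Gaussian noise
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize each variable
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot the relationship
pyplot.scatter(data1, data2)
pyplot.show()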

Running the example first prints the mean and standard deviation for each variable.

A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.

Scatter plot of the test correlation dataset

Before we look at calculating some correlation scores, we must first look at an important statistical building block, called covariance.

Covariance

Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples.

This relationship between two variables can be summarized by a statistic called the covariance. It is calculated as the average of the product between the values from each sample, where the values have been centered (had their mean subtracted).

The calculation of the sample covariance is as follows (where n is the size of the sample and x and y are the individual paired values):

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) * 1 / (n - 1)

The use of the mean in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution.

The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or change in different directions (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that there is no linear relationship between the two variables; it does not, on its own, mean the variables are independent.

The cov() NumPy function can be used to calculate a covariance matrix between two or more variables.

The diagonal of the matrix contains the covariance between each variable and itself. The other values in the matrix represent the covariance between the two variables; in this case, the remaining two values are the same given that we are calculating the covariance for only two variables.

We can calculate the covariance matrix for the two variables in our test problem.

The complete example is listed below.
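
A minimal sketch of the complete covariance example, reusing the same contrived dataset, might look as follows.

# calculate the covariance matrix between two variables
from numpy import cov
from numpy.random import randn
from numpy.random import seed
# seed the random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate the 2x2 covariance matrix
covariance = cov(data1, data2)
print(covariance)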

The covariance and covariance matrix are used widely within statistics and multivariate analysis to characterize the relationships between two or more variables.

Running the example calculates and prints the covariance matrix.

Because the dataset was contrived with each variable drawn from a Gaussian distribution and the variables linearly correlated, covariance is a reasonable method for describing the relationship.

The covariance between the two variables is 389.75. We can see that it is positive, suggesting the variables change in the same direction as we expect.

A problem with covariance as a statistical tool alone is that it is challenging to interpret: its magnitude depends on the scales of the variables. This leads us to the Pearson’s correlation coefficient next.

Pearson’s Correlation

The Pearson correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.

The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score:

Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))

The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution.

The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship.

The coefficient returns a value between -1 and 1 that represents the limits of correlation, from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted, where often a value above 0.5 or below -0.5 indicates a notable correlation, and values between those bounds suggest a less notable correlation.

The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples with the same length.

We can calculate the correlation between the two variables in our test problem.

The complete example is listed below.
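
A minimal sketch of the complete Pearson’s correlation example might look as follows.

# calculate Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed the random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate the coefficient (pearsonr also returns a p-value)
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)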

Running the example calculates and prints the Pearson’s correlation coefficient.

We can see that the two variables are positively correlated and that the correlation is 0.8. This suggests a high level of correlation, e.g. a value above 0.5 and close to 1.0.

The Pearson’s correlation coefficient can be used to evaluate the relationship between more than two variables.

This can be done by calculating a matrix of the relationships between each pair of variables in the dataset. The result is a symmetric matrix called a correlation matrix with a value of 1.0 along the diagonal as each column always perfectly correlates with itself.
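
The tutorial does not list code for this step, but as an illustration, NumPy’s corrcoef() function computes such a matrix of Pearson’s correlation coefficients; a small sketch using the same two variables:

# calculate a correlation matrix for two variables
from numpy import corrcoef
from numpy.random import randn
from numpy.random import seed
# seed the random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# the matrix has 1.0 on the diagonal and the pairwise correlation elsewhere
matrix = corrcoef(data1, data2)
print(matrix)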

Spearman’s Correlation

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables.

Further, the two variables being considered may have a non-Gaussian distribution.

In this case, the Spearman’s correlation coefficient (named for Charles Spearman) can be used to summarize the strength of the relationship between the two data samples. This test of relationship can also be used if there is a linear relationship between the variables, but will have slightly less power (e.g. it may result in lower coefficient scores).

As with the Pearson correlation coefficient, the scores range between -1 and 1, from perfectly negatively correlated variables to perfectly positively correlated variables.

Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of values in each sample. This is a common approach used in nonparametric statistics, e.g. statistical methods that do not assume a distribution for the data, such as the Gaussian.

A linear relationship between the variables is not assumed, although a monotonic relationship is assumed. This is a mathematical name for an increasing or decreasing relationship between the two variables.

If you are unsure of the distribution and possible relationships between two variables, the Spearman correlation coefficient is a good tool to use.

The spearmanr() SciPy function can be used to calculate the Spearman’s correlation coefficient between two data samples with the same length.

We can calculate the correlation between the two variables in our test problem.

The complete example is listed below.
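
A minimal sketch of the complete Spearman’s correlation example might look as follows.

# calculate Spearman's rank correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr
# seed the random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate the coefficient (spearmanr also returns a p-value)
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)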

Running the example calculates and prints the Spearman’s correlation coefficient.

We know that the data is Gaussian and that the relationship between the variables is linear. Nevertheless, the nonparametric rank-based approach shows a strong correlation of 0.8 between the variables.

As with the Pearson’s correlation coefficient, the coefficient can be calculated pair-wise for each variable in a dataset to give a correlation matrix for review.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Generate your own datasets with positive and negative relationships and calculate both correlation coefficients.
  • Write functions to calculate Pearson or Spearman correlation matrices for a provided dataset.
  • Load a standard machine learning dataset and calculate correlation coefficients between all pairs of real-valued variables.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered that correlation is the statistical summary of the relationship between variables and how to calculate it for different types of variables and relationships.

Specifically, you learned:

  • How to calculate a covariance matrix to summarize the linear relationship between two or more variables.
  • How to calculate the Pearson’s correlation coefficient to summarize the linear relationship between two variables.
  • How to calculate the Spearman’s correlation coefficient to summarize the monotonic relationship between two variables.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

71 Responses to How to Calculate Correlation Between Variables in Python

  1. Roman April 27, 2018 at 3:24 pm #

    Hi, Jason!

    Thank you very much for the article!
    Maybe I’m wrong. But it seems to me that the covariance formula should be with an additional left bracket:
    cov(X, Y) = sum ((x – mean(X)) * (y – mean(Y)) * 1/(n-1)).
    Please correct me if I’m not right.

    • Jason Brownlee April 28, 2018 at 5:23 am #

      Fixed, thanks.

      • Dinesh August 13, 2019 at 7:20 pm #

        Hello Jason,

        It is a very good article on correlation for beginners like me.

        I think you missed one more bracket in covariance function.

        cov(X,Y) = sum((x-mean(X)) * (y-mean(Y))) * 1/(n-1)

  2. Jaime April 27, 2018 at 6:38 pm #

    Hi Jason, thank you for yet again another awesome tutorial, this is exactly what I was looking for in the past few days. I have a question, in case that we are interested in the correlation between our input variables and the output variable, can we simply compute it similarly only by using one of the correlation metrics, the desired input variable and the output variable? or is there a different procedure to follow when considering the output?

    • Jason Brownlee April 28, 2018 at 5:27 am #

      Yes, it is a helpful way for finding relevant/irrelevant input variables.

      It does ignore interactions between multiple inputs and the output though, so always experiment and use resulting model skill to guide you.

  3. Ethan Schmit April 27, 2018 at 11:19 pm #

    Hi Jason.

    I am curious, do you have a preferred measure to measure “correlation” between your inputs and a binary target value?

    • Jason Brownlee April 28, 2018 at 5:30 am #

      Good question.

      For two categorial variables, we can use the chi squared test.

      Generally, I would be looking at feature selection methods instead.

  4. Miha April 28, 2018 at 4:57 am #

    Hi Jason! Great article! I look forward to reading your stats book. And I love API section at the end of the blog! Cheers!

  5. Jay April 28, 2018 at 7:38 am #

    Hi Jason!
    Wow really thank you for your articles. Great articles.

Do you have a plan to add Granger causality analysis, which is also a way to measure a correlation between variables?
    Have a great day~

  6. Anthony The Koala April 29, 2018 at 10:42 pm #

    Dear Dr Jason,
I wish to report two typo errors under “Test Dataset”, “Covariance”, “Pearson’s Correlation” and “Spearman’s Correlation”,

    where it respectively says in lines 11 & 12, 8 & 9, 8 & 9 and 8 & 9.

    it should be

    This is because under the former, you will get an error if using ‘randn’..
    Regards
    Anthony of exciting Belfield.

    • Anthony The Koala April 29, 2018 at 11:59 pm #

Dear Dr Jason,
      Many apologies for this. The tutorial was not obvious in distinguishing the ‘random’ and ‘randn’ functions.

The differences are:
      ‘randn’ generates standard normal distributions from -1 to 1. Further help see

      And ‘random’ is a versatile function which allows one to generate values from various distributions such as uniform, and gaussian(mu, sigma) just to name a few.

      Sorry for ignoring the subtlety of importing ‘randn’. At the same time I did learn something.

      Thank you,
      Anthony of exciting Belfield

      • Jason Brownlee April 30, 2018 at 5:36 am #

        No problem.

        Teaching you something was the goal of this site!

    • Jason Brownlee April 30, 2018 at 5:35 am #

      It does not give an error on Py3 and the latest numpy.

      What version of Numpy/Python were you using?

      • Anthony The Koala April 30, 2018 at 11:45 am #

        Dear Dr Jason.
        It was my fault I did not include

        No problems after that.
        The versions are:

        All’s well that ends well,
        Anthony from downtown Belfield

  7. Yanyun Zou May 7, 2018 at 11:32 pm #

    Hi Jason,
    Thanks for your post. Here is my case, there are many candidate input variables, and I’d like to predict one output. And I want to select some related variables as input from all the variables. So can I use the Normalized Mutual Information (NMI) method to do the selection?

  8. Pankaj May 8, 2018 at 5:10 pm #

    Hi Jason,

    Can I apply the pearson correlation with two time series in order to find how two time series depend with each other?

    If not, could you please give some source or your another blog post to read.

    cheers

  9. Randy May 12, 2018 at 2:58 am #

    Is there an example using this simple dataset to separate the two gaussian distributions, showing a resulting scatterplot using two colors for each distribution ?

  10. Arturo July 18, 2018 at 11:49 am #

    Hi Jason,

Thanks for the nice post. I have a small suggestion. It is my understanding that “relationship” is meant to be used between people (e.g., “they have a close relationship”), while “relation” is meant to be used for more abstract concepts (such as between two variables).

    For more details, see http://www.learnersdictionary.com/qa/relations-and-relationship

    Cheers!

  11. Mayur August 21, 2018 at 11:18 pm #

    I am working on kaggle dataset and I want to check non-linear correlation between 2 features.
    So according to you Spearman’s Correlation can be used for my purpose right?
    What are the other ways to calculate non-linear relationships ?

    Thank you in advance.

  12. Mayur August 21, 2018 at 11:32 pm #

According to my knowledge, Spearman’s correlation needs a monotonic relationship between the 2 features, which is similar to a linear relationship (though less restrictive).
    Please correct me if I am wrong.

    Thanks

  13. Adrian November 22, 2018 at 4:18 am #

    Excellent article jason!

    How would I go about determining relationships between several variables (dependency is unknown at this time) to come up with a metric that will be used as an indicator for some output variable? Which parameters have a correlation to the output (workload over time). And how does this metric compare against the Bedford workload scale indicator.

    • Jason Brownlee November 22, 2018 at 6:26 am #

      Perhaps start by measuring simple linear correlations between variables and whether the results are significant.

  14. Totimeh Enoch Mensah December 15, 2018 at 9:53 am #

    Please can anyone help me with the formula for correlation between variables?

    Thank you

    Best regards,

  15. Imtiaz Ul hassan January 2, 2019 at 8:06 pm #

    Nice Article thumbs Up. I have Question does correlation among different variables affect the performance of the regression model other than linear regression? Like SVM or Random forest regression?

    • Jason Brownlee January 3, 2019 at 6:11 am #

      Yes, it can impact other models. Perhaps SVM, probably not random forest.

  16. Jan January 6, 2019 at 5:39 pm #

    Hi Jason,

    Could you help me to understand when should I use Theil’s U [https://en.m.wikipedia.org/wiki/Uncertainty_coefficient] and pearson’s/spearman’s Coefficient to compute the coefficient between categorical variables?

    Thank you!

    • Jason Brownlee January 7, 2019 at 6:26 am #

      Thanks for the suggestion, I may cover the topic in the future

  17. Dennis Cartin February 20, 2019 at 1:38 am #

    Dear Jason,

Thank you. Your articles are always great and good to read. To me the best part of your blog is the Q&A, where I greatly admire your patience in answering all the questions raised (hopefully including mine 🙂).

My question is: what particular type of correlation should we look at in our feature selection for a classification problem? Is it positive correlation? Neutral correlation? Negative correlation?

    Hope my question make sense to you.

    Thank you again.

    • Jason Brownlee February 20, 2019 at 8:09 am #

      Positive or negative linear correlation is a good start.

  18. dennis cartin February 20, 2019 at 6:15 pm #

    Thanks Jason. Good to know your thought on the matter.

  19. Ashish February 26, 2019 at 1:32 pm #

    HI Jason,

    For a binary classification task, with a numerical data set having continuous values for all variables other than target. Can we use corr() to determine relationship among variables?

    In a few places applying corr() was questioned.

    Thanks.

  20. Anjana March 25, 2019 at 6:33 pm #

    Can we calculate Pearson’s Correlation Co-efficient if the target is binary? I have seen it being used in many places. But how does it work?

    Also, does it make sense to calculate the correlation between categorical features with the target (binary or continuous)?

    In Python, to calculate correlation, we can use corr() or pearsonr(). What is the difference?

  21. Vasso April 21, 2019 at 3:10 pm #

    Hi Jason! Great work. I was wondering, can you propose a media source that presents a relationship between two variables?

  22. Prakash April 25, 2019 at 8:04 pm #

    Need Quick help!!
To measure similarity, is cross-correlation the only way? I know Dynamic Time Warping (DTW), which is time consuming for the signal classification problem I am working on. I need an algorithm which works in the time domain instead of frequency methods like the Fast Fourier Transform.

    • Jason Brownlee April 26, 2019 at 8:32 am #

      Sorry, I don’t have any tutorials on calculating the similarity between time series. I hope to cover the topic in the future.

  23. Buddy April 28, 2019 at 1:25 pm #

    Hi Jason,

I have created a Spearman rank correlation matrix where each comparison is between randomly sampled current density maps. The 10 maps have been generated via Circuitscape (using circuit theory), each with a unique range of cost values (all with three ranks: low, medium, high. E.g. 1,2,3 or 10,100,1000) used to generate each current density map. To summarize, overall mean correlation was 0.79 with an overall range of 0.52 to 0.99 within the matrix. The 4 maps with cost value ranges where the factorial change from medium to high is a fraction, and is also smaller than the low to medium factorial change (e.g. 1,2,3 (2 & 1.5) or 1,150,225 (150 & 1.5) or 1,1.5,2.25 (1.5 & 1.5)), had mean ranges of 0.67 to 0.75. The 6 others were whole numbers with either increasing or even factorial changes, with mean correlation values ranging from 0.82 to 0.85.

    Is there a particular reason why, in the cost value ranges, the second factorial change being smaller than the first and also being a fraction (or containing a decimal place, if you will) would lower the correlation values?

    Thank you

  24. Sandor Kecskes May 24, 2019 at 6:32 pm #

    Hi Jason,

    It was a great article. For a quick correlation, I found this tool:

    https://www.answerminer.com/calculators/correlation-test

    Thanks

  25. Mohit May 29, 2019 at 2:53 am #

    Hello James,

    How do we find a correlation between two rows or two columns of the dataset If we do not have any domain knowledge and there are high numbers of rows and columns in the dataset?

    • Mohit May 29, 2019 at 2:54 am #

      *Jason

    • Jason Brownlee May 29, 2019 at 8:55 am #

      You can find correlation between columns, e.g. features.

      Not between rows, but across rows for two features.

      • Mohit May 29, 2019 at 8:08 pm #

        Thank you Jason.

        Could you please give an example of how to find a correlation between the columns?

  26. Upender Singh June 7, 2019 at 7:46 pm #

    hi Jason,

    suppose given two variable
    data1 = 20 * randn(1000) + 100
    data2 = data1 + (10 * randn(1000) + 50)

I am confused: when I get 0.8 that means high correlation, but if I get 0, then which one of the variables should be discarded?

  27. Christo June 25, 2019 at 1:12 pm #

    Hi Jason,

Thanks for the wonderful article. I am a newbie.
Is the following LoC correct?

    corr,_ =pearsonr(0.599122807018,0.674224021592)

    • Jason Brownlee June 25, 2019 at 2:23 pm #

      What is LoC?

      • Christo June 25, 2019 at 2:29 pm #

        Line of Code

    • Christo June 25, 2019 at 2:28 pm #

      Hi Jason,

      I know the question above is dumb since correlation might produce NaN. My intended question was: How to find correlation between classification accuracies of different classifiers and compare?
      In this case say for example the accuracy of Knn is 0.59 and that of DT is 0.67.

      Please tell me a way to do so in order to choose best few classifiers for creating an ensemble from many.

      Thank You

      • Jason Brownlee June 26, 2019 at 6:32 am #

        In choosing models for an ensemble, we would monitor the correlation between classifiers based on their prediction error on a test set, not on their summary statistics like accuracy scores.

        For classification, we might look at the correlation across the predicted probabilities for example.

  28. Emile JORDAAN July 16, 2019 at 6:17 pm #

    Hi Jason

    I have a sensor data set. The sensor data is strongly (positively) correlated with temperature. As temperature moves, the sensor values drift with the temperature. I need to compensate for this temperature-induced drift. I therefore need an algorithm to offset (neutralize) the effect of the temperature on the primary variable I am measuring.

    Can you help?

  29. Minh Hieu July 18, 2019 at 5:23 pm #

    Hi Jason,

    Thank you very much for your useful tutorial post.

I do not have a strong base in statistics, so I would like to ask which coefficient is suitable for the case that considers both categorical and continuous variables in a correlation matrix?

    Thank you!

    • Jason Brownlee July 19, 2019 at 9:10 am #

      You must use different measures.

      Maybe Pearson correlation for real-valued variables and maybe chi-squared for categorical variables.

  30. gepp October 3, 2019 at 1:45 am #

How do you perform a one-sided test, when you know the type of correlation (positive, for example) that you should be looking for?

    Thanks for your help
