Statistics for Machine Learning (7-Day Mini-Course)

By Jason Brownlee on August 8, 2019 in Statistics

Statistics for Machine Learning Crash Course.

Get on top of the statistics used in machine learning in 7 Days.

Statistics is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although statistics is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what statistics is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement statistical methods used in machine learning with Python in seven days.

This is a big and important post. You might want to bookmark it.

Kick-start your project with my new book Statistics for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Graham Cook, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

You know your way around basic Python for programming.
You may know some basic NumPy for array manipulation.
You want to learn statistics to deepen your understanding and application of machine learning.

You do NOT need to be:

A math whiz!
A machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of statistical methods.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

How to Set Up a Python Environment for Machine Learning and Deep Learning with Anaconda

Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with statistics for machine learning in Python:

Lesson 01: Statistics and Machine Learning
Lesson 02: Introduction to Statistics
Lesson 03: Gaussian Distribution and Descriptive Stats
Lesson 04: Correlation Between Variables
Lesson 05: Statistical Hypothesis Tests
Lesson 06: Estimation Statistics
Lesson 07: Nonparametric Statistics

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the statistical methods and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Statistical Methods for Machine Learning.”

Lesson 01: Statistics and Machine Learning

In this lesson, you will discover the five reasons why a machine learning practitioner should deepen their understanding of statistics.

1. Statistics in Data Preparation

Statistical methods are required in the preparation of train and test data for your machine learning model.

This includes techniques for:

Outlier detection.
Missing value imputation.
Data sampling.
Data scaling.
Variable encoding.

And much more.

A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.

2. Statistics in Model Evaluation

Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

This includes techniques for:

Data sampling.
Data resampling.
Experimental design.

Resampling techniques such as k-fold cross-validation are often well understood by machine learning practitioners, but the rationale for why this method is required is not.

3. Statistics in Model Selection

Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

These include techniques for:

Checking for a significant difference between results.
Quantifying the size of the difference between results.

This might include the use of statistical hypothesis tests.

4. Statistics in Model Presentation

Statistical methods are required when presenting the skill of a final model to stakeholders.

This includes techniques for:

Summarizing the expected skill of the model on average.
Quantifying the expected variability of the skill of the model in practice.

This might include estimation statistics such as confidence intervals.

5. Statistics in Prediction

Statistical methods are required when making a prediction with a finalized model on new data.

This includes techniques for:

Quantifying the expected variability for the prediction.

This might include estimation statistics such as prediction intervals.

Your Task

For this lesson, you must list three reasons why you personally want to learn statistics.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of statistics.

Lesson 02: Introduction to Statistics

In this lesson, you will discover a concise definition of statistics.

Statistics is a required prerequisite for most books and courses on applied machine learning. But what exactly is statistics?

Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

Because the field comprises a grab bag of methods for working with data, it can seem large and amorphous to beginners. It can be hard to see the line between methods that belong to statistics and methods that belong to other fields of study.

When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

Descriptive Statistics: Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share.
Inferential Statistics: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample.
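To make the distinction concrete, here is a minimal sketch that treats one sample both ways. The sample itself (50 observations with a mean of 50 and a standard deviation of 5) and the variable names are assumptions chosen for illustration; the interval uses the standard t-based formula for a population mean and is only one of many inferential methods.

```python
# descriptive vs. inferential statistics on a single sample (illustrative sketch)
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import median
from numpy import std
from numpy import sqrt
from scipy.stats import t
# seed the random number generator
seed(1)
# hypothetical sample: 50 observations from a Gaussian population
sample = 5 * randn(50) + 50
# descriptive: summarize the observations we actually have
sample_mean = mean(sample)
sample_median = median(sample)
sample_std = std(sample, ddof=1)
print('Descriptive: mean=%.3f, median=%.3f, std=%.3f' % (sample_mean, sample_median, sample_std))
# inferential: 95% confidence interval for the unknown population mean
n = len(sample)
std_err = sample_std / sqrt(n)
crit = t.ppf(0.975, df=n - 1)
lower = sample_mean - crit * std_err
upper = sample_mean + crit * std_err
print('Inferential: 95%% CI for population mean = (%.3f, %.3f)' % (lower, upper))
```

The first print describes the sample only; the second makes a claim about the population the sample was drawn from.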

Your Task

For this lesson, you must list three methods that can be used for each of descriptive and inferential statistics.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover the Gaussian distribution and how to calculate summary statistics.

Lesson 03: Gaussian Distribution and Descriptive Stats

In this lesson, you will discover the Gaussian distribution for data and how to calculate simple descriptive statistics.

A sample of data is a snapshot from a broader population of all possible observations that could be taken from a domain or generated by a process.

Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar with.

A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data.

Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution, can be summarized with just two parameters:

Mean. The central tendency or most likely value in the distribution (the top of the bell).
Variance. The average squared difference that observations have from the mean value in the distribution (the spread).

The units of the mean are the same as the units of the distribution, although the units of the variance are squared, and therefore harder to interpret. A popular alternative to the variance parameter is the standard deviation, which is simply the square root of the variance, returning the units to be the same as those of the distribution.

The mean, variance, and standard deviation can be calculated directly on data samples in NumPy.

The example below generates a sample of 10,000 random numbers drawn from a Gaussian distribution with a known mean of 50 and a standard deviation of 5 and calculates the summary statistics.

# calculate summary stats
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import var
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate statistics
print('Mean: %.3f' % mean(data))
print('Variance: %.3f' % var(data))
print('Standard Deviation: %.3f' % std(data))

Run the example and compare the estimated mean and standard deviation with the expected values.
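One detail worth knowing: NumPy's var() and std() divide by N by default (the population formula). When you are estimating the spread of a population from a sample, you may prefer to divide by N - 1, which is what the ddof=1 argument does. The sketch below reuses the same data as the example above; the variable names are my own and the difference is tiny at this sample size.

```python
# population vs. sample variance in NumPy (illustrative sketch)
from numpy.random import seed
from numpy.random import randn
from numpy import var
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# default ddof=0 divides by N (population formula)
pop_var = var(data)
# ddof=1 divides by N - 1 (sample estimate)
sample_var = var(data, ddof=1)
sample_std = std(data, ddof=1)
print('Variance (ddof=0): %.4f' % pop_var)
print('Variance (ddof=1): %.4f' % sample_var)
print('Standard Deviation (ddof=1): %.4f' % sample_std)
```

With small samples the gap between the two is larger, so it is worth being explicit about which one you report.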

Your Task

For this lesson, you must implement the calculation of one descriptive statistic from scratch in Python, such as the calculation of a sample mean.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to quantify the relationship between two variables.

Lesson 04: Correlation Between Variables

In this lesson, you will discover how to calculate a correlation coefficient to quantify the relationship between two variables.

Variables in a dataset may be related for lots of reasons.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases.

Positive Correlation: Both variables change in the same direction.
Neutral Correlation: No relationship in the change of the variables.
Negative Correlation: Variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We can quantify the relationship between samples of two variables using a statistical method called Pearson’s correlation coefficient, named for the developer of the method, Karl Pearson.

The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient for samples of two variables.

The complete example is listed below showing the calculation where one variable is dependent upon the second.

# calculate correlation coefficient
from numpy.random import seed
from numpy.random import randn
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, p = pearsonr(data1, data2)
# display the correlation
print('Pearsons correlation: %.3f' % corr)

Run the example and review the calculated correlation coefficient.
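If you are curious what pearsonr() is doing, the coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below recomputes it that way on the same data and compares the result with SciPy; the name manual is hypothetical and chosen only for this illustration.

```python
# Pearson's correlation as covariance / (std * std) (illustrative sketch)
from numpy.random import seed
from numpy.random import randn
from numpy import cov
from numpy import std
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# sample covariance between the two variables (off-diagonal of the 2x2 matrix)
covariance = cov(data1, data2)[0, 1]
# normalize by the product of the sample standard deviations
manual = covariance / (std(data1, ddof=1) * std(data2, ddof=1))
# compare with SciPy
corr, _ = pearsonr(data1, data2)
print('Manual: %.3f, SciPy: %.3f' % (manual, corr))
```

Dividing by the standard deviations is what bounds the coefficient between -1 and 1, regardless of the units of the variables.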

Your Task

For this lesson, you must load a standard machine learning dataset and calculate the correlation between each pair of numerical variables.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover statistical hypothesis tests.

Lesson 05: Statistical Hypothesis Tests

In this lesson, you will discover statistical hypothesis tests and how to compare two samples.

Data must be interpreted in order to add meaning. We can interpret data by assuming a specific structure for our outcome and using statistical methods to confirm or reject the assumption.

The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests.

The assumption of a statistical test is called the null hypothesis, or hypothesis zero (H0 for short). It is often called the default assumption, or the assumption that nothing has changed. A violation of the test’s assumption is often called the first hypothesis, hypothesis one, or H1 for short.

Hypothesis 0 (H0): Assumption of the test holds and we fail to reject it.
Hypothesis 1 (H1): Assumption of the test does not hold and is rejected at some level of significance.

We can interpret the result of a statistical hypothesis test using a p-value.

The p-value is the probability of observing data at least as extreme as the data we collected, given that the null hypothesis is true.

A large p-value means the data are consistent with H0, so we fail to reject the default assumption. A small value, such as below 5% (0.05), suggests that the data would be unlikely if H0 were true and that we can reject H0 in favor of H1, or that something is likely to be different (e.g. a significant result).

A widely used statistical hypothesis test is the Student’s t-test for comparing the mean values from two independent samples.

The default assumption is that there is no difference between the samples, whereas a rejection of this assumption suggests some significant difference. The test assumes that both samples were drawn from a Gaussian distribution and have the same variance.

The Student’s t-test can be implemented in Python via the ttest_ind() SciPy function.

Below is an example of calculating and interpreting the Student’s t-test for two data samples that are known to be different.

# student's t-test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# compare samples
stat, p = ttest_ind(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
	print('Same distributions (fail to reject H0)')
else:
	print('Different distributions (reject H0)')

Run the code and review the calculated statistic and interpretation of the p-value.
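The equal-variance assumption does not always hold in practice. As a hedged sketch, the ttest_ind() function accepts an equal_var argument; setting it to False runs Welch's variant of the test, which does not assume equal variances. The second sample below (larger spread, smaller size) is a made-up scenario for illustration only.

```python
# student's t-test vs. welch's t-test (illustrative sketch)
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind
# seed the random number generator
seed(1)
# hypothetical samples with different spreads and different sizes
data1 = 5 * randn(100) + 50
data2 = 10 * randn(30) + 51
# standard test: assumes equal variances
stat_student, p_student = ttest_ind(data1, data2)
# welch's variant: does not assume equal variances
stat_welch, p_welch = ttest_ind(data1, data2, equal_var=False)
print('Student: stat=%.3f, p=%.3f' % (stat_student, p_student))
print('Welch:   stat=%.3f, p=%.3f' % (stat_welch, p_welch))
```

When the two results disagree, it is a hint that the equal-variance assumption matters for your data.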

Your Task

For this lesson, you must list three other statistical hypothesis tests that can be used to check for differences between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover estimation statistics as an alternative to statistical hypothesis testing.

Lesson 06: Estimation Statistics

In this lesson, you will discover estimation statistics that may be used as an alternative to statistical hypothesis tests.

Statistical hypothesis tests can be used to indicate whether the difference between two samples is due to random chance, but cannot comment on the size of the difference.

A group of methods referred to as “new statistics” are seeing increased use instead of or in addition to p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated values. This group of statistical methods is referred to as estimation statistics.

Estimation statistics is a term used to describe three main classes of methods:

Effect Size. Methods for quantifying the size of an effect given a treatment or intervention.
Interval Estimation. Methods for quantifying the amount of uncertainty in a value.
Meta-Analysis. Methods for quantifying the findings across multiple similar studies.

Of the three, perhaps the most useful in applied machine learning is interval estimation.

There are three main types of intervals. They are:

Tolerance Interval: The bounds or coverage of a proportion of a distribution with a specific level of confidence.
Confidence Interval: The bounds on the estimate of a population parameter.
Prediction Interval: The bounds on a single observation.

A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.

This can be implemented in Python using the proportion_confint() statsmodels function.

The function takes the count of successes (or failures), the total number of trials, and the significance level as arguments and returns the lower and upper bound of the confidence interval.

The example below demonstrates this function in a hypothetical case where a model made 88 correct predictions out of a dataset with 100 instances and we are interested in the 95% confidence interval (provided to the function as a significance of 0.05).

# calculate the confidence interval
from statsmodels.stats.proportion import proportion_confint
# calculate the interval
lower, upper = proportion_confint(88, 100, 0.05)
print('lower=%.3f, upper=%.3f' % (lower, upper))

Run the example and review the confidence interval on the estimated accuracy.
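To see where these bounds come from, here is a minimal sketch of the normal-approximation interval for a proportion, calculated by hand. As far as I know this is the same method proportion_confint() uses by default, but treat that as an assumption and check the documentation for your installed version. The variable names are my own.

```python
# binomial proportion confidence interval via the normal approximation (sketch)
from math import sqrt
from scipy.stats import norm
# 88 correct predictions out of 100, 95% interval
correct, total, alpha = 88, 100, 0.05
accuracy = correct / total
# critical value of the standard normal for a two-sided interval
z = norm.ppf(1 - alpha / 2)
# standard error of a proportion, scaled by the critical value
margin = z * sqrt(accuracy * (1 - accuracy) / total)
lower, upper = accuracy - margin, accuracy + margin
print('lower=%.3f, upper=%.3f' % (lower, upper))
```

The margin shrinks with the square root of the number of instances, so a test set four times larger gives an interval roughly half as wide.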

Your Task

For this lesson, you must list two methods for calculating the effect size in applied machine learning and when they might be useful.

As a hint, consider one for the relationship between variables and one for the difference between samples.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover nonparametric statistical methods.

Lesson 07: Nonparametric Statistics

In this lesson, you will discover statistical methods that may be used when your data does not come from a Gaussian distribution.

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Data in which the distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

Before a nonparametric statistical method can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its name suggests.

The procedure is as follows:

Sort all data in the sample in ascending order.
Assign an integer rank from 1 to N for each unique value in the data sample.
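As a small illustration of the procedure above, the rankdata() SciPy function returns the rank of each observation in its original position. The data values here are made up. Note that when values are tied, rankdata() assigns the average of the tied ranks by default, so ranks are not always integers.

```python
# ranking a data sample (illustrative sketch)
from scipy.stats import rankdata
# hypothetical observations with no ties
data = [12.5, 3.1, 7.8, 9.4]
ranks = rankdata(data)
print(ranks)
# hypothetical observations with a tie: the two 2s share ranks 2 and 3
tied = [1, 2, 2, 3]
tied_ranks = rankdata(tied)
print(tied_ranks)
```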

A widely used nonparametric statistical hypothesis test for checking for a difference between two independent samples is the Mann-Whitney U test, named for Henry Mann and Donald Whitney.

It is the nonparametric equivalent of the Student’s t-test but does not assume that the data is drawn from a Gaussian distribution.

The test can be implemented in Python via the mannwhitneyu() SciPy function.

The example below demonstrates the test on two data samples drawn from a uniform distribution known to be different.

# example of the mann-whitney u test
from numpy.random import seed
from numpy.random import rand
from scipy.stats import mannwhitneyu
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
# compare samples
stat, p = mannwhitneyu(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
	print('Same distribution (fail to reject H0)')
else:
	print('Different distribution (reject H0)')

Run the example and review the calculated statistics and interpretation of the p-value.
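One caveat: my understanding is that the default for the alternative argument of mannwhitneyu() has differed between SciPy releases, so the p-value you see may depend on your installed version. A cautious sketch is to state the alternative explicitly, as below, so the test is two-sided regardless of version.

```python
# mann-whitney u test with an explicit two-sided alternative (illustrative sketch)
from numpy.random import seed
from numpy.random import rand
from scipy.stats import mannwhitneyu
# seed the random number generator
seed(1)
# generate two independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
# compare samples, asking explicitly for a two-sided test
stat, p = mannwhitneyu(data1, data2, alternative='two-sided')
print('Statistics=%.3f, p=%.3f' % (stat, p))
```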

Your Task

For this lesson, you must list three additional nonparametric statistical methods.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

The importance of statistics in applied machine learning.
A concise definition of statistics and a division of methods into two main types.
The Gaussian distribution and how to describe data with this distribution using statistics.
How to quantify the relationship between the samples of two variables.
How to check for the difference between two samples using statistical hypothesis tests.
An alternative to statistical hypothesis tests called estimation statistics.
Nonparametric methods that can be used when data is not drawn from the Gaussian distribution.

This is just the beginning of your journey with statistics for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Statistical Methods for Machine Learning.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

328 Responses to Statistics for Machine Learning (7-Day Mini-Course)

james lim August 4, 2018 at 10:08 am #

Why not R ?

- Jason Brownlee August 5, 2018 at 5:21 am #
  
  Great question, I explain why here:
  https://machinelearningmastery.com/python-growing-platform-applied-machine-learning/
  
  - Apanta December 7, 2020 at 6:08 pm #
    
    thank you very much
    
    - Jason Brownlee December 8, 2020 at 7:40 am #
      
      You’re welcome.
      
  - Apanta December 7, 2020 at 8:20 pm #
    
    Thank you very much !
    
    I am also looking for a lotto 35 \ 48 random number generator code.
    
    - Jason Brownlee December 8, 2020 at 7:42 am #
      
      Lotto cannot be predicted:
      https://machinelearningmastery.com/faq/single-faq/can-i-use-machine-learning-to-predict-the-lottery
      
- Aradhika Acharya August 10, 2018 at 5:00 pm #
  
  To understand how the ML algorithms work behind the scenes.
  
- Gbolahan April 12, 2020 at 5:34 am #
  
  To understand how ML works.
  To be able to work through the tutorials effectively.
  And to have confidence in getting my hands dirty on ML
  
  - Jason Brownlee April 12, 2020 at 6:27 am #
    
    Great!
    
Aradhika August 10, 2018 at 4:59 pm #

To understand how the Machine Learning algorithms work behind the scenes.

Aradhika Acharya August 10, 2018 at 5:08 pm #

Descriptive – Median, Standard Deviation, Mode
Inferential – AUC, Kappa-Statistics Test, Confusion Matrix, F-1 Score

- Manik Aggarwal September 18, 2018 at 5:49 am #
  
  Hey Aradhika.. Thanks for the valuable input. Could you let me know the URL for the course. I am unable to access the same.
  
  - Jason Brownlee September 18, 2018 at 6:26 am #
    
    This page is the course.
    
MLData Crunch August 13, 2018 at 5:40 pm #

Inspired. Thank you for the deep description with practical codes. I really learnt a lot. Appreciate your work.

- Jason Brownlee August 14, 2018 at 6:15 am #
  
  I’m glad it helped.
  
Nadya August 30, 2018 at 7:00 am #

Why I want to learn statistics:
– I’d like to understand what I’m doing while training a model and whether it makes sense: bias, assumptions, that matters a lot;
– I’d like to understand the difference between classical statistical and bayesian methods;
– I’d like to learn to compare models in more detail than just by looking at accuracy figures.

- Jason Brownlee August 30, 2018 at 4:48 pm #
  
  Thanks Nadya!
  
Manik Aggarwal September 18, 2018 at 5:47 am #

@ Jason: I am unable to access the link for the mini course. Could you let me know the correct URL.

Would really appreciate it!

- Jason Brownlee September 18, 2018 at 6:25 am #
  
  What link?
  
  - Simas April 22, 2020 at 7:05 am #
    
    Hey Jason, it seems like the link to access the course is broken. After putting in my email address the download button doesn’t do anything and just keeps my cursor spinning.
    
    - Jason Brownlee April 22, 2020 at 7:47 am #
      
      Sorry to hear that, perhaps try refreshing your browser and try again? Or try a different browser?
      
Anirban September 29, 2018 at 2:52 am #

Can we also check correlation among input features using statistical hypothesis test?

- Jason Brownlee September 29, 2018 at 6:36 am #
  
  Sure.
  
Anon October 11, 2018 at 10:06 am #

I would like to learn statistics to deepen my understanding of ML and have a fair background on statistics

- Jason Brownlee October 11, 2018 at 4:12 pm #
  
  Great, you’re in the right place!
  
Mohamed October 11, 2018 at 8:07 pm #

Hello Jason,
In response of task of lesson 02, I found:
– as descriptive statistics normal (or Gaussian), binomial and Poisson distributions.
– as inferential methods we have ANOVA, t-tests and regression analysis.
Is it correct?

- Jason Brownlee October 12, 2018 at 6:38 am #
  
  Nice!
  
Mohamed October 24, 2018 at 6:36 am #

Hi Jason,
I did the task of lesson 03 and here’s my code to calculate from scratch a sample mean.

# Calculate sample mean from scratch
from numpy.random import seed
from numpy.random import randn

# seed random number generator
seed(1)

# generate the distribution
datas = 5 * randn(100) + 50

# compute the sample mean
total = 0.0
i = 0
for element in datas:
	total += element
	i += 1
	
mean = total/i

# display sample mean
print('sample mean: %.3f' % mean)

Hope that’s the task you asked for.

Jason Brownlee October 24, 2018 at 2:37 pm #

Nice work!

Mohamed October 25, 2018 at 7:05 am #

Hi Jason,
I used a local copy of the iris dataset for the task of lesson 4.
Below is my code to calculate the correlation between each pair of sepal and petal variables.

# Calculate dataset correlation coefficient
from scipy.stats import pearsonr
import pandas

# load the dataset
url = '/sdcard/Perso/dev/iris.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

# prepare data (cast to float, since the string class column makes the array object-typed)
array = dataset.values
datas = []
datas.append(('sepal length', array[:, 0].astype(float)))
datas.append(('sepal width', array[:, 1].astype(float)))
datas.append(('petal length', array[:, 2].astype(float)))
datas.append(('petal width', array[:, 3].astype(float)))

# calculate the correlation between each pair of numerical variables
for i in range(3):
    for j in range(i + 1, 4):
        # calculate Pearson's correlation
        corr, p = pearsonr(datas[i][1], datas[j][1])
        # display correlation coefficient
        print('("%s","%s") correlation coefficient: %.3f' % (datas[i][0], datas[j][0], corr))

Jason Brownlee October 25, 2018 at 8:05 am #

Well done!

Mohamed October 31, 2018 at 7:40 am #

Hi Jason,
In reply to the lesson 5 task, I found the following statistical hypothesis tests:

– The Wald test (also called the Wald Chi-Squared Test) is a way to find out if explanatory variables in a model are significant. “Significant” means that they add something to the model; variables that add nothing can be deleted without affecting the model in any meaningful way.

– The Kolmogorov-Smirnov Goodness of Fit Test (K-S test) compares your data with a known distribution and lets you know if they have the same distribution.

– Granger causality test is a way to investigate causality between two variables in a time series

- Jason Brownlee October 31, 2018 at 2:50 pm #
  
  Well done!
  
Mohamed November 3, 2018 at 2:30 am #

Hi Jason,
For the lesson 6 task, I found that there are more than 70 effect size measures, mainly grouped into two families:
– The correlation family, or measures of association, a.k.a. the r family. E.g.:
Pearson’s r, or the correlation coefficient, which measures the linear correlation between two variables.
Eta-squared, which describes the proportion of variance in a dependent variable explained by a factor.

– The difference family, or differences between groups, a.k.a. the d family. The computation resembles the t-test statistic but, unlike the t-test statistic, it is not affected by the sample size. E.g.:
Cohen’s d, defined as the difference between the means of two independent samples divided by the standard deviation of the data.

Reply
- Jason Brownlee November 3, 2018 at 7:09 am #
  
  Nice work!
  
  Reply
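
Cohen's d, as defined in the comment above, can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the pooled standard deviation as the denominator and uses synthetic Gaussian samples, and the helper name cohens_d is hypothetical rather than part of any library.

```python
# Cohen's d for two independent samples, using the pooled standard deviation
from numpy import mean, var, sqrt
from numpy.random import seed, randn

def cohens_d(sample1, sample2):
    # hypothetical helper name, not a library function
    n1, n2 = len(sample1), len(sample2)
    # unbiased sample variances (ddof=1)
    s1, s2 = var(sample1, ddof=1), var(sample2, ddof=1)
    # pooled standard deviation across both samples
    pooled = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # difference in means expressed in units of standard deviation
    return (mean(sample1) - mean(sample2)) / pooled

seed(1)
data1 = 10 * randn(10000) + 60
data2 = 10 * randn(10000) + 55
d = cohens_d(data1, data2)
print("Cohen's d: %.3f" % d)
```

Because the two synthetic samples differ by 5 units with a standard deviation of 10, the result should land near 0.5, which is usually read as a medium effect.
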
Mohamed November 7, 2018 at 5:15 am #

Hi Jason,
The three additional nonparametric statistical methods, in reply to lesson 7 task, that I found are:

Anderson-Darling test: tests whether a sample is drawn from a given distribution

Cochran’s Q: tests whether k treatments in randomized block designs with 0/1 outcomes have the identical effects

Kendall’s tau: measures statistical dependence between two variables

Reply
- Jason Brownlee November 7, 2018 at 6:13 am #
  
  Thanks.
  
  Reply
  - Naveen Reddy Marthala July 26, 2021 at 9:24 pm #
    
    Mr Brownlee,
    
    In the [17 Statistical Hypothesis Tests in Python (Cheat Sheet)](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/) article, you mentioned that the Anderson-Darling test “Tests whether a data sample has a Gaussian distribution.”, which conflicts with the definition Mohamed gave. Which is the correct definition?
    
    Reply
    - Jason Brownlee July 27, 2021 at 5:07 am #
      
      Perhaps this will help:
      https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test
      
      Reply
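
The two definitions of the Anderson-Darling test discussed above are compatible: the test checks a sample against a chosen distribution, and the Gaussian is the common default. A hedged sketch with SciPy's anderson() function, whose dist argument defaults to 'norm', may make this concrete. The synthetic data and variable names are assumptions for illustration.

```python
# Anderson-Darling test against different reference distributions
from numpy.random import seed, randn, exponential
from scipy.stats import anderson

seed(1)
gaussian_data = 5 * randn(1000) + 50
skewed_data = exponential(scale=2.0, size=1000)

# test each sample against the normal distribution (the default)
result_gauss = anderson(gaussian_data, dist='norm')
result_skewed = anderson(skewed_data, dist='norm')
# the same function can test against other distributions, e.g. exponential
result_expon = anderson(skewed_data, dist='expon')

print('Gaussian data vs norm:  %.3f' % result_gauss.statistic)
print('Skewed data vs norm:    %.3f' % result_skewed.statistic)
print('Skewed data vs expon:   %.3f' % result_expon.statistic)
```

A larger statistic indicates a poorer fit to the reference distribution, so the skewed sample should score much worse against 'norm' than against 'expon'.
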
Mohamed November 7, 2018 at 7:05 pm #

Thanks to you Jason. I really enjoyed your mini course. It gave me the quick introduction to the field that I was looking for.
Thanks again.

Reply
- Jason Brownlee November 8, 2018 at 6:05 am #
  
  Thanks.
  
  Reply
Shravankumar Shetty January 30, 2019 at 4:53 am #

I have done all the basic Machine Learning and Deep Learning from Andrew Ng’s courses, but now I’ve got an internship and it is more focusing on data analytics and getting insights from the dataset. Hence I want to learn the statistics.

Reply
- Jason Brownlee January 30, 2019 at 8:15 am #
  
  Thanks!
  
  Reply
Sean April 4, 2019 at 5:56 am #

1) I have a specific business problem I’d like to solve that involves ML and I know statistics is important for this (not just because you said so, Jason).
2) I’ve always found statistics dry due to the way its taught in classrooms, with little context and requiring a lot of procedural memorization. I’m encouraged to learn a deeper understanding will give me the opportunity to solve a relevant problem, increasing my motivation to learn more.
3) I want to be able to better speak the language of data for business intelligence reasons

Reply
- Jason Brownlee April 4, 2019 at 8:00 am #
  
  Thanks Sean.
  
  Reply
Varun May 23, 2019 at 3:51 am #

Such a beautiful article. Thanks jason for helping the machine learning community.

Reply
- Jason Brownlee May 23, 2019 at 6:08 am #
  
  Thanks, I’m glad it helped.
  
  Reply
Namrata August 17, 2019 at 5:42 pm #

Day1 task: list three reasons why you personally want to learn statistics.

1) I am interested in learning machine learning and its implementation in the real-world scenarios

2) As you mentioned in 1st day, how the statistic is used in all phases of machine learning

3) Only knowing ML algorithms is not enough, according to me statistics is also important to get useful insights from data.

Reply
- Jason Brownlee August 18, 2019 at 6:40 am #
  
  Thanks!
  
  Reply
Floris September 4, 2019 at 2:36 am #

3 Reasons that made me want to learn statistics:

1. I currently have a deep learning project for an internship. Statistics are essential for machine learning and machine learning is essential for deep learning. See how it goes? ;D

2. I study computer science, learning what statistics is all about (in general) will help me broaden my mind in other scientific fields out of programming.

3. I currently suck at math, learning a subset field of math will gradually make me one step better at them.

Reply
- Jason Brownlee September 4, 2019 at 6:02 am #
  
  Thanks Floris!
  
  Reply
Ernest Chauke September 13, 2019 at 5:40 am #

I would like to get this book or a download

Reply
- Jason Brownlee September 13, 2019 at 5:46 am #
  
  You can learn more about the book here:
  https://machinelearningmastery.com/statistics_for_machine_learning/
  
  Reply
Marcello Graziano September 22, 2019 at 4:35 am #

1. I have to prepare model presentations for stakeholders
2. I need to sell software solutions that include machine learning models
3. I’m an engineer

Reply
- Jason Brownlee September 22, 2019 at 9:35 am #
  
  Thanks!
  
  Reply
Marcello Graziano September 22, 2019 at 7:05 pm #

Answer to your lesson 2. Descriptive: frequency, central tendency, variation. Inferential: Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), regression analysis.

Reply
- Jason Brownlee September 23, 2019 at 6:37 am #
  
  Nice work!
  
  Reply

Marcello Graziano September 22, 2019 at 7:47 pm #

Answer to your lesson 3 (i hope this is right):

# Calculate sample mean from scratch
# note: gauss() lives in Python's standard random module, not in numpy.random
from random import seed
from random import gauss

# seed random number generator
seed(1)

# compute the sample mean
total = 0.0
i = 0
for _ in range(1000):
    total += gauss(50, 5)
    i += 1
mean = total/i

# display sample mean
print('sample mean: %.3f' % mean)

Jason Brownlee September 23, 2019 at 6:38 am #

Nice!

Reply

Marcello Graziano September 24, 2019 at 12:11 am #

Hi Jason, this is the core of the code for your question number 4 (I only include the final calculation, assuming datas already holds all the structured information).

# calculate the correlation between each pair of numerical variables
for i in range(0,1):
   for j in range(0,4):
         # calculate Pearson's correlation
         corr, p = pearsonr(datas[i][j], datas[i+2][j])

         # display correlation coefficient
         print('Pearsons correlation: %.3f' % corr)

Jason Brownlee September 24, 2019 at 7:47 am #

Nice work.

Reply

Marcello Graziano September 26, 2019 at 12:24 am #

Jason, my answer for lesson 05:
Z-test: uses the sample and population mean and standard deviation to test the null hypothesis: is the sample mean the same as the population mean?

ANOVA: compares differences between three or more samples. The null hypothesis is that all sample means are equal.

Chi-square test: compares categorical variables, or checks whether a sample matches a population. The null hypothesis is that variables a and b are independent (or that the sample matches the population).

Reply
- Jason Brownlee September 26, 2019 at 6:42 am #
  
  Nice work.
  
  Reply
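
Two of the tests listed above, ANOVA and the chi-squared test, are available in SciPy as f_oneway() and chi2_contingency(). The sketch below is a minimal illustration only; the synthetic samples and the contingency table counts are made up for the example.

```python
# one-way ANOVA and chi-squared test of independence with SciPy
from numpy.random import seed, randn
from scipy.stats import f_oneway, chi2_contingency

seed(1)
# three independent samples; the third has a shifted mean
group1 = 5 * randn(100) + 50
group2 = 5 * randn(100) + 50
group3 = 5 * randn(100) + 55

# ANOVA: the null hypothesis is that all group means are equal
f_stat, p_anova = f_oneway(group1, group2, group3)
print('ANOVA: F=%.3f, p=%.5f' % (f_stat, p_anova))

# chi-squared: the null hypothesis is that rows and columns are independent
# (made-up counts for illustration)
table = [[30, 10],
         [15, 25]]
chi2, p_chi2, dof, expected = chi2_contingency(table)
print('Chi-squared: stat=%.3f, p=%.5f, dof=%d' % (chi2, p_chi2, dof))
```

In both cases a p-value below the chosen significance level (commonly 0.05) suggests rejecting the null hypothesis.
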
Marcello September 29, 2019 at 4:17 am #

Task for lesson 06:

There are two types of statistics that describe the size of an effect.

The first type is standardized; this type removes the units of the variables in the effect.

The second type is simple and describes the size of the effect, but remains in the original units of the variables.

For example, consider comparing the mean temperature under two different conditions.

The simple effect size would be the difference in the mean temperature in degrees Celsius.

The standardized effect size statistic would divide that mean difference by the standard deviation.

So if you have two conditions for temperature, the simple effect size might report that the mean temperature in condition 1 is 23 degrees higher than in condition 2.
The standardized effect size might report that the mean temperature in condition 1 is 1.8 standard deviations higher than in condition 2.

I hope this is correct

Reply
- Jason Brownlee September 29, 2019 at 6:17 am #
  
  Thanks.
  
  Reply
Marcello September 29, 2019 at 8:10 pm #

Lesson 07:

1) Kruskal–Wallis test of the hypothesis that several samples are from the same population. This test is a multisample generalization of the two-sample Wilcoxon (Mann–Whitney) rank-sum test.

2) Cusum graphs the cumulative sum (cusum) of a binary (0/1) variable, yvar, against a (usually) continuous variable, xvar.

3) Trend test performs a nonparametric test for trend across ordered groups.

There are many other methods. Thanks for this course, which has been very useful for me. I was searching for something to help me understand the basics for machine learning.

Reply
- Jason Brownlee September 30, 2019 at 6:06 am #
  
  Very well done, thanks for posting all of your answers!
  
  Reply
Alfredo Rodriguez November 5, 2019 at 1:07 pm #

Lesson 1:

1) I have always had some curiosity about AI and how it works.

2) Machine learning has such a big field for its uses.

3) This is one of the fields of computer science that I like the most.

4) Knowing that there are some things you can really predict with a certain amount of accuracy is something that I would definitely want to know (bonus)

Reply
- Jason Brownlee November 5, 2019 at 1:40 pm #
  
  Nice work!
  
  Reply
Alfredo Rodriguez November 6, 2019 at 1:45 pm #

Lesson 2:

Descriptive statistics:

* Dispersion
* Standard Deviation
* Kurtosis and Skewness

Inferential statistics:

* Analysis of Covariance (Ancova)
* Factor Analysis
* Cluster Analysis

Reply
- Jason Brownlee November 6, 2019 at 2:17 pm #
  
  Well done!
  
  Reply
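
Skewness and kurtosis, listed among the descriptive statistics above, can be computed from scratch using central moments. The sketch below is one possible illustration, assuming the biased (population) definitions that SciPy's skew() and kurtosis() use by default; the helper names are hypothetical.

```python
# skewness and excess kurtosis from scratch, checked against SciPy
from numpy import mean
from numpy.random import seed, randn
from scipy.stats import skew, kurtosis

def central_moment(data, k):
    # k-th central moment: average of (x - mean)^k
    m = mean(data)
    return mean((data - m) ** k)

def skewness_from_scratch(data):
    # third central moment scaled by the variance to the power 3/2
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis_from_scratch(data):
    # fourth central moment scaled by squared variance, minus 3 (Gaussian baseline)
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

seed(1)
data = 5 * randn(10000) + 50
print('Skewness: %.4f (scipy: %.4f)' % (skewness_from_scratch(data), skew(data)))
print('Excess kurtosis: %.4f (scipy: %.4f)' % (excess_kurtosis_from_scratch(data), kurtosis(data)))
```

For a Gaussian sample, both values should be close to zero.
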
Mounika R November 26, 2019 at 12:03 pm #

I am interested in learning statistics as I was always fascinated by how statistics can be made use of in machine learning.

Reply
- Jason Brownlee November 26, 2019 at 1:32 pm #
  
  Thanks!
  
  Reply
Mounika R November 26, 2019 at 12:38 pm #

Descriptive Statistics:

1. Mean, median, mode
2. Skewness and kurtosis
3. Variance and standard deviation

Inferential Statistics:

1. Estimation
   1. Maximum likelihood estimation
   2. Density estimation
2. Hypothesis testing
3. Confidence intervals

Reply
- Jason Brownlee November 26, 2019 at 1:32 pm #
  
  Well done!
  
  Reply
Prachi Ramesh November 26, 2019 at 1:02 pm #

Hi Jason, thanks for spreading the knowledge.
Day 1:
1. I like to work across different disciplines, and stats is the crux of understanding or discovering insights from any data: for one, descriptive stats, central tendency and much more.
2. To understand data interpretability in depth, as stats is the interpretive language for understanding data.
3. To understand how to apply the right method to the right kind of data.

Reply
- Jason Brownlee November 26, 2019 at 1:32 pm #
  
  Thanks for sharing!
  
  Reply
Allan Freitas December 7, 2019 at 1:34 pm #

#Lesson 2
#For this lesson, you must implement the calculation of one descriptive statistic from scratch in
#Python, such as the calculation of a sample mean.

#I applied this sample with Iris dataset:

import numpy as np
import math
from sklearn import datasets
iris = datasets.load_iris()

#Attributes
#1. sepal length in cm
#2. sepal width in cm
#3. petal length in cm
#4. petal width in cm

X = iris.data
print(X.size)
print(X.shape)

#column 0..all lines
sepal_lenghts = X[: , 0]

print(sepal_lenghts.size)
print(sepal_lenghts.shape)

#same thing was done above
sepal_width = X[:,1]
petal_lenght = X[:,2]
petal_width = X[:,3]

#Calculate the mean, variance and standard deviation "by hand"!
#Mean "by hand"
def mean_by_hand(data):
    i_arr_summation = 0
    for x in np.nditer(data):
        i_arr_summation += x

    size_data = data.size
    mean_data = i_arr_summation / size_data
    return mean_data

#Variance "by hand" (population variance, dividing by n)
def variance_by_hand(data, mean_data, n_data):
    sum_var = 0
    for x in np.nditer(data):
        i_var = x - mean_data #deviation (xi - mean)
        i_var *= i_var # ^2
        sum_var += i_var #summation
    variance = (1/n_data) * sum_var
    return variance

#Standard deviation "by hand". Are you serious?!
def standard_dev_by_hand(variance):
    standard_dev = math.sqrt(variance) #or variance**0.5
    return standard_dev

#Calling the functions to calculate mean, var and std
#Mean
mean_sepal_lenghts = mean_by_hand(sepal_lenghts)
print("mean sepal_lenght:", mean_sepal_lenghts)
print("NUMPY mean sepal_lenght:", np.mean(sepal_lenghts))

#Variance
n_sepal_lenghts = sepal_lenghts.size
var_sepal_lenghts = variance_by_hand(sepal_lenghts, mean_sepal_lenghts, n_sepal_lenghts)
print("var sepal_lenght:", var_sepal_lenghts)
print("NUMPY var sepal_lenght:", np.var(sepal_lenghts))

#Standard deviation
std_sepal_lengths = standard_dev_by_hand(var_sepal_lenghts)
print("std sepal_lenght:", std_sepal_lengths)
print("NUMPY std sepal_lenght:", np.std(sepal_lenghts))

Reply
- Allan Freitas December 7, 2019 at 1:39 pm #
  
  Corrected: #Lesson 03: Gaussian Distribution and Descriptive Stats
  
  Reply
- Jason Brownlee December 8, 2019 at 6:06 am #
  
  Nice work!
  
  Reply
Allan Freitas December 7, 2019 at 1:53 pm #

#lesson 4: Correlation between variables
#I applied this sample in Iris dataset, specifically in atts sepal_lenght and sepal_width to
#discover if they are correlated or not

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()

# calculate correlation coefficient
from numpy.random import seed
from numpy.random import randn
from scipy.stats import pearsonr

#Attributes
#1. sepal length in cm
#2. sepal width in cm
#3. petal length in cm
#4. petal width in cm

X = iris.data
print(X.size)
print(X.shape)

#column 0..all lines
sepal_lenghts = X[: , 0]
sepal_width = X[:,1]

print(sepal_lenghts)
type(sepal_lenghts)
print(sepal_lenghts.shape)
print(sepal_lenghts.size)

print(sepal_width)
type(sepal_width)
print(sepal_width.shape)
print(sepal_width.size)

# calculate Pearson's correlation
corr, p = pearsonr(sepal_lenghts, sepal_width)

# display the correlation: in this case, NEGATIVE CORRELATION
print('Pearsons correlation: %.3f' % corr)

Reply
- Jason Brownlee December 8, 2019 at 6:06 am #
  
  Thanks for sharing!
  
  Reply
mohammad March 3, 2020 at 7:28 pm #

Hi!
#Lesson 1
List three reasons why you personally want to learn statistics?
1- I recently learned that machine learning is based on estimation and probability, which encourages me to learn statistics.
2- Statistics gives me insight for better understanding data.
3- Machine learning solves real problems in the world, and real problems are based on statistics.

Reply
- Jason Brownlee March 4, 2020 at 5:52 am #
  
  Nice work!
  
  Reply
Harika March 13, 2020 at 12:35 pm #

Hi
Lesson1 :
1. I am interested to learn the underlying statistics in Machine Learning
2. It helps me to become good data scientist
3. Even if new models come up in ml, the stats doesn’t change so I can upgrade myself easily.

Reply
- Jason Brownlee March 13, 2020 at 1:49 pm #
  
  Thanks for sharing!
  
  Reply
Natasha March 24, 2020 at 1:32 pm #

Hi Jason,
Lesson1: List 3 reasons why you personally want to learn statistics
1. I’m always looking for new, easy to follow, yet comprehensive statistics exercise
2. I’m interested in learning about Machine Learning with examples
3. Hopefully i can apply some aspect of it towards my dissertation in geosciences.
Thank you,

Reply
- Jason Brownlee March 24, 2020 at 1:45 pm #
  
  Well done!
  
  Reply
Dominique April 19, 2020 at 9:38 pm #

Hi Jason

Lesson #1

I want to make a better link between statistics and ML.

I have recently followed a MOOC on Statistics with R (there is a post about my personal usage of statistics and R as a result of that course at http://questioneurope.blogspot.com) and I want to complement that course with yours.

Kind regards,
Dominique

Reply
- Jason Brownlee April 20, 2020 at 5:27 am #
  
  Thanks for sharing.
  
  Reply
Dominique April 21, 2020 at 1:59 am #

Hi Jason,

Lesson #2

Descriptive statistics methods :
a) Spearman correlation: for non-Gaussian data
b) Fisher test: to obtain the odds ratio
c) Chi2 test: for observations of large size.

Inferential statistics methods:
a) multiple linear regression
b) logistic regression
c) Principal Component Analysis (PCA)

Kind regards,
Dominique

Reply
- Jason Brownlee April 21, 2020 at 6:02 am #
  
  Well done!
  
  Reply
Dominique April 21, 2020 at 3:44 pm #

Hi Jason,

Lesson #3

# calculate summary stats
from numpy import mean
from numpy import var
from numpy import std

# create a simple list
mylist=[1,2,3,4,5,6,7,8,9,10]

# calculate statistics
print('Mean: %.3f' % mean(mylist))
print('Variance: %.3f' % var(mylist))
print('Standard Deviation: %.3f' % std(mylist))

Question: how do you insert the nice snippet of code in the comment?

Thanks,
Dominique

Reply
- Jason Brownlee April 22, 2020 at 5:48 am #
  
  Well done!
  
  You can use the PRE html tag.
  
  Reply

dominique April 22, 2020 at 9:32 pm #

Dear Jason,

Lesson #4 Correlation.

I am using the red wine quality dataset.

The results for correlation are:

Pearsons correlation between quality and alcohol is: 0.476
Pearsons correlation between quality and sulphates is: 0.251
Pearsons correlation between quality and chlorides is: -0.129

The code is below:

# calculate summary stats
from numpy import mean
from numpy import var
from numpy import std

# create a simple list
mylist=[1,2,3,4,5,6,7,8,9,10]

# calculate statistics
print('Mean: %.3f' % mean(mylist)) 
print('Variance: %.3f' % var(mylist)) 
print('Standard Deviation: %.3f' % std(mylist))


# Compute correlation
# calculate correlation coefficient

from scipy.stats import pearsonr

 # Load red wine dataset data using read_csv
import pandas as pd
sr = pd.read_csv('winequality-red.csv', delimiter=';')

# Print the first few rows using the head() function.
print(f'\nThe type is: {type(sr)}')

print(sr.head(10))

# prepare data
data1 = sr.quality
data2 = sr.alcohol
data3 = sr.sulphates
data4 = sr.chlorides
# calculate Pearson's correlation
corr, p = pearsonr(data1, data2)
# display the correlation
print('Pearsons correlation between quality and alcohol is: %.3f' % corr)

corr, p = pearsonr(data1, data3)
# display the correlation
print('Pearsons correlation between quality and sulphates is: %.3f' % corr)

corr, p = pearsonr(data1, data4)
# display the correlation
print('Pearsons correlation between quality and chlorides is: %.3f' % corr)

Jason Brownlee April 23, 2020 at 6:05 am #

Well done!

Reply

Dominique April 25, 2020 at 6:21 pm #

Hi Jason,

Lesson #5 Statistical Hypothesis Tests.

List three other statistical hypothesis tests that can be used to check for differences between samples:

Mann-Whitney (Wilcoxon) test: compares two samples, which can be independent or paired. In R: wilcox.test()
Fisher test: a way to test whether the observed frequencies in two samples are identical. Only for samples of small size. This test also gives the odds ratio. In R: fisher.test()
For samples of large size, the chi-squared test can be used. In R: chisq.test()

Reply
- Jason Brownlee April 26, 2020 at 6:07 am #
  
  Nice work.
  
  Reply
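
Since the course itself uses Python, the Mann-Whitney and Fisher tests mentioned above for R can also be sketched with SciPy's mannwhitneyu() and fisher_exact(). The samples and the 2x2 table of counts below are made up purely for illustration.

```python
# Mann-Whitney U test and Fisher's exact test with SciPy
from numpy.random import seed, randn
from scipy.stats import mannwhitneyu, fisher_exact

seed(1)
# two independent samples with different centers
sample1 = 5 * randn(100) + 50
sample2 = 5 * randn(100) + 55

# Mann-Whitney U: the null hypothesis is that both samples share a distribution
stat, p_mw = mannwhitneyu(sample1, sample2, alternative='two-sided')
print('Mann-Whitney U: stat=%.1f, p=%.5f' % (stat, p_mw))

# Fisher's exact test on a small 2x2 table of counts (made-up values)
table = [[8, 2],
         [1, 9]]
odds_ratio, p_fisher = fisher_exact(table)
print('Fisher exact: odds ratio=%.1f, p=%.5f' % (odds_ratio, p_fisher))
```

For paired samples, SciPy provides the separate wilcoxon() function, whereas R handles both cases through wilcox.test().
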
Dominique April 26, 2020 at 3:38 pm #

Hi Jason,

Lesson #6 Estimation statistics

For the relationship between variables: Pearson or R2 (coefficient of determination)

For the difference between samples: Cohen’s d, odds ratio (OR) or Relative Risk (RR) ratio. OR and RR can be computed by the function twoby2 in R.

Thank you very much
Dominique

Reply
- Jason Brownlee April 27, 2020 at 5:29 am #
  
  Well done!
  
  Reply
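
The odds ratio and relative risk mentioned above can also be computed by hand from a 2x2 table. The following is a minimal sketch in plain Python; the counts are invented for the example and the variable names are assumptions.

```python
# odds ratio and relative risk from a 2x2 table of made-up counts
#              outcome   no outcome
# exposed        a=20       b=80
# unexposed      c=10       d=90
a, b, c, d = 20, 80, 10, 90

# risk is the proportion with the outcome in each group
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed

# odds compare those with the outcome to those without it in each group
odds_ratio = (a / b) / (c / d)

print('Relative risk: %.3f' % relative_risk)
print('Odds ratio: %.3f' % odds_ratio)
```

The two measures differ because risk divides by the group total while odds divide by the count without the outcome; they become similar only when the outcome is rare.
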
Dominique April 28, 2020 at 3:56 pm #

Hi Jason,

Lesson #7: nonparametric statistical methods

3 examples of nonparametric statistical methods:
a) Spearman: when the variables do not follow a normal distribution
b) McNemar: needs observations at different times on the same subjects
c) Kaplan-Meier: used for survival estimation

Kind regards,

Dominique

Reply
- Jason Brownlee April 29, 2020 at 6:16 am #
  
  Well done!
  
  Reply
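
To illustrate why Spearman's rank correlation is suggested above for non-Gaussian data, the sketch below compares it with Pearson's correlation on a monotonic but strongly nonlinear relationship. The exponential relationship is an arbitrary choice for the example.

```python
# Pearson vs Spearman on a monotonic, nonlinear relationship
from numpy import exp, linspace
from scipy.stats import pearsonr, spearmanr

x = linspace(0, 10, 100)
y = exp(x)  # always increasing, but far from linear

# Pearson measures linear association on the raw values
pearson_corr, _ = pearsonr(x, y)
# Spearman measures association on the ranks of the values
spearman_corr, _ = spearmanr(x, y)

print('Pearson: %.3f' % pearson_corr)
print('Spearman: %.3f' % spearman_corr)
```

Because the ranks of x and y agree perfectly, Spearman reports a perfect correlation, while Pearson understates the strength of the relationship.
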
Steven May 5, 2020 at 8:45 am #

I would like to learn statistics because it will help me improve my data preparation and model evaluation skills.

Reply
- Jason Brownlee May 5, 2020 at 1:36 pm #
  
  Thanks!
  
  Reply
Steven May 5, 2020 at 9:18 am #

Lesson #2:
Descriptive Statistics methods: Measures of central tendency, and Measures of spread.
Inferential Statistics methods: Estimation of the parameter(s), and testing of statistical hypotheses.

Reply
- Jason Brownlee May 5, 2020 at 1:36 pm #
  
  Nice work.
  
  Reply

Steven May 5, 2020 at 10:44 am #

Lesson #3:

from numpy.random import seed
from numpy.random import randn
import numpy as np

seed(1)
data = 5 * randn(10000) + 50

def my_median(data):
    data = sorted(data)
    size = len(data)
    mid_idx = size // 2
    if size % 2 == 1:
        return data[mid_idx]
    idx1 = mid_idx - 1
    return (data[idx1] + data[mid_idx]) / 2

print('Median: {}'.format(my_median(data)))

Jason Brownlee May 5, 2020 at 1:36 pm #

Well done, great use of modern string formatting!

Reply

Steven May 5, 2020 at 11:57 am #

Lesson #4:

from scipy.stats import pearsonr
from sklearn.datasets import load_boston
import numpy as np

boston = load_boston()

def correlation_of_toy_dataset(ds, f1_name, f2_name):
    f1_idx, f2_idx = np.where(ds.feature_names == f1_name)[0][0], np.where(ds.feature_names == f2_name)[0][0]
    corr, _ = pearsonr(ds.data[:, f1_idx], ds.data[:, f2_idx])
    return corr

print('Pearsons correlation: %.3f' % correlation_of_toy_dataset(boston, 'CRIM', 'DIS'))

Jason Brownlee May 5, 2020 at 1:37 pm #

Great progress!

Reply
- Anjali Vijayvargiya January 27, 2021 at 5:21 pm #
  
  I have finished the Day01 course.
  and the reasons why I need to study statistics for ML are:
  1. To understand the various concepts such as Distribution of data, how it varies with the data, how the distribution changes with the data.
  2. To understand ML Concepts better.
  3. To gain knowledge about these concepts.
  
  Reply
  - Jason Brownlee January 28, 2021 at 5:54 am #
    
    Well done.
    
    Reply

Steven May 6, 2020 at 10:50 am #

Lesson #5
Another 3 statistical hypothesis tests are:
– Z-Test;
– ANOVA; and
– Chi-Square Test.

Reply
- Jason Brownlee May 6, 2020 at 1:37 pm #
  
  Well done!
  
  Reply
Steven May 6, 2020 at 11:37 am #

Lesson #6
Two of the methods for calculating the effect size:
– Pearson r correlation; and
– Cohen’s d effect size.

Reply
- Jason Brownlee May 6, 2020 at 1:37 pm #
  
  Great work!
  
  Reply
Steven May 6, 2020 at 11:48 am #

Lesson #7:
3 other nonparametric statistical methods:
– Wilcoxon Signed-Rank Test;
– Kruskal-Wallis H Test; and
– Friedman Test.

Reply
- Jason Brownlee May 6, 2020 at 1:38 pm #
  
  Excellent.
  
  Reply
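
The three nonparametric tests listed above are available in SciPy as wilcoxon(), kruskal() and friedmanchisquare(). The sketch below is only an illustration under assumed synthetic data: the first two examples use repeated measurements on the same subjects, the third uses independent groups.

```python
# Wilcoxon signed-rank, Friedman and Kruskal-Wallis tests with SciPy
from numpy.random import seed, randn
from scipy.stats import wilcoxon, kruskal, friedmanchisquare

seed(1)
# paired measurements on the same 50 subjects at three time points
time1 = 5 * randn(50) + 50
time2 = time1 + randn(50) + 2
time3 = time1 + randn(50) + 4

# Wilcoxon signed-rank: two paired samples
stat_w, p_w = wilcoxon(time1, time2)
# Friedman: three or more paired samples
stat_f, p_f = friedmanchisquare(time1, time2, time3)

# Kruskal-Wallis H: three or more independent samples
group1 = 5 * randn(50) + 50
group2 = 5 * randn(50) + 50
group3 = 5 * randn(50) + 60
stat_k, p_k = kruskal(group1, group2, group3)

print('Wilcoxon: p=%.5f' % p_w)
print('Friedman: p=%.5f' % p_f)
print('Kruskal-Wallis: p=%.5f' % p_k)
```

The choice between them depends on whether the samples are paired and on how many samples are being compared.
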
Rishi May 8, 2020 at 4:51 pm #

Lesson #2
Descriptive statistics
Central tendency
Skewness
Correlation

Inferential Statistics
Statistical significance
Confidence intervals
Hypothesis Testing

Reply
- Jason Brownlee May 9, 2020 at 6:09 am #
  
  Nice work!
  
  Reply
Amit Desai May 11, 2020 at 5:02 pm #

I am trying to learn ML from different channels, and I am finding that my conceptual knowledge of statistics is at a very low level.

1. I want to enhance my stats learning skill using this course.
2. Time matters to me a lot and so the course duration as mentioned by you matters a lot
3. Concept clarity and connecting back to real world challenges is very important and your commitment in course description brings me here..

Reply
- Jason Brownlee May 12, 2020 at 6:37 am #
  
  Thank you!
  
  Reply
Changrong May 13, 2020 at 5:11 am #

Hi Jason,

Thank you for this course focusing on statistics in ML. Regarding calculating correlations between variables, I have two questions:

1. I understand that multicollinearity damages the performance of some algorithms, like linear regression. Does multicollinearity also badly influence non-linear algorithms?
2. A more practical question: when we detect that some variables are highly correlated, what should we do? For each pair of correlated variables, which one should we usually consider deleting? Is there a standard way to remove multicollinearity?

Thank you in advance!

Reply
- Jason Brownlee May 13, 2020 at 6:46 am #
  
  You’re welcome.
  
  Depends on the algorithm. Sometimes yes, generally, no.
  
  Try removing redundant inputs and compare model performance on raw vs transformed data. PCA is a super easy way to do this.
  
  Reply
  - Changrong May 13, 2020 at 7:03 am #
    
    Thank you for your answer Jason. To follow up on your second answer: for example, by calculating Pearson correlation coefficients, I found that multiple variables are highly correlated with each other. How can I determine which ones are redundant and keep the representative one? Also for PCA, do you mean using PCA to reduce the dimension and turn the variables into principal components? Thanks!
    
    Reply
    - Jason Brownlee May 13, 2020 at 7:43 am #
      
      Yes, I believe the common approach is to score the correlation of each variable with all others and remove a subset of the most correlated. I don’t have a worked example.
      
      Yes, PCA will create a projection of the dataset with linear dependencies removed.
      
      Reply
      - Changrong May 13, 2020 at 7:55 am #
        
        Thank you Jason, it is very helpful.
      - Jason Brownlee May 13, 2020 at 1:21 pm #
        
        You’re welcome.
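
The approach described in the thread above, scoring pairwise correlations and removing the most correlated variables, might be sketched as follows. This is one possible heuristic rather than a standard procedure: the 0.9 threshold, the rule of dropping the later column of each pair, and the helper name drop_correlated are all assumptions for illustration.

```python
# drop one variable from each highly correlated pair (illustrative heuristic)
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    # hypothetical helper: absolute Pearson correlation between all column pairs
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # a column is dropped if it is highly correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.RandomState(1)
a = rng.randn(500)
df = pd.DataFrame({
    'a': a,
    'a_copy': a + 0.01 * rng.randn(500),  # nearly duplicates 'a'
    'b': rng.randn(500),                  # independent of 'a'
})
reduced, dropped = drop_correlated(df)
print('Dropped:', dropped)
print('Remaining:', list(reduced.columns))
```

As suggested in the replies, it is worth comparing model performance on the raw and reduced data before committing to the smaller feature set.
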
Skylar May 16, 2020 at 4:44 am #

Hi Jason,

Thank you for your probability course, I found it is very useful to help me understand ML algorithms. You mentioned two metrics: log loss and Brier score, and I understand that we can use them instead of Accuracy when we output probability in the classification problem. I have two questions regarding them:

1. I wonder for classification problems, when should we output class labels (use accuracy as metric) and when should we output class probability (then use log loss and Brier score as metric)? You mentioned that the probability can provide additional nuance for the predictions, do you mean this way is better?

2. What are the differences between log loss and Brier score from the application point of view?

Thank you very much in advance!

Reply
- Jason Brownlee May 16, 2020 at 6:23 am #
  
  You’re welcome, I’m happy to hear that.
  
  Good question – the problem requirements or project goal will dictate what to predict, e.g. labels or probability. If not, it might be a fake/toy/practice problem and you can make it up.
  
  Probability is not better, it is different. It shares uncertainty which is useful in some domains and not in others.
  
  The difference is here:
  https://machinelearningmastery.com/probability-metrics-for-imbalanced-classification/
  
  Great questions!
  
  Reply
  - Skylar May 16, 2020 at 3:54 pm #
    
    Thank you Jason, very helpful like always!
    
    Reply
    - Jason Brownlee May 17, 2020 at 6:27 am #
      
      You’re welcome.
      
      Reply
  - NewDabai May 16, 2020 at 4:01 pm #
    
    Hi Jason, what does fake/toy/practice problem mean?
    
    Reply
    - Jason Brownlee May 17, 2020 at 6:28 am #
      
      Not a real problem – e.g. you are just using it to learn and there are no project stakeholders concerned with the success/failure of the project.
      
      Reply
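
For the question above about log loss versus the Brier score, a small from-scratch sketch can show how differently the two metrics treat a confident mistake. The labels, probabilities and helper names below are invented for illustration.

```python
# log loss and Brier score from scratch for binary predictions
from numpy import array, clip, log, mean

def log_loss_from_scratch(y_true, y_prob, eps=1e-15):
    # clip probabilities to avoid log(0)
    p = clip(y_prob, eps, 1 - eps)
    return -mean(y_true * log(p) + (1 - y_true) * log(1 - p))

def brier_score_from_scratch(y_true, y_prob):
    # mean squared error between probabilities and 0/1 labels
    return mean((y_prob - y_true) ** 2)

y_true = array([1, 0, 1, 0])
good = array([0.9, 0.1, 0.8, 0.2])
# same as above, except the last prediction is confidently wrong
confident_wrong = array([0.9, 0.1, 0.8, 0.99])

for name, probs in [('good', good), ('confident wrong', confident_wrong)]:
    print('%s: log loss=%.3f, Brier=%.3f' % (
        name, log_loss_from_scratch(y_true, probs), brier_score_from_scratch(y_true, probs)))
```

The Brier score is bounded between 0 and 1, whereas log loss grows without bound as a wrong prediction approaches full confidence, so log loss punishes overconfident errors more severely.
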
Satya May 22, 2020 at 7:23 pm #

I want to learn data science, and statistics is an important pillar to become an expert in

Reply
- Jason Brownlee May 23, 2020 at 6:17 am #
  
  Thanks!
  
  Reply
omkar May 28, 2020 at 3:31 am #

Lesson 1:
1. I am getting a good vibe and understanding of ML. Want to explore it properly
2. Stats is what i feel is very much imp from job perspective also
3. Your platform has helped me several times and will also help me in better understanding the
future concepts of stats

Reply
- Jason Brownlee May 28, 2020 at 6:19 am #
  
  Thanks.
  
  Reply
Narendran May 29, 2020 at 12:04 pm #

Lesson 1

1. There are a number of statistical formulas in data pre-processing, model building, and evaluation. Such formulas are spread throughout data mining and machine learning, which pushed me to look into statistics and take this mini-course.
2. I would like to understand and measure data distributions, as each kind of distribution changes the nature of the problem we handle. I hope statistics will help to quantify and measure a few interesting features of distributions.
3. I would like to go in depth on statistics and understand them better.

Reply
- Jason Brownlee May 29, 2020 at 1:24 pm #
  
  Nice work!
  
  Reply
Narendran May 29, 2020 at 12:32 pm #

Lesson 2

1. Descriptive Methods:
Mean, Median, Mode, Range, Frequency describing the shape , center and spread.

2. Inferential Methods:
Hypothesis testing, t-test, ANOVA, F-test, Correlation (chi-square)

Reply
- Jason Brownlee May 29, 2020 at 1:24 pm #
  
  Great work.
  
  Reply
Sana June 2, 2020 at 4:03 am #

I want to learn statistics because,
1. I’ve recently gained interest in Data Science and statistics seems to be a big part
2. it will help me understand and implement the correct ML models
3. it will make me more confident, knowing the dataset in its entirety

Reply
- Jason Brownlee June 2, 2020 at 6:21 am #
  
  Thanks!
  
  Reply
Sana June 3, 2020 at 4:57 am #

Descriptive Statistics is used to summarize data and represent it with a single value. Some common descriptive statistics tools are -> mean, standard deviation and variance.

Inferential Statistics is used to study the data and reach a conclusion. Methods that help in obtaining inferences are -> correlation, hypothesis testing (Z, t, F tests), ANOVA

Reply
- Jason Brownlee June 3, 2020 at 8:04 am #
  
  Nice work!
  
  Reply
Sana June 4, 2020 at 12:14 am #

#Day 3!

import numpy as np
from numpy.random import seed
from numpy.random import randn

seed(1)

def calc_mean(data):
    # arithmetic mean from scratch: sum of the values divided by their count
    return sum(data) / len(data)

data_set = 5 * randn(10000) + 50
data_mean = calc_mean(data_set)
print("%.4f" % data_mean)

#Also it is very commendable how you reply to every single comment. It’s very kind of you.

Reply
- Jason Brownlee June 4, 2020 at 6:23 am #
  
  Well done!
  
  Reply
Sana June 5, 2020 at 5:41 am #

#I didn’t know what standard dataset meant so I picked up the Titanic Survival dataset on
#Kaggle

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

data_set = pd.read_csv("train.csv")

print(data_set.head())

survived = data_set['Survived'] #value represents whether the passenger survived the
#sinking of Titanic. This is the target variable

pclass = data_set['Pclass'] #the class of ticket bought
sibsp = data_set['SibSp'] #number of siblings/spouses aboard
parch = data_set['Parch'] #number of parents/children aboard

corr_coeff, p = pearsonr(survived, pclass)
print("Correlation between Survived and Pclass: %.4f" % corr_coeff)

corr_coeff, p = pearsonr(survived, sibsp)
print("Correlation between Survived and sibsp: %.4f" % corr_coeff)

corr_coeff, p = pearsonr(survived, parch)
print("Correlation between Survived and parch: %.4f" % corr_coeff)

#Only survived and parch had a positive correlation, 0.0816

Reply
- Jason Brownlee June 5, 2020 at 8:24 am #
  
  Well done!
  
  Reply
Sana June 6, 2020 at 1:14 am #

#Day 5

Other tests for hypothesis testing:

1. Z-test : Similar to the t-test but used when the population variance is known or the sample size is large (a common rule of thumb is more than 30)
2. Chi-square test : It is used to perform hypothesis testing on categorical data
3. ANOVA : If we are comparing more than 2 means/sample parameters, ANOVA is used
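
Two of the tests listed above can be tried quickly with SciPy. This is only a minimal sketch: the contingency table and the three Gaussian samples are made-up data (an assumption for illustration), and `chi2_contingency` and `f_oneway` are SciPy's functions for the chi-squared test of independence and one-way ANOVA.

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway

# chi-squared test of independence on a hypothetical 2x2 table of counts
table = np.array([[90, 10],
                  [50, 50]])
chi2, p_chi2, dof, expected = chi2_contingency(table)
print("Chi-squared: stat=%.3f, p=%.4f, dof=%d" % (chi2, p_chi2, dof))

# one-way ANOVA on three hypothetical Gaussian samples (the third has a shifted mean)
rng = np.random.default_rng(1)
g1 = rng.normal(50, 5, 100)
g2 = rng.normal(50, 5, 100)
g3 = rng.normal(55, 5, 100)
f_stat, p_anova = f_oneway(g1, g2, g3)
print("ANOVA: F=%.3f, p=%.4f" % (f_stat, p_anova))
```

In both cases a small p-value (for example below 0.05) suggests rejecting the null hypothesis: independence of the two categorical variables for the chi-squared test, equal group means for ANOVA.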

Reply
- Jason Brownlee June 6, 2020 at 7:54 am #
  
  Well done!
  
  Reply
Sana June 9, 2020 at 5:10 am #

Going through your very helpful article about estimation statistics and calculating effect size, methods to find effect size are,

1. Calculating Pearson’s coefficient of correlation
2. Cohen’s d

Reply
- Jason Brownlee June 9, 2020 at 6:07 am #
  
  Nice!
  
  Reply
Sana June 10, 2020 at 3:38 am #

Final day, day 7!

Nonparametric statistical methods can be divided into two categories,

1. Calculating correlation based on ranks: Spearman’s correlation coefficient; Kendall’s correlation coefficient
2. Comparing sample means: Mann-Whitney’s U test; Kruskal-Wallis H test
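
The four methods named above are all available in `scipy.stats`. The following is a rough sketch on made-up data (the samples and their parameters are assumptions chosen only for illustration): a monotonic but non-linear relationship for the rank correlations, and uniform samples with shifted locations for the rank-based comparison tests.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau, mannwhitneyu, kruskal

rng = np.random.default_rng(1)

# hypothetical data: y increases with x, but not linearly
x = rng.uniform(0, 10, 500)
y = np.exp(x / 3) + rng.normal(0, 0.5, 500)

# rank-based correlation coefficients
rho, p_rho = spearmanr(x, y)
tau, p_tau = kendalltau(x, y)
print("Spearman rho: %.3f, Kendall tau: %.3f" % (rho, tau))

# hypothetical non-Gaussian samples; b is shifted relative to a and c
a = rng.uniform(0.0, 1.0, 200)
b = rng.uniform(0.5, 1.5, 200)
c = rng.uniform(0.0, 1.0, 200)

# Mann-Whitney U test for two independent samples
u_stat, p_u = mannwhitneyu(a, b, alternative="two-sided")
# Kruskal-Wallis H test for more than two independent samples
h_stat, p_h = kruskal(a, b, c)
print("Mann-Whitney U: p=%.4f, Kruskal-Wallis H: p=%.4f" % (p_u, p_h))
```

Because these methods work on ranks rather than raw values, they do not assume the data are Gaussian.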

Jason, I just want to thank you a lot for this course. I’m gonna keep building on this and become a great data scientist. Thanks a lot.

Reply
- Jason Brownlee June 10, 2020 at 6:21 am #
  
  Well done!
  
  Reply
BN June 16, 2020 at 8:24 pm #

Lesson 2:
descriptive statistic: mean, median, variance, histogram, scatter-plot
inferential statistic: significance, hypothesis testing, confidence interval, clustering

Reply
- Jason Brownlee June 17, 2020 at 6:22 am #
  
  Well done!
  
  Reply
Julian June 17, 2020 at 6:04 am #

Hi Jason
I’m learning so much with your blog. Answering the lesson2. Descriptive methods are: mean, mode, Standard deviation.
Inferential methods are: Hypothesis tests, confidence interval, regression analysis.
Is it correct? Thanks and Regards

Reply
- Jason Brownlee June 17, 2020 at 6:29 am #
  
  Nice!
  
  Reply
BN June 18, 2020 at 1:12 am #

Hi Jason,

Lesson 3:
“from scratch”

# 17.06.2020/na
# without error handling!
import numpy as np
# convert to a NumPy array so that element-wise arithmetic works below
zahlen = np.array([float(element) for element in
    input("Type the values (comma delimited):").split(",")])
print("Values :",zahlen)
print("Mean :",np.mean(zahlen))
print("Variance:",np.var(zahlen))
mean_s = np.sum(zahlen)/len(zahlen)
print("Mean from scratch :", mean_s )
var_s = np.sum((zahlen - mean_s)**2)/len(zahlen)
print("Variance from scratch:", var_s)

Thanks
Béla

Reply
- Jason Brownlee June 18, 2020 at 6:28 am #
  
  Well done!
  
  Reply
Charles June 24, 2020 at 8:27 pm #

Hello Jason – Thanks for your efforts.
Day 1 – 3 reasons why this Course on Statistics
1. I feel that to do a good job at Data Analysis – Statistics is a must
2. While I am confident about the rest of the material, statistics is my weak point and I need to improve. The problem is that I have read boring books on statistics, written with the mathematics wiz in mind. I am looking for something that is crisp, to the point and ML focused.
3. I feel you are doing a good job based on my reviews and hence want to give this a shot!

Regards,

Reply
- Jason Brownlee June 25, 2020 at 6:14 am #
  
  Well done!
  
  Reply
charles thomas June 25, 2020 at 3:50 pm #

Day 2:
Descriptive Statistics – Mean, Mode, Variance
Inferential Statistics – z score, Regression, T Tests

Reply
- Jason Brownlee June 26, 2020 at 5:29 am #
  
  Nice work!
  
  Reply
Malik June 25, 2020 at 7:46 pm #

1) I want to learn ML, and statistics is important for ML.
2) I am a BI developer and I want to upgrade my skills.
3) I want to learn statistics for solving business ML problems.

Reply
- Jason Brownlee June 26, 2020 at 5:31 am #
  
  Thanks!
  
  Reply
charles thomas July 1, 2020 at 7:37 pm #

data [58.12172682 46.94121793 47.35914124 … 44.92928092 49.68651887
42.81065054]
Mean: 50.049
Variance: 24.939
Standard Deviation: 4.994

This is what I got out of Lesson 3 (Delayed. Catching up). Thanks.

Reply
- Jason Brownlee July 2, 2020 at 6:19 am #
  
  Nice work!
  
  Reply
charles thomas July 1, 2020 at 8:04 pm #

For Day 4 got this
Pearsons correlation: 0.888

Reply
- Jason Brownlee July 2, 2020 at 6:19 am #
  
  Well done.
  
  Reply
charles thomas July 2, 2020 at 7:28 pm #

Day 5:
• One Sample Z Test
• TI 83
• Chi square test

Hypothesis Testing – other methods

Reply
- Jason Brownlee July 3, 2020 at 6:13 am #
  
  Well done.
  
  Reply
BN July 6, 2020 at 8:45 pm #

Hi Jason, Day 4:

from pandas import read_csv
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
# correlation Pearson
ccc = dataset[['pollution','wnd_spd','press','temp','dew']].corr(method='pearson')
print('ccc:',ccc)

ccc:
           pollution   wnd_spd     press      temp       dew
pollution   1.000000 -0.234362 -0.045544 -0.090798  0.157585
wnd_spd    -0.234362  1.000000  0.185380 -0.154902 -0.296720
press      -0.045544  0.185380  1.000000 -0.827205 -0.778737
temp       -0.090798 -0.154902 -0.827205  1.000000  0.824432
dew         0.157585 -0.296720 -0.778737  0.824432  1.000000

Reply
- Jason Brownlee July 7, 2020 at 6:34 am #
  
  Well done!
  
  Reply
BN July 7, 2020 at 6:56 pm #

Hi Jason, Day 5:

other statistical hypothesis tests:

Dean&Dixon Q-Test
Grubbs’s Test (outliers)
F-Test (variance)
Wilcoxon-Test
Kolmogorow-Smirnow-Test
Chi-Square-Test

Reply
- Jason Brownlee July 8, 2020 at 6:29 am #
  
  Great work!
  
  Reply
sadam khan July 10, 2020 at 2:31 am #

It’s a very nice book,
very effective
for machine learning beginners.

Reply
- Jason Brownlee July 10, 2020 at 6:05 am #
  
  Thank you!
  
  Reply
BN July 10, 2020 at 6:20 am #

Hi Jason, Day 6:

effect size:

Correlation between two variables (Pearson r)

Difference between two means (Cohen’s d)

Reply
- Jason Brownlee July 10, 2020 at 1:40 pm #
  
  Nice work!
  
  Reply
BN July 11, 2020 at 12:59 am #

Hi Jason, Day 7:

Nonparametric:

Median Test
Skew Test
Levene Test

Reply
- Jason Brownlee July 11, 2020 at 6:18 am #
  
  Nice work!
  
  Reply
SUPRIYA July 15, 2020 at 1:06 am #

Hi Sir, Day 1
1. I want to learn ML deeply so for me statistics is important.
2. In dealing with big data, to gain insights i think statistics plays an important role.
3. Also in data science and data analytics, statistics is more important I think.

Thank you.

Reply
- Jason Brownlee July 15, 2020 at 8:26 am #
  
  Thanks!
  
  Reply
SUPRIYA July 15, 2020 at 2:22 am #

Hi Sir, Day 2

For Descriptive statistics – Mean, Median and Mode
For Inferential statistics – Confidence interval, T-test and Linear regression analysis

Thank you

Reply
- Jason Brownlee July 15, 2020 at 8:29 am #
  
  Nice work!
  
  Reply
Zach July 17, 2020 at 3:05 am #

My reasons to learn statistics

Visualization and exploratory analysis. I want to choose the best tools to clearly describe my conclusions visually to a universal audience.

I want to ensure my data is perfectly prepared for my intended model. I need normalization techniques, feature engineering and more statistical methods!

I also want to learn more about sampling techniques and uses because this has a vast field of application.

Reply
- Jason Brownlee July 17, 2020 at 6:23 am #
  
  Thanks!
  
  Reply
Zach July 17, 2020 at 3:28 am #

Descriptive
Mean, correlation, standard deviation

Inferential
T test, Z-score, regression analysis

Reply
- Jason Brownlee July 17, 2020 at 6:24 am #
  
  Well done!
  
  Reply

Zach July 17, 2020 at 3:51 am #

def smean(data):
    """Calculates the mean of a 1D data sample"""
    if len(data) > 0:
        sum = 0
        for point in data:
            sum += point
        mean = sum / len(data)
        return mean
    else:
        print('Missing data input')

def svar(data):
    """Calculates the variance of a 1D data sample"""
    mean = smean(data)
    numerator = 0
    for point in data:
        numerator += (point - mean)**2
    denominator = len(data) - 1
    var = numerator / denominator
    return var

def sstd(data):
    """Calculates the standard deviation of a 1D data sample"""
    return (svar(data))**(1/2)

Jason Brownlee July 17, 2020 at 6:24 am #

Great work!

Reply

SUPRIYA July 18, 2020 at 12:32 am #

hello sir, Day 5

1. F-Test
2. Analysis of Variance
3. Chi-square Test

Thank you.

Reply
- Jason Brownlee July 18, 2020 at 6:03 am #
  
  Well done.
  
  Reply
Zach Brown July 18, 2020 at 7:54 am #

Pearson Correlation Coefficient

## Real world example
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# Load data
covid_data = pd.read_csv('us-counties.csv')

# Explore the raw data
print(covid_data.head())
print("\nColumns:", len(covid_data.columns))
print("Rows:", len(covid_data))
print('\n', covid_data.describe())

# Calculate Pearson's correlation
corr, p = pearsonr(covid_data['cases'], covid_data['deaths'])
print("\nPearson's correlation:", corr)

# Plot cases vs deaths
fig_covid, ax_covid = plt.subplots()
ax_covid.plot(covid_data['cases'], covid_data['deaths'], 'r.')
# display the figure when run as a script
plt.show()

Reply
- Jason Brownlee July 18, 2020 at 1:12 pm #
  
  Well done!
  
  Reply
supriya July 21, 2020 at 3:43 am #

Hi sir, Day 6
1. Pearson’s correlation coefficient
2. Cohen’s d
Thank you

Reply
- Jason Brownlee July 21, 2020 at 6:11 am #
  
  Well done!
  
  Reply
SUPRIYA July 22, 2020 at 3:13 am #

Hi Sir, Day 7,

1. Spearman’s rank-order correlation
2. Wilcoxon Signed-Rank Test
3. Chi-Square Test

Thank you.

Reply
- Jason Brownlee July 22, 2020 at 5:44 am #
  
  Well done!
  
  Reply
Zach Brown July 23, 2020 at 4:36 am #

Day 5, Statistical Hypothesis Tests

Pearson’s Correlation Coefficient
Analysis of Variance
D’Agostino’s K^2 Test

Reply
- Jason Brownlee July 23, 2020 at 6:26 am #
  
  Nice work!
  
  Reply
Zach Brown July 23, 2020 at 5:31 am #

Edit on my last comment

Remove Pearson’s Correlation Coefficient from the list –

Add
Chi-squared Test

Reply
Zach Brown July 23, 2020 at 9:20 am #

Day 6, Estimation Statistics

R^2, Coefficient of Determination. Useful in machine learning as a performance metric. An R^2 value close to zero indicates poor model performance, and an R^2 value close to one indicates good performance.

Cohen’s d. Useful for describing the difference between the means of two normally distributed datasets. It is expressed in units of the standard deviation.
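
To make the R^2 description concrete, here is a minimal from-scratch sketch using the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the small target and prediction lists are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # residual sum of squares: error left over after the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # total sum of squares: error of always predicting the mean
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# hypothetical targets and two sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
good = r_squared(y_true, [3.1, 4.9, 7.2, 8.8])      # predictions close to the targets
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean
print("good model R^2: %.3f" % good)
print("mean-only model R^2: %.3f" % baseline)
```

A model that always predicts the mean scores zero, and a model that does worse than that gets a negative R^2, so the value is not strictly bounded below by zero.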

Reply
- Jason Brownlee July 23, 2020 at 2:41 pm #
  
  Well done!
  
  Reply
Zach Brown July 23, 2020 at 9:40 am #

Day 7, Nonparametric Statistics

Quantile regression
Kruskal-Wallis
Friedman test

Reply
- Jason Brownlee July 23, 2020 at 2:41 pm #
  
  Great work!
  
  Reply
Amal Kanti Seal August 22, 2020 at 11:41 pm #

Hi Jason,

1. Statistics in Data Preparation
2. Statistics in Model Evaluation
3. Statistics in Model Selection
4. Statistics in Model Presentation
5. Statistics in Prediction

I want to learn all five of the above techniques. Actually, one of my Ph.D. friends in the US is working on some projects in Computational Biology (e.g. on Cancer Research and COVID-19). He has sound knowledge of Mathematics as he has a Ph.D. in Physics. But he is not very comfortable with Programming and ML. So he asked me if I can help him with data analysis and prediction.

I, on the other hand, have proficiency in programming (C, C++, Java and basic Python). I am learning ML which, I think, requires good skill of linear algebra, multivariate calculus and statistics. I learned these maths during my 3-year degree course in college during 1968-1971. So I have to refresh that maths skill, particularly with reference to ML.

Looking forward to getting guidance from you.

Amal

Reply
- Jason Brownlee August 23, 2020 at 6:27 am #
  
  Well done!
  
  That sounds great. I’m here to help if you have any questions.
  
  Reply
ROHAN August 31, 2020 at 1:29 pm #

lesson 1
Reasons I want to learn statistics:

1. I am new to ML techniques and algorithms and they are either fully borrowed from or heavily rely on statistics.
2. It will surely help me brush up my skills in statistics.
3. It has applications in other fields also, so fair deal to learn.

Reply
- Jason Brownlee September 1, 2020 at 6:24 am #
  
  Thanks!
  
  Reply
Ashish Soni September 4, 2020 at 1:05 am #

Hello Jason

Task for Lesson 1:

1. To get a deeper understanding of the workings of Machine Learning techniques.

2. Gain more knowledge.

3. To understand when to use which statistical test and why, during data analysis pipeline.

Reply
- Jason Brownlee September 4, 2020 at 6:31 am #
  
  Well done!
  
  Reply
Ashish Soni September 11, 2020 at 9:49 am #

Day 2: Introduction to Statistics

Descriptive Statistics: Mean , Variance , Median

Inferential Statistics: ANOVA, chi-square and t-test.

Reply
- Jason Brownlee September 11, 2020 at 1:31 pm #
  
  Well done!
  
  Reply
Bahar September 11, 2020 at 9:25 pm #

Thank you, and again thank you, for such a useful environment for people who are interested in this field and want to learn more about it in detail.

3 reasons ‘Why I am interested in this course’:

I am an AI researcher working on different projects with real-world data. We receive data.
1- We have to decide whether to use regression or classification. Should we use deep learning?
2- Is our sample size large enough? Or what is the minimum sample size in our case?
3- Then we have to select the best model, so I need to compare different standard models (e.g. regression models), apply cross_val_score and compare their MAE, MSE and RMSE.

Then some issues come up. For example, if my sample size is 12 then I cannot use the ‘r2’ score (because 12 is a small sample size). In such a case I want to know if and how I can solve the sample size problem, and what details and points I should consider in order to find the best model.
And which statistics help me choose the best way of resampling for my problem?

seems more than 3 reasons;)

Thanks in advance

Reply
- Jason Brownlee September 12, 2020 at 6:12 am #
  
  Thanks for sharing!
  
  Reply
Ct September 21, 2020 at 7:50 pm #

Dear Jason

Why I am Interested in this learning.

1. To get a deeper understanding and a brief explanation of statistical tests in machine learning.

2. To understand how each algorithm works in predictive analytics

3. To understand how to select the best model and validate the model.

Reply
- Jason Brownlee September 22, 2020 at 6:44 am #
  
  Thanks!
  
  Reply
Musaab Mohamed October 17, 2020 at 11:36 am #

Day 1:

I am always working with data within my field of specialty:

I want to learn statistics for:

1. Prepare, validate and describe the data for analysis and modeling.
2. Checking the difference of the results.
3. Building a prediction model and variability of the results.

Reply
- Jason Brownlee October 17, 2020 at 1:44 pm #
  
  Thanks!
  
  Reply
Rajesh Babu Movva October 18, 2020 at 4:41 pm #

1. ML is based on estimation and probabilities, which encourages me to learn statistics.
2. Statistics gives me insight for better understanding data.
3. ML solves real problems in the world, and real problems are based on statistics.

Reply
- Jason Brownlee October 19, 2020 at 6:37 am #
  
  Nice work!
  
  Reply
L Bramwell November 13, 2020 at 10:12 pm #

Lesson #1

1. To understand how to decide if an algorithm beats the current gold standard.
2. To help me learn to use machine learning approaches and understand how to test them.
3. Practice at programming!

Reply
- Jason Brownlee November 14, 2020 at 6:33 am #
  
  Well done!
  
  Reply
L Bramwell November 18, 2020 at 11:50 pm #

Lesson #2

Descriptive statistics:
a) Mean
b) Standard deviation
c) Standard Error

Inferential statistics methods:
a) Z score
b) logistic regression
c) T tests

Reply
- Jason Brownlee November 19, 2020 at 7:45 am #
  
  Great work.
  
  Reply
Archana Saxena November 19, 2020 at 10:02 pm #

3 reasons:
1. Interpretation of charts is just not possible without learning these facts
2. Model selection based on input data is difficult
3. Not able to proceed in Machine Learning

Reply
- Jason Brownlee November 20, 2020 at 6:45 am #
  
  Well done!
  
  Reply
John Reynolds November 20, 2020 at 4:35 am #

Lesson #1

1. To find out why “Lies, damned lies, and statistics” is inaccurate (https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics);
2. To attempt to understand how precision can be brought to imprecision;
3. For Joy.

Reply
- Jason Brownlee November 20, 2020 at 6:47 am #
  
  Nice!
  
  Reply
John Reynolds November 21, 2020 at 8:10 am #

#Lesson 2

Descriptive Stats:
1. Measures of central tendency – Mode, Mean, Median
2. Graphical methods, Histograms, Boxplots, Scatter Diagrams
3. Measures of variability and data spread

Inferential Stats:
1. Determine a method for inferring from a sample to a population
2. Hypothesis Testing
3. Parameter estimation

Reply
- Jason Brownlee November 21, 2020 at 1:03 pm #
  
  Well done.
  
  Reply
John Reynolds November 23, 2020 at 6:07 am #

Lesson #3

import numpy as np

np.random.seed(29)
sample = np.random.randint(100, size=1000)

mean = sum(sample)/len(sample)
var = sum((x-mean)**2 for x in sample)/len(sample)

print(f'mean={mean}, variance={var}')
print(f'np.mean={np.mean(sample)}, np.var={np.var(sample)}')

Reply
- Jason Brownlee November 23, 2020 at 6:18 am #
  
  Nice work!
  
  Reply
John Reynolds November 23, 2020 at 6:33 am #

#Lesson 04

import pandas as pd

wine_df = pd.read_csv('winequality-white.csv', sep=';')
# print the matrix so it is also shown when run as a script
print(wine_df.corr(method='pearson'))

Reply
- Jason Brownlee November 23, 2020 at 7:31 am #
  
  Well done!
  
  Reply
John Reynolds November 27, 2020 at 12:59 am #

#Lesson 5

1. Shapiro-Wilk Test – Variable Distribution Type Tests (Gaussian)
2. Chi-Squared Test – Variable Relationship Tests (correlation)
3. Mann-Whitney U Test – Compare Sample Means (nonparametric)
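
As a small illustration of the first item in the list above, the Shapiro-Wilk test can be run with `scipy.stats.shapiro`. This is just a sketch on made-up samples (one Gaussian, one exponential, both assumptions for illustration); the null hypothesis is that the sample was drawn from a Gaussian distribution.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
# hypothetical samples: one Gaussian, one clearly skewed
gaussian = rng.normal(50, 5, 500)
skewed = rng.exponential(5, 500)

stat_g, p_g = shapiro(gaussian)
stat_s, p_s = shapiro(skewed)
print("Gaussian sample: W=%.3f, p=%.4f" % (stat_g, p_g))
print("Skewed sample:   W=%.3f, p=%.4f" % (stat_s, p_s))

# interpret at a significance level of 0.05
for name, p in (("gaussian", p_g), ("skewed", p_s)):
    if p > 0.05:
        print(name, "looks Gaussian (fail to reject H0)")
    else:
        print(name, "does not look Gaussian (reject H0)")
```

A check like this is one way to decide whether a parametric test is reasonable or whether a nonparametric alternative such as the Mann-Whitney U test is a safer choice.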

Reply
- Jason Brownlee November 27, 2020 at 6:42 am #
  
  Nice work!
  
  Reply
John Reynolds December 1, 2020 at 4:15 am #

#Lesson 6

Effect size is a statistic that measures the strength of the relationship between two variables on a numeric scale.

1. Pearson r correlation:
2. Standardized means difference

Reply
- Jason Brownlee December 1, 2020 at 6:21 am #
  
  Well done!
  
  Reply
John Reynolds December 1, 2020 at 4:23 am #

#Lesson 7

1. Anderson–Darling test
2. Cochran’s Q

Reply
- Jason Brownlee December 1, 2020 at 6:21 am #
  
  Great work!
  
  Reply
Viswitha Kalamalla December 9, 2020 at 1:35 pm #

Data preparation
Model evaluation
Model selection

Reply
- Jason Brownlee December 10, 2020 at 6:17 am #
  
  Nice work.
  
  Reply
L Bramwell December 12, 2020 at 3:40 am #

Lesson 3

Used the wine dataset too. Code below…

from numpy import mean
import pandas as pd

df = pd.read_csv("wine.csv")
print(mean(df["Alcohol"]))

Reply
- Jason Brownlee December 12, 2020 at 6:31 am #
  
  Well done!
  
  Reply
Qi Yunlong January 3, 2021 at 2:50 pm #

Hello Jason,

I have read a bunch of your articles on machine learning.
They are a great help to me in both understanding the basic concepts and implementing ML experiments.

Here I would like to list names of 5 non-parametric tests:
1. Wilcoxon signed-rank test
2. Kruskal-Wallis H-test
3. Ansari-Bradley test
4. Bartlett’s test
5. Mood’s two-sample test

Reply
- Jason Brownlee January 4, 2021 at 6:02 am #
  
  Great work!
  
  Reply
Shamen Paris January 17, 2021 at 10:17 pm #

I would like to understand the concepts of ML before starting it. I think this lesson will help me achieve this.

Reply
- Jason Brownlee January 18, 2021 at 6:06 am #
  
  Thanks.
  
  Reply
Bahram Khazra March 4, 2021 at 2:35 am #

Hello Jason,
Thank you for the fascinating course.
Day1:
1- One of human’s innate desires is to take control of his/her environment. This will not be possible without the knowledge of statistics. I think the most reliable tool to understand nature is statistics.
2- Statistics provide me with a pipeline to translate my understanding of biology to a practical model.
3- Realizing the vital role of data collection and predictive models (e.g. global warming), I’ve found statistics the most essential skill to have.
4- I think statistics and logic are birds of a feather, and logical reasoning is a technique that everyone uses in daily life.

Reply
- Jason Brownlee March 4, 2021 at 5:51 am #
  
  Well done!
  
  Reply
Ivan Khurudzhi March 4, 2021 at 5:55 am #

Lesson #1

1. To understand ML
2. To advance mathematical knowledge
3. To become a better data engineer

Reply
- Jason Brownlee March 4, 2021 at 5:59 am #
  
  Great work!
  
  Reply
Nafy Aidara March 26, 2021 at 4:34 am #

I need statistics for the following reasons
1. To understand the data I am collecting
2. I am enrolled in a PhD program and I have to build a model and make some predictions
3. To become a data scientist

Reply
- Jason Brownlee March 26, 2021 at 6:28 am #
  
  Well done!
  
  Reply
Nafy Aidara March 26, 2021 at 5:04 am #

Descriptive Statistics:
1. Visualizing the data through histograms and bar charts
2. Calculation of frequencies
3. Evaluation of samples and population means
Inferential
1. Testing hypothesis
2. Analyzing samples
3. Parameter estimation

Reply
- Jason Brownlee March 26, 2021 at 6:28 am #
  
  Great work!
  
  Reply
Haris Joseph Nitish April 14, 2021 at 7:07 pm #

Hi Jason,

Three reasons why I want to learn statistics

1. It is the very basic necessity to understand Machine Learning.
2. Your way of teaching is precise and unique which evokes more interest in Machine Learning.
3. I am moving from Database to Data Science, and I need this as this is the corner stone.

Thank you.

Reply
- Jason Brownlee April 15, 2021 at 5:24 am #
  
  Well done!
  
  Reply
Haris Joseph Nitish April 14, 2021 at 7:12 pm #

Lesson 02: Introduction to Statistics

1. Methods in Descriptive Statistics
a) the distribution
b) the central tendency
c) the dispersion
2. Methods in Inferential Statistics
a) Hypothesis Testing
b) Confidence Intervals
c) Comparison of Means

Reply
- Jason Brownlee April 15, 2021 at 5:24 am #
  
  Nice work!
  
  Reply
Haris Joseph Nitish April 15, 2021 at 1:15 am #

Lesson 03: Gaussian Distribution and Descriptive Stats:

from numpy.random import randn
from numpy import mean
data = randn(10000)
ave=sum(data)/len(data)
print('Mean: %.10f' % ave)

Reply
- Jason Brownlee April 15, 2021 at 5:30 am #
  
  Well done.
  
  Reply
Haris Joseph Nitish April 16, 2021 at 2:58 am #

Lesson 04: Correlation Between Variables

Method1: Using pearson correlation on the dataset
from pandas import read_csv
from scipy.stats import pearsonr
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
housingDf = read_csv(url, header=None)
housingDf.corr(method='pearson')

If we want to build from scratch:
l = housingDf.shape[1]
# square matrix with one row and one column per variable
Matrix = [[0 for x in range(l)] for y in range(l)]
for i in range(0, l):
    for j in range(0, l):
        data1 = housingDf[i]
        data2 = housingDf[j]
        corr, p = pearsonr(data1, data2)
        Matrix[i][j] = "{:.6f}".format(corr)

Matrix

Reply
- Jason Brownlee April 16, 2021 at 5:33 am #
  
  Well done.
  
  Reply
Haris Joseph Nitish April 19, 2021 at 5:37 pm #

Analysis of Variance Test (ANOVA): To check if the means of two or more groups are significantly different from each other
One sample t-test: The mean of a single group is compared with a given mean.
Paired T-Test: Test for difference between two variables from the same population
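
The two t-tests described above can be sketched with `scipy.stats.ttest_1samp` and `scipy.stats.ttest_rel`. The "before" and "after" measurements below are made-up data (an assumption for illustration), with the second measurement shifted upward by about two units for every subject.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(1)
# hypothetical paired measurements on the same 100 subjects
before = rng.normal(50, 5, 100)
after = before + rng.normal(2, 1, 100)

# one-sample t-test: is the mean of 'before' different from a given mean of 50?
t1, p1 = ttest_1samp(before, popmean=50)
print("One-sample t-test: t=%.3f, p=%.4f" % (t1, p1))

# paired t-test: is there a difference between the two related measurements?
t2, p2 = ttest_rel(before, after)
print("Paired t-test: t=%.3f, p=%.4f" % (t2, p2))
```

The paired test works on the per-subject differences, which is why it can detect a small shift that an independent-samples test on the same data might miss.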

Reply
- Jason Brownlee April 20, 2021 at 5:55 am #
  
  Nice work!
  
  Reply
Haris Joseph Nitish April 20, 2021 at 12:02 am #

Lesson 06: Estimation Statistics

The two methods for calculating the effect size in applied machine
learning are
1. Association. The degree to which two samples change together.
2. Difference. The degree to which two samples are different.
https://machinelearningmastery.com/estimation-statistics-for-machine-learning/

Reply
- Jason Brownlee April 20, 2021 at 6:00 am #
  
  Well done.
  
  Reply
Haris Joseph Nitish May 4, 2021 at 6:40 pm #

Lesson 07: Nonparametric Statistics

Three additional non parametric statistical methods are:
1. Wilcoxon Signed-Rank Test.
2. Kruskal-Wallis H Test.
3. Friedman Test.

Reply
- Jason Brownlee May 5, 2021 at 6:09 am #
  
  Well done!
  
  Reply
Mike Heittz June 22, 2021 at 10:08 pm #

I am starting on lesson 1 and here are my responses to 3 reasons why I personally want to learn statistics:

1. I want to re-learn old skills utilizing new technology and techniques – I learned statistics in college in the 80’s-90’s using SAS, but have forgotten so much!

2. I want to understand how to build a good ML Model with a strong understanding of the underlying statistical methods that MAKE it a good model.

3. I want to be able to explain the methodology used for building a good ML Model.

Reply
- Jason Brownlee June 23, 2021 at 5:37 am #
  
  Well done!
  
  Reply
Mike Heitz June 22, 2021 at 10:31 pm #

Responses for lesson 2:

1. Descriptive Statistics : Central Tendency (Mean, Median, Mode)
Spread (Standard Deviation, Range, Variance)

2. Inferential Statistics : T-Tests, Regression analysis (linear, logistic), ANOVA

Reply
- Jason Brownlee June 23, 2021 at 5:37 am #
  
  Great work!
  
  Reply
Mike Heitz June 24, 2021 at 10:11 pm #

Lesson 3: Learned something new today – I usually use the .format method but never used the features to round the values 🙂

# imports needed by the example
from numpy.random import seed, randn
from numpy import mean, var, std

# seed the random number generator
seed(10)

# generate univariate observations
data = 5 * randn(10000) + 50
print(data)

# calculate statistics
print('Mean: %.3f' % mean(data))
print('Variance: %.3f' % var(data))
print('Standard Deviation: %.3f' % std(data))

print('Mean: {:.3f}'.format(mean(data)))
print('Variance: {:.3f}'.format(var(data)))
print('Standard Deviation: {:.3f}'.format(std(data)))

Reply
- Jason Brownlee June 25, 2021 at 6:14 am #
  
  Well done.
  
  Reply
Pankaj T November 19, 2021 at 1:37 pm #

Lesson 01:
Three reasons I personally want to learn statistics.
1. Learning statistics is the start of building a data-driven mindset and improving analytical skills.
2. To explore data quality, such as data integrity, data accuracy and the correlation of variables. This is an important activity before any investment is made in a data-driven project.
3. A statistics foundation allows telling stories with numbers, graphs and visual diagrams that are easy to digest, understand and absorb in any strategic business discussion.

Reply
- Adrian Tam November 20, 2021 at 1:46 am #
  
  Good answer Pankaj. Especially point 3. Good statistics can really help telling a good story.
  
  Reply
Pankaj November 20, 2021 at 4:03 pm #

Lesson 2:
Three methods that can be used for descriptive :
1. Continuous data central tendency
2. Summary tables
3. Visual graphs- Bar chart, Box plot, histogram.

Three methods that can be used for inferential statistics:
1. I wonder if Feature selection method such as Chi-squared (Thanks for your blog)
2. Point estimation
3. Interval estimation

Reply
Ahmad Hossein Zadeh November 30, 2021 at 9:52 am #

Lesson 1:

Reasons for why learning statistics:

1. Using statistical methods, I can start talking with the data and understand the data, trends, and hidden (statistical) features in data.
2. I will be more confident to learn and implement machine learning.

Reply
Ahmad Hossein Zadeh December 2, 2021 at 5:17 am #

Lesson 2:

Some methods/tools in descriptive statistics: distribution (patterns and trends in the dataset), central tendency (median, mode, mean), variability (skewness, standard deviation, kurtosis, min/max values, etc.)

Some methods/tools in inferential statistics, used for analyzing a data sample randomly drawn from a larger population: hypothesis testing, confidence intervals, regression and correlation analysis.

Reply
AjayKS December 13, 2021 at 7:14 pm #

I would like to learn statistics as
1. Statistics forms the basis of understanding data, which is essential to building any machine learning model
2. Statistical measures are a must for evaluating the performance of a machine learning model
3. Machine learning models, in my opinion, exploit the underlying data distribution, and hence the fundamentals of statistics are a must for the entire lifecycle of machine learning, from data analysis to model development and finally evaluation

Reply
Mahmud Rahman December 22, 2021 at 9:02 pm #

The main three reasons are
1. Statistics is one of the mathematical foundations in ML.
2. Statistics is necessary for prediction with new data.
3. Statistics is all about big data. That is the realm we are now approaching.

Reply
- James Carmichael December 24, 2021 at 5:06 am #
  
  Thank you for your interest and feedback Mahmud! What are some areas and applications you are interested in learning more about in machine learning?
  
  Regards,
  
  Reply
  - Mahmud Rahman December 24, 2021 at 11:51 am #
    
    Thanks for your comments. I am mostly interested ( at the moment !) in applying ML/DL to
    – Finance
    – Renewable energy
    – Pandemic
    Cheers!
    
    Reply
    - James Carmichael December 26, 2021 at 8:06 am #
      
      You are very welcome Mahmud! Thank you for your feedback!
      
      Regards,
      
      Reply
Mahmud Rahman December 22, 2021 at 10:46 pm #

Q2 response:

Three methods for descriptive statistics: the frequency distribution, the central tendency and
the dispersion.

For inferential statistics methods, three of them are hypothesis tests, confidence intervals, and regression analysis.

Reply
- James Carmichael December 24, 2021 at 4:57 am #
  
  Thank you for the feedback Mahmud! What type of applications are you working on? We offer a great deal of content that can help get you started quickly on your machine learning projects.
  
  Regards,
  
  Reply
Mahmud Rahman December 23, 2021 at 12:04 pm #

Day 3 task:

Code:

# calculate mean
from numpy.random import seed
from numpy.random import randn
from numpy import mean
# seed the random number generator
seed(1)
# generate univariate observations
data=5*randn(10000)+50
# calculate mean
print('Mean: %.3f' % mean(data))

Output:

Mean: 50.049

Reply
- James Carmichael February 18, 2022 at 1:07 pm #
  
  Thank you for the feedback Mahmud! Keep up the great work!
  
  Reply
Mahmud Rahman December 23, 2021 at 7:07 pm #

Day 4 task
Code:
# calculate correlation coefficient
from numpy.random import seed
from numpy.random import randn
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 40 * randn(2000) + 200
data2 = data1 + (20 * randn(2000) + 100)
# calculate Pearson’s correlation
corr, p = pearsonr(data1, data2)
# display the correlation
print('Pearsons correlation: %.3f' % corr)

Output:

Pearsons correlation: 0.896

Reply
- James Carmichael December 24, 2021 at 5:11 am #
  
  Thank you for your feedback Mahmud! Do you have any questions regarding the lessons or in regard to the output you have received from execution of examples provided?
  
  Regards,
  
  Reply
Mahmud Rahman December 24, 2021 at 12:13 pm #

Day 5 task:
– two-sample z-test: testing for the difference between proportions
– paired t-test
– ANOVA
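
Two of the tests listed above can be tried with SciPy. The following is a minimal sketch that assumes synthetic Gaussian data; the sample sizes, means, and variable names are made up for illustration. The two-sample z-test for proportions is left out because it is not part of scipy.stats.

```python
# paired t-test and one-way ANOVA on synthetic data
from numpy.random import seed, randn
from scipy.stats import ttest_rel, f_oneway

# seed the random number generator
seed(1)

# paired samples: the same 100 subjects measured twice (assumed shift of about 1)
before = 5 * randn(100) + 50
after = before + (randn(100) + 1)
t_stat, t_p = ttest_rel(before, after)
print('Paired t-test: t=%.3f, p=%.3f' % (t_stat, t_p))

# three independent groups, the third with a different assumed mean
group1 = 5 * randn(100) + 50
group2 = 5 * randn(100) + 50
group3 = 5 * randn(100) + 55
f_stat, f_p = f_oneway(group1, group2, group3)
print('ANOVA: F=%.3f, p=%.3f' % (f_stat, f_p))

# interpret each result at an assumed significance level of 0.05
alpha = 0.05
for name, p in (('Paired t-test', t_p), ('ANOVA', f_p)):
    if p > alpha:
        print('%s: fail to reject H0' % name)
    else:
        print('%s: reject H0' % name)
```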

Reply
Mahmud Rahman December 24, 2021 at 4:36 pm #

Day 6 task:

Calculation for Association Effect Size:
Pearson’s correlation coefficient (it is also called Pearson’s r). This value measures the degree of linear association between two real-valued variables.

Calculation for Difference Effect Size:
Cohen’s d, which measures the difference between the means of two Gaussian-distributed variables.
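
A minimal sketch of the Cohen's d calculation follows, using the pooled sample standard deviation. The helper name cohens_d and the two synthetic samples are assumptions for illustration; this is one common formulation rather than a library function.

```python
# Cohen's d for two independent samples using the pooled standard deviation
from numpy import mean, var, sqrt
from numpy.random import seed, randn

def cohens_d(sample1, sample2):
    # sample sizes
    n1, n2 = len(sample1), len(sample2)
    # sample variances (ddof=1)
    s1, s2 = var(sample1, ddof=1), var(sample2, ddof=1)
    # pooled standard deviation
    pooled = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # difference between means in units of the pooled standard deviation
    return (mean(sample1) - mean(sample2)) / pooled

# seed the random number generator
seed(1)
# two synthetic Gaussian samples whose means differ by half a standard deviation
data1 = 10 * randn(10000) + 60
data2 = 10 * randn(10000) + 55
d = cohens_d(data1, data2)
print("Cohen's d: %.3f" % d)
```

By a common rule of thumb, a value near 0.5 is read as a medium effect size.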

Reply
Mahmud Rahman December 24, 2021 at 5:23 pm #

Day 7:
Three additional nonparametric statistical methods are:
– Kendall rank correlation coefficient, commonly referred to as Kendall’s τ coefficient
– Spearman’s rank correlation coefficient or Spearman’s ρ
– Siegel–Tukey test
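
The first two methods above are available in SciPy as kendalltau and spearmanr. Below is a minimal sketch on synthetic uniform data; the data and variable names are assumptions for illustration. The Siegel–Tukey test is not shown because scipy.stats does not provide it.

```python
# rank correlation with Spearman's rho and Kendall's tau
from numpy.random import seed, rand
from scipy.stats import spearmanr, kendalltau

# seed the random number generator
seed(1)
# two related samples drawn from uniform (non-Gaussian) distributions
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)

# Spearman's rank correlation coefficient
rho, rho_p = spearmanr(data1, data2)
print("Spearman's rho: %.3f (p=%.3f)" % (rho, rho_p))

# Kendall's rank correlation coefficient
tau, tau_p = kendalltau(data1, data2)
print("Kendall's tau: %.3f (p=%.3f)" % (tau, tau_p))
```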

Reply
Dror de Hartog January 10, 2022 at 3:45 am #

Hi Jason,
1. I am currently working as a Data Analyst but have no background in statistics. I noticed that this lack of knowledge affects my ability to perform accurate data analysis.
2. I am looking to get into the field of machine learning and believe that acquiring basic knowledge in statistics is a crucial initial step in that direction.
3. I love math and am eager to gain knowledge in a field that has direct practical value.

Reply
- James Carmichael January 10, 2022 at 11:09 am #
  
  Thank you for the feedback Dror! Our materials are designed to get you up to speed the quickest way possible, without the need for extensive prior knowledge of theory.
  
  Reply
Merve Zeynep March 15, 2022 at 2:10 am #

Hi Jason,
1- I have a mini project on Telco Fraud Detection, so I thought I could start with statistics to learn enough about the data to detect fraud using machine learning algorithms.
2- Starting with statistics is important because, before applying machine learning algorithms, I must learn how to apply the right statistics to get meaningful insight from the data. (For example, detecting outliers is important in fraud cases: fraudsters generally show some anomalies, and with this approach I could interpret their behaviour.)
3- Learning machine learning is one of my first goals, so that I can apply it to real problems. To solve these problems, I must learn how to do this and then integrate it with our data.

Reply
- James Carmichael March 15, 2022 at 1:43 pm #
  
  Great feedback Merve! Let us know if you have any specific questions regarding our content/code listings that we may assist you with.
  
  Reply
Prasanna Vadana March 16, 2022 at 2:52 am #

Lesson 1
1. Interested to learn Statistics.
2. For a machine learning enthusiast, Statistics is essential.
3. Statistics gives a different perspective for telling the story of the data under consideration.

Reply
- James Carmichael March 16, 2022 at 10:37 am #
  
  Great feedback Prasanna! I wish you the best on your machine learning journey!
  
  Reply
Prasanna Vadana March 16, 2022 at 10:02 am #

Lesson 2:
For Descriptive Statistics:
1. Measure of central tendency : Mean, median and mode
2. Measure of dispersion : variance, standard deviation
3. Measure of position: percentile and quartile ranks

For inferential statistics:
1. Correlation : pearson correlation, Spearman’s correlation, Chi-square
2. Regression : linear, logistic
3. Hypothesis
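
The measures of position in the descriptive list can be sketched with NumPy's percentile function. The synthetic sample and variable names below are assumptions for illustration.

```python
# quartiles and interquartile range as measures of position
from numpy import percentile
from numpy.random import seed, randn

# seed the random number generator
seed(1)
# generate univariate observations (assumed: mean 50, standard deviation 5)
data = 5 * randn(10000) + 50

# 25th, 50th (median) and 75th percentiles
q25, q50, q75 = percentile(data, [25, 50, 75])
print('Q1: %.3f, Median: %.3f, Q3: %.3f' % (q25, q50, q75))

# interquartile range: the spread of the middle 50% of the data
iqr = q75 - q25
print('IQR: %.3f' % iqr)
```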

Reply
- James Carmichael March 16, 2022 at 10:34 am #
  
  Thank you for the support and feedback Prasanna! Keep up the great work!
  
  Reply
vzs April 11, 2022 at 7:40 pm #

For Lesson 1:

– I’m interested in Statistics,
– I would like to get a deeper understanding of this field,
– I would like to use models based on AI in engineering.

Thank you!

Reply
- James Carmichael April 14, 2022 at 3:26 am #
  
  Thank you for the feedback Vzs!
  
  Reply
Mario Bibiano April 22, 2022 at 11:11 pm #

Statistics is the foundation of Machine Learning, great material James!!!!

Reply
- James Carmichael April 24, 2022 at 3:33 am #
  
  You are very welcome Mario! I wish you much success on your machine learning journey!
  
  Reply
Azibatasebh Peter Oghe October 30, 2022 at 4:52 pm #

Lesson 1:
1. I want to learn statistics so that I can use the right data analytics techniques in machine learning.

2. To understand how to properly visualize and interpret data.

3. To understand how to make proper predictions with my final model.

Reply
- James Carmichael October 31, 2022 at 7:46 am #
  
  Thank you for the feedback Azibatasebh! We wish you much success on your machine learning journey!
  
  Reply
yasu November 5, 2022 at 9:38 am #

Learning data analysis techniques seemed to make it more enjoyable.

I think learning statistics will be a good base for data analysis.

I think learning theory will help me understand it better.

Reply
- James Carmichael November 6, 2022 at 11:41 am #
  
  Thank you for your feedback yasu! We greatly appreciate it.
  
  Reply
Wesley December 14, 2022 at 3:59 am #

I want to be able to trust the methods I choose to solve problems. I think sometimes it looks like a model is performing well, or its outputs make sense, but we may actually have violated some underlying assumption, which makes the model untrustworthy. I want to be a trustworthy data scientist!

Reply
- James Carmichael December 14, 2022 at 9:24 am #
  
  Hi Wesley…The following resource may be of interest:
  
  https://arxiv.org/abs/2102.00902
  
  Reply
Wesley December 14, 2022 at 4:06 am #

Lesson 2:

Descriptive statistics:

1. Visualization of the distribution of a given population or sample
2. Using some summary statistic such as an average value to share findings with stakeholders who need to make business decisions.

Inferential statistics:
1. Outlier detection using confidence intervals
2. Linear regression, or better yet maybe analyzing the distribution of the error from a linear regression model to infer whether the relationship is actually linear.
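
The residual check described in the second inferential item can be sketched as follows, assuming synthetic data with a truly linear relationship and Gaussian noise. The data, coefficients, and variable names are made up for illustration, and the Shapiro-Wilk test is only one of several ways to examine the residual distribution.

```python
# fit a line and examine the distribution of the residuals
from numpy.random import seed, rand, randn
from scipy.stats import linregress, shapiro

# seed the random number generator
seed(1)
# synthetic data: assumed linear relationship plus Gaussian noise
x = 10 * rand(200)
y = 3 * x + 5 + 2 * randn(200)

# ordinary least squares fit of y on x
result = linregress(x, y)
print('Slope: %.3f, Intercept: %.3f' % (result.slope, result.intercept))

# residuals: observed values minus fitted values
predicted = result.intercept + result.slope * x
residuals = y - predicted
print('Residual mean: %.6f' % residuals.mean())

# Shapiro-Wilk normality test on the residuals
stat, p = shapiro(residuals)
print('Shapiro-Wilk: stat=%.3f, p=%.3f' % (stat, p))
```

A plot of the residuals against x is a useful companion check, since a curved pattern there suggests the relationship is not linear even when the residuals look roughly Gaussian.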

Reply
- James Carmichael December 14, 2022 at 9:18 am #
  
  Thank you for your feedback Wesley!
  
  Reply
Jorge Arranz May 23, 2023 at 2:01 am #

Hello, I’ve just started the course!!

Lesson 1:

I want to learn statistics for 3 main reasons:

– Improve how I study experiment results (in general, not only in ML and DL)
– Learn how to use statistics in ML and DL in particular
– Improve my statistical abilities in general (understand the statistics, not just copy formulas and apply them)

Reply
- James Carmichael May 23, 2023 at 6:10 am #
  
  Great feedback Jorge! Let us know if we can address any questions as you work through the content.
  
  Reply
Jorge Arranz May 23, 2023 at 2:15 am #

Lesson 2:

Descriptive statistics: mean and median (central tendency) and standard deviation

Inferential statistics: Fisher test and ANOVA (hypothesis testing) and regression analysis

Reply
Jorge Arranz May 23, 2023 at 2:34 am #

Lesson 3:

´´´
import numpy as np
import matplotlib.pyplot as plt

# Generating random values collection
np.random.seed(98)
values = np.random.randn(1000)

# Obtaining the sum of values
total_value = 0
for value in values:
    total_value += value

# Obtaining the number of elements in collection
n_values = values.shape[0]

# Calculating the mean
mean = total_value / n_values
print(f'Mean: {mean}')

# Displaying collection histogram
plt.hist(values)
plt.show()
´´´

If the above text has not been formatted to code, my apologies

Reply
g July 8, 2023 at 7:00 pm #

Lesson 2:

For Descriptive Statistics: Mean, Mode, Range, Standard Deviation, Variance, ...

For Inferential Statistics: ANOVA, Student-t, exponential, sample test, ...

Reply
- James Carmichael July 9, 2023 at 7:07 am #
  
  Keep up the great work g! Let us know if we can help answer any questions from this course or any of our ebooks.
  
  https://machinelearningmastery.com/products/
  
  Reply
River July 10, 2023 at 1:32 pm #

Hi,

Lesson 2

Descriptive Statistics: Maximum, Minimum, and average;
Inferential Statistics: linear regression, random forest, and Z-test.

Reply
- James Carmichael July 10, 2023 at 11:09 pm #
  
  Thank you for your feedback River! Keep up the great work and let us know if we can help answer any questions you may have.
  
  Reply

