Archive | Statistics

A Gentle Introduction to the Law of Large Numbers in Machine Learning

By Jason Brownlee on August 8, 2019 in Statistics

We have an intuition that more observations are better. This is the same intuition behind the idea that if we collect more data, our sample of data will be more representative of the problem domain. There is a theorem in statistics and probability that supports this intuition and that is a pillar of both of these […]

Line plot of Gaussian distributions with low and high variance

A Gentle Introduction to Calculating Normal Summary Statistics

By Jason Brownlee on August 8, 2019 in Statistics

A sample of data is a snapshot from a broader population of all possible observations that could be taken of a domain or generated by a process. Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. A lot is known about the Gaussian distribution, and […]

Scatter plot of the test correlation dataset

How to Calculate Correlation Between Variables in Python

By Jason Brownlee on November 17, 2023 in Statistics

Ever looked at your data and thought something was missing, or that it was hiding something from you? This is a deep-dive guide to revealing those hidden connections and unknown relationships between the variables in your dataset. Why should you care? Machine learning algorithms like linear regression hate surprises. It is essential to discover and quantify […]

Introduction to Random Number Generators for Machine Learning in Python

By Jason Brownlee on July 31, 2020 in Statistics

Randomness is a big part of machine learning. Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions. In order to understand the need for statistical methods in machine learning, you must understand the source of randomness […]

How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python

By Jason Brownlee on August 14, 2020 in Statistics

It is important to present both the expected skill of a machine learning model and confidence intervals for that model skill. Confidence intervals provide a range of model skill and a likelihood that the model skill will fall within that range when making predictions on new data. For example, a 95% likelihood of […]

How to Report Classifier Performance with Confidence Intervals

By Jason Brownlee on August 14, 2020 in Statistics

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders. This is important so that you can set the expectations for the model on new data. A common mistake is to report the classification accuracy of the model alone. In this post, you […]

How to Use Statistical Significance Tests to Interpret Machine Learning Results

By Jason Brownlee on August 8, 2019 in Statistics

It is good practice to gather a population of results when comparing two different machine learning algorithms or when comparing the same algorithm with different configurations. Repeating each experimental run 30 or more times gives you a population of results from which you can calculate the mean expected performance, given the stochastic nature of most […]

Zoomed Line Plot of Mean Result with Standard Error Bars and Population Mean

Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms

By Jason Brownlee on August 14, 2020 in Statistics

A problem with many stochastic machine learning algorithms is that different runs of the same algorithm on the same data return different results. This means that when performing experiments to configure a stochastic algorithm or compare algorithms, you must collect multiple results and use the average performance to summarize the skill of the model. This […]

How To Talk About Data in Machine Learning

Machine Learning Terminology from Statistics and Computer Science

By Jason Brownlee on August 8, 2019 in Statistics

Data plays a big part in machine learning. It is important to understand and use the right terminology when talking about data. In this post, you will discover exactly how to describe and talk about data in machine learning. After reading this post, you will know the terminology and nomenclature used in machine learning to describe […]

Crash Course in Statistics for Machine Learning

By Jason Brownlee on August 15, 2020 in Statistics

You do not need to know statistics before you can start learning and applying machine learning. You can start today. Nevertheless, knowing some statistics can be very helpful to understand the language used in machine learning. Knowing some statistics will eventually be required when you want to start making strong claims about your results. In […]
