There may be complex and unknown relationships between the variables in your dataset. It is important to discover and quantify the degree to which variables in your dataset are dependent upon each other. This knowledge can help you better prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose […]

# Archive | Statistics

## How to Use Statistics to Identify Outliers in Data

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem. Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers and often machine learning modeling and model skill in general can […]

## Introduction to Random Number Generators for Machine Learning in Python

Randomness is a big part of machine learning. Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions. In order to understand the need for statistical methods in machine learning, you must understand the source of randomness […]

## How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python

It is important to both present the expected skill of a machine learning model a well as confidence intervals for that model skill. Confidence intervals provide a range of model skills and a likelihood that the model skill will fall between the ranges when making predictions on new data. For example, a 95% likelihood of […]

## How to Report Classifier Performance with Confidence Intervals

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders. This is important so that you can set the expectations for the model on new data. A common mistake is to report the classification accuracy of the model alone. In this post, you […]

## How to Use Statistical Significance Tests to Interpret Machine Learning Results

It is good practice to gather a population of results when comparing two different machine learning algorithms or when comparing the same algorithm with different configurations. Repeating each experimental run 30 or more times gives you a population of results from which you can calculate the mean expected performance, given the stochastic nature of most […]

## Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms

A problem with many stochastic machine learning algorithms is that different runs of the same algorithm on the same data return different results. This means that when performing experiments to configure a stochastic algorithm or compare algorithms, you must collect multiple results and use the average performance to summarize the skill of the model. This […]

## Machine Learning Terminology from Statistics and Computer Science

Data plays a big part in machine learning. It is important to understand and use the right terminology when talking about data. In this post you will discover exactly how to describe and talk about data in machine learning. After reading this post you will know the terminology and nomenclature used in machine learning to describe […]

## Crash Course in Statistics for Machine Learning

You do not need to know statistics before you can start learning and applying machine learning. You can start today. Nevertheless, knowing some statistics can be very helpful to understand the language used in machine learning. Knowing some statistics will eventually be required when you want to start making strong claims about your results. In […]