We have an intuition that more observations is better. This is the same intuition behind the idea that if we collect more data, our sample of data will be more representative of the problem domain. There is a theorem in statistics and probability that supports this intuition that is a pillar of both of these […]
Archive | Statistics
A Gentle Introduction to Calculating Normal Summary Statistics
A sample of data is a snapshot from a broader population of all possible observations that could be taken of a domain or generated by a process. Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. A lot is known about the Gaussian distribution, and […]
How to Calculate Correlation Between Variables in Python
Ever looked at your data and thought something was missing or it’s hiding something from you? This is a deep dive guide on revealing those hidden connections and unknown relationships between the variables in your dataset. Why should you care? Machine learning algorithms like linear regression hate surprises. It is essential to discover and quantify […]
Introduction to Random Number Generators for Machine Learning in Python
Randomness is a big part of machine learning. Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions. In order to understand the need for statistical methods in machine learning, you must understand the source of randomness […]
How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python
It is important to both present the expected skill of a machine learning model a well as confidence intervals for that model skill. Confidence intervals provide a range of model skills and a likelihood that the model skill will fall between the ranges when making predictions on new data. For example, a 95% likelihood of […]
How to Report Classifier Performance with Confidence Intervals
Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders. This is important so that you can set the expectations for the model on new data. A common mistake is to report the classification accuracy of the model alone. In this post, you […]
How to Use Statistical Significance Tests to Interpret Machine Learning Results
It is good practice to gather a population of results when comparing two different machine learning algorithms or when comparing the same algorithm with different configurations. Repeating each experimental run 30 or more times gives you a population of results from which you can calculate the mean expected performance, given the stochastic nature of most […]
Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms
A problem with many stochastic machine learning algorithms is that different runs of the same algorithm on the same data return different results. This means that when performing experiments to configure a stochastic algorithm or compare algorithms, you must collect multiple results and use the average performance to summarize the skill of the model. This […]
Machine Learning Terminology from Statistics and Computer Science
Data plays a big part in machine learning. It is important to understand and use the right terminology when talking about data. In this post you will discover exactly how to describe and talk about data in machine learning. After reading this post you will know the terminology and nomenclature used in machine learning to describe […]
Crash Course in Statistics for Machine Learning
You do not need to know statistics before you can start learning and applying machine learning. You can start today. Nevertheless, knowing some statistics can be very helpful to understand the language used in machine learning. Knowing some statistics will eventually be required when you want to start making strong claims about your results. In […]