The post Controlled Experiments in Machine Learning appeared first on Machine Learning Mastery.

Given their complexity, machine learning methods resist formal analysis. Therefore, we must learn about the behavior of algorithms on our specific problems empirically. We do this using controlled experiments.

In this tutorial, you will discover the important role that controlled experiments play in applied machine learning.

After completing this tutorial, you will know:

- The need for systematic discovery via controlled experiments.
- The need to repeat experiments in order to control for the sources of variance.
- Examples of experiments performed in machine learning and the challenge and opportunity they represent.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Systematic Experimentation
- Controlling For Variance
- Experiments in Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In applied machine learning, you must become the scientist and perform systematic experiments.

The answers to questions that you care about, such as what algorithm works best on your data or which input features to use, can only be found through the results of experimental trials.

This is due mainly to the fact that machine learning methods are complex and resist formal methods of analysis.

[…] many learning algorithms are too complex for formal analysis, at least at the level of generality assumed by most theoretical treatments. As a result, empirical studies of the behavior of machine learning algorithms must retain a central role.

— The Experimental Study of Machine Learning, 1991.

In statistics, the choice of a type of experiment is called experimental design, and there are many types of experiments to choose from. For example, you may have heard of the randomized double-blind placebo-controlled experiment as the gold standard for evaluating the effectiveness of medical treatments.

Applied machine learning is special in that we have complete control over the experiment and we can run as few or as many trials as we wish on our computer. Because of the ease of running experiments, it is important that we are running the right types of experiments.

In the natural sciences, one can never control all possible variables. […] As a science of the artificial, machine learning can usually avoid such complications.

— Machine Learning as an Experimental Science, Editorial, 1998.

The type of experiments we wish to perform are called controlled experiments.

These are experiments where all known independent variables are held constant and modified one at a time in order to determine their impact on the dependent variable. The results are compared to a baseline, or no-treatment condition, called a “*control*.” This could be the result of a baseline method like persistence or the Zero Rule algorithm, or the default configuration for the method.

As normally defined, an experiment involves systematically varying one or more independent variables and examining their effect on some dependent variables. Thus, a machine learning experiment requires more than a single learning run; it requires a number of runs carried out under different conditions. In each case, one must measure some aspect of the system’s behavior for comparison across the different conditions.

— Machine Learning as an Experimental Science, Editorial, 1998.
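As an illustration, a minimal sketch of the Zero Rule baseline mentioned above, which simply predicts the most frequent class in the training data, might look like this (the names here are illustrative, not from any particular library):

```python
from collections import Counter

def zero_rule_classifier(train_labels):
    """Return a predict function that always outputs the most frequent
    class in the training labels -- the Zero Rule baseline."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda rows: [most_common for _ in rows]

# Example: a 60/40 class split gives a 60% accuracy baseline.
train = [0, 0, 0, 1, 1]
predict = zero_rule_classifier(train)
print(predict([None, None, None]))  # -> [0, 0, 0]
```

Any candidate model should at least beat this trivial control before its skill is taken seriously.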

In many ways, experiments with machine learning methods have more in common with simulation studies, such as those in physics, than with evaluating medical treatments.

As such, the results of a single experiment are probabilistic, subjected to variance.

There are two main types of variance that we seek to understand in our controlled experiments; they are:

- **Variance in the data**, such as the data used to train the learning algorithm and the data used to evaluate its skill.
- **Variance in the model**, such as the use of randomness in the learning algorithm: random initial weights in neural nets, selection of cut points in bagging, shuffled order of data in stochastic gradient descent, and so on.

A result from a single run or trial of a controlled experiment would be misleading given these sources of variance.

The experiment must control for these sources of variance. This is done by repeating the experimental trial multiple times in order to elicit the range of variance so that we can both report the expected result and the variance in the expected result, e.g. mean and confidence interval.

In simulation studies, such as Monte Carlo methods, the repetition of an experiment is called variance reduction.
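As a sketch of this idea, the snippet below repeats a hypothetical noisy trial many times and reports the mean score with a Gaussian 95% confidence interval. The `run_trial` function is a stand-in for re-running your real experiment (e.g. re-fitting a model with a different random seed):

```python
import math
import random

random.seed(1)

def run_trial():
    """Stand-in for one experimental trial; returns a noisy skill score."""
    return 0.80 + random.gauss(0, 0.02)  # hypothetical accuracy around 80%

scores = [run_trial() for _ in range(30)]

mean = sum(scores) / len(scores)
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))
# 95% confidence interval for the mean (Gaussian approximation)
half_width = 1.96 * std / math.sqrt(len(scores))
print(f"accuracy: {mean:.3f} +/- {half_width:.3f}")
```

Reporting the expected result together with its spread, rather than a single run, is what controls for these sources of variance.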

Experimentation is a key part of applied machine learning.

This is both a challenge to beginners who must learn some rigor and an exciting opportunity for discovery and contribution.

Let’s make this concrete with some examples of the types of controlled experiments you may need to perform:

- **Choose-Features Experiments**. When determining what data features (input variables) are most relevant to a model, the independent variables may be the input features and the dependent variable might be the estimated skill of the model on unseen data.
- **Tune-Model Experiments**. When tuning a machine learning model, the independent variables may be the hyperparameters of the learning algorithm and the dependent variable might be the estimated skill of the model on unseen data.
- **Compare-Models Experiments**. When comparing the performance of machine learning models, the independent variables may be the learning algorithms themselves with a specific configuration and the dependent variable is the estimated skill of the model on unseen data.
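A tune-model experiment of this kind can be sketched as follows. Here `evaluate_model` is a hypothetical stand-in for training and scoring a real model, the learning rate is the independent variable varied one value at a time, and each setting is repeated to control for variance:

```python
import random

random.seed(42)

def evaluate_model(learning_rate):
    """Stand-in for fitting a model and returning skill on held-out data.
    The 'true' optimum is contrived to be at learning_rate = 0.1."""
    best = 0.1
    return 0.9 - abs(learning_rate - best) + random.gauss(0, 0.01)

# Controlled experiment: vary one independent variable (the learning rate),
# repeat each trial to elicit variance, record the dependent variable (skill).
results = {}
for lr in [0.01, 0.1, 0.5]:
    scores = [evaluate_model(lr) for _ in range(10)]
    results[lr] = sum(scores) / len(scores)

best_lr = max(results, key=results.get)
print(best_lr)  # -> 0.1
```

The same structure applies to choose-features and compare-models experiments; only the independent variable changes.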

What makes the experimental focus of applied machine learning so exciting is two fold:

- **Discovery**. You can discover what works best for your specific problem and data. A challenge and an opportunity.
- **Contribution**. You can make broader discoveries in the field, without any specialized knowledge other than rigorous and systematic experimentation.

Using off-the-shelf tools and careful experimental methods, you can make discoveries and contributions.

In summary, machine learning occupies a fortunate position that makes systematic experimentation easy and profitable. […] Although experimental studies are not the only path to understanding, we feel they constitute one of machine learning’s brightest hopes for rapid scientific progress, and we encourage other researchers to join in our field’s evolution toward an experimental science.

— The Experimental Study of Machine Learning, 1991.

This section provides more resources on the topic if you are looking to go deeper.

- The Design and Analysis of Computer Experiments, 2003.
- Empirical Methods for Artificial Intelligence, 1995.

- Machine Learning as an Experimental Science, Editorial, 1998.
- The Experimental Study of Machine Learning, 1991.
- Machine Learning as an Experimental Science (Revisited), 2006.

- Scientific control on Wikipedia
- Design of experiments on Wikipedia
- Blinded experiment on Wikipedia
- Controlling for a variable on Wikipedia
- Computer experiment on Wikipedia
- Variance Reduction on Wikipedia

In this tutorial, you discovered the important role that controlled experiments play in applied machine learning.

Specifically, you learned:

- The need for systematic discovery via controlled experiments.
- The need to repeat experiments in order to control for the sources of variance.
- Examples of experiments performed in machine learning and the challenge and opportunity they represent.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Statistical Significance Tests for Comparing Machine Learning Algorithms appeared first on Machine Learning Mastery.

Models are commonly evaluated using resampling methods like k-fold cross-validation from which mean skill scores are calculated and compared directly. Although simple, this approach can be misleading as it is hard to know whether the difference between mean skill scores is real or the result of a statistical fluke.

Statistical significance tests are designed to address this problem and quantify the likelihood of the samples of skill scores being observed given the assumption that they were drawn from the same distribution. If this assumption, or null hypothesis, is rejected, it suggests that the difference in skill scores is statistically significant.

Although not foolproof, statistical hypothesis testing can improve both your confidence in the interpretation and the presentation of results during model selection.

In this tutorial, you will discover the importance and the challenge of selecting a statistical hypothesis test for comparing machine learning models.

After completing this tutorial, you will know:

- Statistical hypothesis tests can aid in comparing machine learning models and choosing a final model.
- The naive application of statistical hypothesis tests can lead to misleading results.
- Correct use of statistical tests is challenging, and there is some consensus for using McNemar’s test or 5×2 cross-validation with a modified paired Student’s t-test.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- The Problem of Model Selection
- Statistical Hypothesis Tests
- Problem of Choosing a Hypothesis Test
- Summary of Some Findings
- Recommendations


A big part of applied machine learning is model selection.

We can describe this in its simplest form:

Given the evaluation of two machine learning methods on a dataset, which model do you choose?

You choose the model with the best skill.

That is, the model whose estimated skill when making predictions on unseen data is best. This might be maximum accuracy or minimum error in the case of classification and regression problems respectively.

The challenge with selecting the model with the best skill is determining how much you can trust the estimated skill of each model. More generally:

Is the difference in skill between two machine learning models real, or due to statistical chance?

We can use statistical hypothesis testing to address this question.

Generally, a statistical hypothesis test for comparing samples quantifies how likely it is to observe two data samples given the assumption that the samples have the same distribution.

The assumption of a statistical test is called the null hypothesis, and we can calculate statistical measures and interpret them in order to decide whether to reject or fail to reject the null hypothesis.

In the case of selecting models based on their estimated skill, we are interested to know whether there is a real or statistically significant difference between the two models.

- If the result of the test suggests that there is insufficient evidence to reject the null hypothesis, then any observed difference in model skill is likely due to statistical chance.
- If the result of the test suggests that there is sufficient evidence to reject the null hypothesis, then any observed difference in model skill is likely due to a difference in the models.

The results of the test are probabilistic, meaning that it is possible to interpret the result correctly and yet have the finding be wrong, with a Type I or Type II error: briefly, a false positive or false negative finding.

Comparing machine learning models via statistical significance tests imposes some expectations that in turn will impact the types of statistical tests that can be used; for example:

- **Skill Estimate**. A specific measure of model skill must be chosen. This could be classification accuracy (a proportion) or mean absolute error (a summary statistic), which will limit the type of tests that can be used.
- **Repeated Estimates**. A sample of skill scores is required in order to calculate statistics. The repeated training and testing of a given model on the same or different data will impact the type of test that can be used.
- **Distribution of Estimates**. The sample of skill score estimates will have a distribution, perhaps Gaussian or perhaps not. This will determine whether parametric or nonparametric tests can be used.
- **Central Tendency**. Model skill will often be described and compared using a summary statistic such as a mean or median, depending on the distribution of skill scores. The test may or may not take this directly into account.

The results of a statistical test are often a test statistic and a p-value, both of which can be interpreted and used in the presentation of the results in order to quantify the level of confidence or significance in the difference between models. This allows stronger claims to be made as part of model selection than not using statistical hypothesis tests.

Given that using statistical hypothesis tests seems desirable as part of model selection, how do you choose a test that is suitable for your specific use case?

Let’s look at a common example for evaluating and comparing classifiers for a balanced binary classification problem.

It is common practice to evaluate classification methods using classification accuracy, to evaluate each model using 10-fold cross-validation, to assume a Gaussian distribution for the sample of 10 model skill estimates, and to use the mean of the sample as a summary of the model’s skill.

We could require that each classifier evaluated using this procedure be evaluated on exactly the same splits of the dataset via 10-fold cross-validation. This would give samples of matched paired measures between two classifiers, matched because each classifier was evaluated on the same 10 test sets.

We could then select and use the paired Student’s t-test to check if the difference in the mean accuracy between the two models is statistically significant, e.g. reject the null hypothesis that assumes that the two samples have the same distribution.
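As a sketch with hypothetical, contrived accuracy scores, this procedure might look like the following (SciPy's `ttest_rel` implements the paired Student's t-test):

```python
from scipy.stats import ttest_rel

# Hypothetical accuracies for two models on the SAME 10 cross-validation
# folds, so the measures are matched pairs.
model_a = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82, 0.81]
model_b = [0.76, 0.79, 0.75, 0.77, 0.74, 0.78, 0.76, 0.73, 0.78, 0.77]

stat, p = ttest_rel(model_a, model_b)
alpha = 0.05
if p <= alpha:
    print("Difference is significant (reject H0)")
else:
    print("Difference is not significant (fail to reject H0)")
```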

In fact, this is a common way to compare classifiers with perhaps hundreds of published papers using this methodology.

The problem is, a key assumption of the paired Student’s t-test has been violated.

Namely, the observations in each sample are not independent. As part of the k-fold cross-validation procedure, a given observation will be used in the training dataset (k-1) times. This means that the estimated skill scores are dependent, not independent, and in turn that the calculation of the t-statistic in the test will be misleading, along with any interpretation of the statistic and p-value.

This observation requires a careful understanding of both the resampling method used, in this case k-fold cross-validation, and the expectations of the chosen hypothesis test, in this case the paired Student’s t-test. Without this background, the test appears appropriate, a result will be calculated and interpreted, and everything will look fine.

Unfortunately, selecting an appropriate statistical hypothesis test for model selection in applied machine learning is more challenging than it first appears. Fortunately, there is a growing body of research helping to point out the flaws of the naive approaches, and suggesting corrections and alternate methods.

In this section, let’s take a look at some of the research into the selection of appropriate statistical significance tests for model selection in machine learning.

Perhaps the seminal work on this topic is the 1998 paper titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms” by Thomas Dietterich.

It’s an excellent paper on the topic and a recommended read. It first presents a useful framework for thinking about the points in a machine learning project where a statistical hypothesis test may be required, discusses common violations of the assumptions of statistical tests relevant to comparing classifiers, and finishes with an empirical evaluation of methods to confirm the findings.

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task.

The focus of the selection and empirical evaluation of statistical hypothesis tests in the paper is the calibration of Type I error, or false positives. That is, selecting a test that minimizes the chance of suggesting a significant difference when no such difference exists.

There are a number of important findings in this paper.

The first finding is that using paired Student’s t-test on the results of skill estimated via random resamples of a training dataset should never be done.

… we can confidently conclude that the resampled t test should never be employed.

The assumptions of the paired t-test are violated in the case of random resampling and in the case of k-fold cross-validation (as noted above). Nevertheless, in the case of k-fold cross-validation, the t-test will be optimistic, resulting in a higher Type I error, but only a modest Type II error. This means that this combination could be used in cases where avoiding Type II errors is more important than succumbing to a Type I error.

The 10-fold cross-validated t test has high type I error. However, it also has high power, and hence, it can be recommended in those cases where type II error (the failure to detect a real difference between algorithms) is more important.

Dietterich recommends the McNemar’s statistical hypothesis test in cases where there is a limited amount of data and each algorithm can only be evaluated once.

McNemar’s test is like the Chi-Squared test, and in this case is used to determine whether the differences in observed proportions in the algorithms’ contingency table are significantly different from the expected proportions. This is a useful finding in the case of large deep learning neural networks that can take days or weeks to train.

Our experiments lead us to recommend […] McNemar’s test, for situations where the learning algorithms can be run only once.
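Because only the disagreements between the two classifiers are informative, McNemar's statistic can be computed directly from the off-diagonal counts of the contingency table. A sketch with hypothetical counts:

```python
from scipy.stats import chi2

# Hypothetical counts from evaluating two classifiers once on the same
# test set. Only the disagreements carry information about which
# classifier is better:
#   b = cases only classifier A predicted correctly
#   c = cases only classifier B predicted correctly
b, c = 4, 20

# McNemar's statistic with continuity correction, compared against a
# chi-squared distribution with 1 degree of freedom.
stat = (abs(b - c) - 1) ** 2 / (b + c)
p = chi2.sf(stat, df=1)
print(f"statistic={stat:.3f}, p={p:.4f}")
```

A small p-value here suggests the two classifiers make errors in significantly different proportions.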

Dietterich also recommends a resampling method of his own devising called 5×2 cross-validation that involves 5 repeats of 2-fold cross-validation.

Two folds are chosen to ensure that each observation appears only in the train or test dataset for a single estimate of model skill. A paired Student’s t-test is used on the results, updated to better reflect the limited degrees of freedom given the dependence between the estimated skill scores.

Our experiments lead us to recommend […] 5 x 2cv t test, for situations in which the learning algorithms are efficient enough to run ten times

The use of either McNemar’s test or 5×2 cross-validation has become a staple recommendation for much of the 20 years since the paper was published.

Nevertheless, further improvements have been made to better correct the paired Student’s t-test for the violation of the independence assumption from repeated k-fold cross-validation.

Two important papers among many include:

Claude Nadeau and Yoshua Bengio propose a further correction in their 2003 paper titled “Inference for the Generalization Error“. It’s a dense paper and not recommended for the faint of heart.

This analysis allowed us to construct two variance estimates that take into account both the variability due to the choice of the training sets and the choice of the test examples. One of the proposed estimators looks similar to the cv method (Dietterich, 1998) and is specifically designed to overestimate the variance to yield conservative inference.

Remco Bouckaert and Eibe Frank, in their 2004 paper titled “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms,” take a different perspective and consider the ability to replicate results as more important than Type I or Type II errors.

In this paper we argue that the replicability of a test is also of importance. We say that a test has low replicability if its outcome strongly depends on the particular random partitioning of the data that is used to perform it

Surprisingly, they recommend using either 100 runs of random resampling or 10×10-fold cross-validation with the Nadeau and Bengio correction to the paired Student-t test in order to achieve good replicability.

The latter approach is recommended in Ian Witten and Eibe Frank’s book and in their open-source data mining platform Weka, referring to the Nadeau and Bengio correction as the “*corrected resampled t-test*“.

Various modifications of the standard t-test have been proposed to circumvent this problem, all of them heuristic and lacking sound theoretical justification. One that appears to work well in practice is the corrected resampled t-test. […] The same modified statistic can be used with repeated cross-validation, which is just a special case of repeated holdout in which the individual test sets for one cross-validation do not overlap.

— Page 159, Chapter 5, Credibility: Evaluating What’s Been Learned, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, 2011.

There are no silver bullets when it comes to selecting a statistical significance test for model selection in applied machine learning.

Let’s look at five approaches that you may use on your machine learning project to compare classifiers.

If you have near unlimited data, gather ten separate train and test datasets to calculate ten truly independent skill scores for each method.

You may then correctly apply the paired Student’s t-test. This situation is rare, as we are often working with small data samples.

… the assumption that there is essentially unlimited data so that several independent datasets of the right size can be used. In practice there is usually only a single dataset of limited size. What can be done?

— Page 158, Chapter 5, Credibility: Evaluating What’s Been Learned, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, 2011.

The naive approach is to use 10-fold cross-validation with an unmodified paired Student’s t-test.

It has good repeatability relative to other methods and a modest Type II error, but it is known to have a high Type I error.

The experiments also suggest caution in interpreting the results of the 10-fold cross-validated t test. This test has an elevated probability of type I error (as much as twice the target level), although it is not nearly as severe as the problem with the resampled t test.

— Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998.

It’s an option, but it’s very weakly recommended.

The two-decade-old recommendations of McNemar’s test for single-run classification accuracy results, and of 5×2-fold cross-validation with a modified paired Student’s t-test, still stand in general.

Further, the Nadeau and Bengio correction to the test statistic may be used with 5×2-fold cross-validation or 10×10-fold cross-validation, as recommended by the developers of Weka.

A challenge in using the modified t-statistic is that there is no off-the-shelf implementation (e.g. in SciPy), requiring the use of third-party code and the risks that this entails. You may have to implement it yourself.
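As a sketch of such an implementation, the corrected resampled t-test described by Nadeau and Bengio (and used in Weka) inflates the variance term by a factor that accounts for overlapping training sets. The code below follows the published formula with hypothetical score differences; it is a sketch, not a vetted library implementation:

```python
import math
from scipy.stats import t as t_dist

def corrected_resampled_ttest(diffs, n_train, n_test):
    """Corrected resampled t-test (Nadeau & Bengio correction).
    diffs: per-resample skill differences between two models.
    n_train, n_test: sizes of the train and test portions per resample."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    # variance inflated by n_test/n_train to account for the overlap
    # between training sets across resamples
    t_stat = mean / math.sqrt((1.0 / k + n_test / n_train) * var)
    p = 2 * t_dist.sf(abs(t_stat), df=k - 1)
    return t_stat, p

# Hypothetical differences from 10 resamples with a 90/10 train/test split
diffs = [0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.01, 0.03, 0.02]
t_stat, p_value = corrected_resampled_ttest(diffs, n_train=90, n_test=10)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
```

Note how the `n_test / n_train` term shrinks the t-statistic relative to the naive paired test, making the inference more conservative.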

The availability and complexity of a chosen statistical method is an important consideration, said well by Gitte Vanwinckelen and Hendrik Blockeel in their 2012 paper titled “On Estimating Model Accuracy with Repeated Cross-Validation“:

While these methods are carefully designed, and are shown to improve upon previous methods in a number of ways, they suffer from the same risk as previous methods, namely that the more complex a method is, the higher the risk that researchers will use it incorrectly, or interpret the result incorrectly.

We can use a nonparametric test that makes fewer assumptions, such as not assuming that the distribution of the skill scores is Gaussian.

One example is the Wilcoxon signed-rank test, the nonparametric version of the paired Student’s t-test. This test has less statistical power than the paired t-test when the t-test’s assumptions hold, but more power when they are violated, such as the assumption of independence.

This statistical hypothesis test is recommended for comparing algorithms across different datasets by Janez Demsar in his 2006 paper “Statistical Comparisons of Classifiers over Multiple Data Sets“.

We therefore recommend using the Wilcoxon test, unless the t-test assumptions are met, either because we have many data sets or because we have reasons to believe that the measure of performance across data sets is distributed normally.

Although the test is nonparametric, it still assumes that the observations within each sample are independent (e.g. iid), and using k-fold cross-validation would create dependent samples and violate this assumption.
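A sketch using SciPy's implementation, with hypothetical skill scores for two algorithms each evaluated once on the same twelve datasets (the independent-sample setting the test expects):

```python
from scipy.stats import wilcoxon

# Hypothetical skill scores for two algorithms, each evaluated once per
# dataset on the same 12 independent datasets.
algo_a = [0.81, 0.74, 0.90, 0.66, 0.85, 0.78,
          0.92, 0.70, 0.88, 0.76, 0.83, 0.69]
algo_b = [0.78, 0.70, 0.85, 0.68, 0.80, 0.75,
          0.88, 0.66, 0.85, 0.72, 0.80, 0.65]

# Nonparametric paired test: ranks the absolute differences rather than
# assuming the differences are Gaussian.
stat, p = wilcoxon(algo_a, algo_b)
print(f"statistic={stat}, p={p:.4f}")
```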

Instead of statistical hypothesis tests, estimation statistics can be calculated, such as confidence intervals. These would suffer from similar problems where the assumption of independence is violated given the resampling methods by which the models are evaluated.

Tom Mitchell makes a similar recommendation in his 1997 book, suggesting to take the results of statistical hypothesis tests as heuristic estimates and seek confidence intervals around estimates of model skill:

To summarize, no single procedure for comparing learning methods based on limited data satisfies all the constraints we would like. It is wise to keep in mind that statistical models rarely fit perfectly the practical constraints in testing learning algorithms when available data is limited. Nevertheless, they do provide approximate confidence intervals that can be of great help in interpreting experimental comparisons of learning methods.

— Page 150, Chapter 5, Evaluating Hypotheses, Machine Learning, 1997.

Statistical methods such as the bootstrap can be used to calculate defensible nonparametric confidence intervals that can be used to both present results and compare classifiers. This is a simple and effective approach that you can always fall back upon and that I recommend in general.

In fact confidence intervals have received the most theoretical study of any topic in the bootstrap area.

— Page 321, An Introduction to the Bootstrap, 1994.
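A minimal sketch of a bootstrap percentile confidence interval for classification accuracy, using hypothetical per-example outcomes:

```python
import random

random.seed(7)

# Hypothetical per-example correctness (1 = correct) on a 100-example
# test set, giving an observed accuracy of 82%.
outcomes = [1] * 82 + [0] * 18

# Nonparametric bootstrap: resample the test set with replacement and
# recompute accuracy to build a distribution of the statistic.
stats = []
for _ in range(1000):
    sample = [random.choice(outcomes) for _ in range(len(outcomes))]
    stats.append(sum(sample) / len(sample))
stats.sort()

# 95% confidence interval via the percentile method
lower = stats[int(0.025 * len(stats))]
upper = stats[int(0.975 * len(stats))]
print(f"accuracy 95% CI: [{lower:.2f}, {upper:.2f}]")
```

The resulting interval can be reported alongside the point estimate, and two classifiers can be compared by bootstrapping the difference in their accuracies in the same way.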

This section lists some ideas for extending the tutorial that you may wish to explore.

- Find and list three research papers that incorrectly use the unmodified paired Student’s t-test to compare and choose a machine learning model.
- Summarize the framework for using statistical hypothesis tests in a machine learning project presented in Thomas Dietterich’s 1998 paper.
- Find and list three research papers that correctly use either the McNemar’s test or 5×2 Cross-Validation for comparison and choose a machine learning model.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998.
- Inference for the Generalization Error, 2003.
- Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms, 2004.
- On estimating model accuracy with repeated cross-validation, 2012.
- Statistical Comparisons of Classifiers over Multiple Data Sets, 2006.

- Chapter 5, Evaluating Hypotheses, Machine Learning, 1997.
- Chapter 5, Credibility: Evaluating What’s Been Learned, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, 2011.
- An Introduction to the Bootstrap, 1994.

- Student’s t-test on Wikipedia
- Cross-validation (statistics) on Wikipedia
- McNemar’s test on Wikipedia
- Wilcoxon signed-rank test on Wikipedia

- For model selection/comparison, what kind of test should I use?
- How to perform hypothesis testing for comparing different classifiers
- Wilcoxon rank sum test methodology
- How to choose between t-test or non-parametric test e.g. Wilcoxon in small samples

In this tutorial, you discovered the importance and the challenge of selecting a statistical hypothesis test for comparing machine learning models.

Specifically, you learned:

- Statistical hypothesis tests can aid in comparing machine learning models and choosing a final model.
- The naive application of statistical hypothesis tests can lead to misleading results.
- Correct use of statistical tests is challenging, and there is some consensus for using McNemar’s test or 5×2 cross-validation with a modified paired Student’s t-test.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to the Chi-Squared Test for Machine Learning appeared first on Machine Learning Mastery.

Determining which input variables are most relevant to the outcome being predicted is the problem of feature selection.

In the case of classification problems where input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent or independent of the input variables. If independent, then the input variable is a candidate for a feature that may be irrelevant to the problem and removed from the dataset.

The Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

In this tutorial, you will discover the chi-squared statistical hypothesis test for quantifying the independence of pairs of categorical variables.

After completing this tutorial, you will know:

- Pairs of categorical variables can be summarized using a contingency table.
- The chi-squared test can compare an observed contingency table to an expected table and determine if the categorical variables are independent.
- How to calculate and interpret the chi-squared test for categorical variables in Python.

Let’s get started.

**Update Jun/2018**: Minor typo fix in the interpretation of the critical values from the test (thanks Andrew).

This tutorial is divided into 3 parts; they are:

- Contingency Table
- Pearson’s Chi-Squared Test
- Example Chi-Squared Test


A categorical variable is a variable that may take on one of a set of labels.

An example might be sex, which may be summarized as male or female. The variable is ‘*sex*‘ and the labels or factors of the variable are ‘*male*‘ and ‘*female*‘ in this case.

We may wish to look at a summary of a categorical variable as it pertains to another categorical variable. For example, sex and interest, where interest may have the labels ‘*science*‘, ‘*math*‘, or ‘*art*‘. We can collect observations from people with regard to these two categorical variables; for example:

```
Sex,    Interest
Male,   Art
Female, Math
Male,   Science
Male,   Math
...
```

We can summarize the collected observations in a table with one variable corresponding to columns and another variable corresponding to rows. Each cell in the table corresponds to the count or frequency of observations that correspond to the row and column categories.

Historically, a table summarization of two categorical variables in this form is called a contingency table.

For example, the *Sex=rows* and *Interest=columns* table with contrived counts might look as follows:

```
        Science,  Math,  Art
Male         20,    30,   15
Female       20,    15,   30
```

Karl Pearson called this a contingency table because the intent is to help determine whether one variable is contingent upon, or depends upon, the other variable. For example, does an interest in math or science depend on sex, or are they independent?

This is challenging to determine from the table alone; instead, we can use a statistical method called the Pearson’s Chi-Squared test.

The Pearson’s Chi-Squared test, or just Chi-Squared test for short, is named for Karl Pearson, although there are variations on the test.

The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek capital letter Chi (X) pronounced “ki” as in kite.

Given the Sex/Interest example above, the number of observations for each category (such as male and female) may or may not be the same. Nevertheless, we can calculate the expected frequency of observations in each Interest group and see whether the partitioning of interests by Sex results in similar or different frequencies.

The Chi-Squared test does this for a contingency table, first calculating the expected frequencies for the groups, then determining whether the division of the groups, called the observed frequencies, matches the expected frequencies.

The result of the test is a test statistic that has a chi-squared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.

When observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, this term is small. Large values of X^2 indicate that observed and expected frequencies are far apart. Small values of X^2 mean the opposite: observeds are close to expecteds. So X^2 does give a measure of the distance between observed and expected frequencies.

— Page 525, Statistics, Fourth Edition, 2007.

The variables are considered independent if the observed and expected frequencies are similar; that is, the levels of the variables do not interact and are not dependent.

The chi-square test of independence works by comparing the categorically coded data that you have collected (known as the observed frequencies) with the frequencies that you would expect to get in each cell of a table by chance alone (known as the expected frequencies).

— Page 162, Statistics in Plain English, Third Edition, 2010.

We can interpret the test statistic in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:

- **If Statistic >= Critical Value**: significant result, reject null hypothesis (H0), dependent.
- **If Statistic < Critical Value**: not significant result, fail to reject null hypothesis (H0), independent.

The degrees of freedom for the chi-squared distribution is calculated based on the size of the contingency table as:

```
degrees of freedom: (rows - 1) * (cols - 1)
```

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

- **If p-value <= alpha**: significant result, reject null hypothesis (H0), dependent.
- **If p-value > alpha**: not significant result, fail to reject null hypothesis (H0), independent.

For the test to be effective, the expected frequency should be at least five in each cell of the contingency table.
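
Before turning to library code, the mechanics described above can be sketched by hand. The following example (a rough sketch, not a substitute for a library implementation) computes the expected frequencies and the chi-squared statistic for the contrived Sex/Interest table:

```python
# manual sketch of the chi-squared computation for the contrived
# Sex/Interest table (rows: Male, Female; cols: Science, Math, Art)
table = [[20, 30, 15],
         [20, 15, 30]]

rows, cols = len(table), len(table[0])
row_totals = [sum(row) for row in table]
col_totals = [sum(table[r][c] for r in range(rows)) for c in range(cols)]
grand_total = sum(row_totals)

# expected frequency for each cell: row total * column total / grand total
expected = [[row_totals[r] * col_totals[c] / grand_total
             for c in range(cols)] for r in range(rows)]

# test statistic: sum of (observed - expected)^2 / expected over all cells
stat = sum((table[r][c] - expected[r][c]) ** 2 / expected[r][c]
           for r in range(rows) for c in range(cols))
dof = (rows - 1) * (cols - 1)
print('stat=%.3f, dof=%d' % (stat, dof))
```

Here the statistic (10.0 with 2 degrees of freedom) exceeds the 5% critical value of 5.991, so this contrived table would suggest the variables are dependent.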

Next, let’s look at how we can calculate the chi-squared test.

The Pearson’s chi-squared test for independence can be calculated in Python using the chi2_contingency() SciPy function.

The function takes an array as input representing the contingency table for the two categorical variables. It returns the calculated statistic and p-value for interpretation as well as the calculated degrees of freedom and table of expected frequencies.

```python
stat, p, dof, expected = chi2_contingency(table)
```

We can interpret the statistic by retrieving the critical value from the chi-squared distribution for the probability and number of degrees of freedom.

For example, a probability of 95% can be used, suggesting that the finding of the test is quite likely given the assumption of the test that the variables are independent. If the statistic is less than or equal to the critical value, we fail to reject this assumption; otherwise, it can be rejected.

```python
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

We can also interpret the p-value by comparing it to a chosen significance level, which would be 5%, calculated by inverting the 95% probability used in the critical value interpretation.

```python
# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

We can tie all of this together and demonstrate the chi-squared significance test using a contrived contingency table.

A contingency table is defined below that has a different number of observations for each population (row), but a similar proportion across each group (column). Given the similar proportions, we would expect the test to find that the groups are similar and that the variables are independent (fail to reject the null hypothesis, or H0).

```python
table = [[10, 20, 30],
         [6,  9, 17]]
```

The complete example is listed below.

```python
# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# contingency table
table = [[10, 20, 30],
         [6,  9, 17]]
print(table)
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

Running the example first prints the contingency table. The test is calculated and the degrees of freedom (*dof*) is reported as 2, which makes sense given:

```
degrees of freedom: (rows - 1) * (cols - 1)
degrees of freedom: (2 - 1) * (3 - 1)
degrees of freedom: 1 * 2
degrees of freedom: 2
```

Next, the calculated expected frequency table is printed and we can see that indeed the observed contingency table does appear to match via an eyeball check of the numbers.

The critical value is calculated and interpreted, finding that indeed the variables are independent (fail to reject H0). The interpretation of the p-value makes the same finding.

```
[[10, 20, 30], [6, 9, 17]]
dof=2
[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]
probability=0.950, critical=5.991, stat=0.272
Independent (fail to reject H0)
significance=0.050, p=0.873
Independent (fail to reject H0)
```

This section lists some ideas for extending the tutorial that you may wish to explore.

- Update the chi-squared test to use your own contingency table.
- Write a function to report on the independence given observations from two categorical variables.
- Load a standard machine learning dataset containing categorical variables and report on the independence of each.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 14, The Chi-Square Test of Independence, Statistics in Plain English, Third Edition, 2010.
- Chapter 28, The Chi-Square Test, Statistics, Fourth Edition, 2007.

- Chi-squared test on Wikipedia
- Pearson’s chi-squared test on Wikipedia
- Contingency table on Wikipedia
- How is chi test used for feature selection in machine learning? on Quora

In this tutorial, you discovered the chi-squared statistical hypothesis test for quantifying the independence of pairs of categorical variables.

Specifically, you learned:

- Pairs of categorical variables can be summarized using a contingency table.
- The chi-squared test can compare an observed contingency table to an expected table and determine if the categorical variables are independent.
- How to calculate and interpret the chi-squared test for categorical variables in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Chi-Squared Test for Machine Learning appeared first on Machine Learning Mastery.

The post How to Calculate the 5-Number Summary for Your Data in Python appeared first on Machine Learning Mastery.

The mean and standard deviation are used to summarize data with a Gaussian distribution, but may not be meaningful, or could even be misleading, if your data sample has a non-Gaussian distribution.

In this tutorial, you will discover the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

After completing this tutorial, you will know:

- Data summarization, such as calculating the mean and standard deviation, is only meaningful for the Gaussian distribution.
- The five-number summary can be used to describe a data sample with any distribution.
- How to calculate the five-number summary in Python.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Nonparametric Data Summarization
- Five-Number Summary
- How to Calculate the Five-Number Summary
- Use of the Five-Number Summary


Data summarization techniques provide a way to describe the distribution of data using a few key measurements.

The most common example of data summarization is the calculation of the mean and standard deviation for data that has a Gaussian distribution. With these two parameters alone, you can understand and re-create the distribution of the data. The data summary can compress as few as tens or as many as millions of individual observations.

The problem is that the mean and standard deviation are not a useful summary for data that does not have a Gaussian distribution. Technically, you can calculate these quantities, but they do not summarize the data distribution; in fact, they can be very misleading.

In the case of data that does not have a Gaussian distribution, you can summarize the data sample using the five-number summary.

The five-number summary, or 5-number summary for short, is a non-parametric data summarization technique.

It is sometimes called the Tukey 5-number summary because it was recommended by John Tukey. It can be used to describe the distribution of data samples for data with any distribution.

As a standard summary for general use, the 5-number summary provides about the right amount of detail.

— Page 37, Understanding Robust and Exploratory Data Analysis, 2000.

The five-number summary involves the calculation of five summary statistics, namely:

- **Median**: The middle value in the sample, also called the 50th percentile or the 2nd quartile.
- **1st Quartile**: The 25th percentile.
- **3rd Quartile**: The 75th percentile.
- **Minimum**: The smallest observation in the sample.
- **Maximum**: The largest observation in the sample.

A quartile is an observed value at a point that aids in splitting the ordered data sample into four equally sized parts. The median, or 2nd Quartile, splits the ordered data sample into two parts, and the 1st and 3rd quartiles split each of those halves into quarters.

A percentile is an observed value at a point that aids in splitting the ordered data sample into 100 equally sized portions. Quartiles are often also expressed as percentiles.

Both the quartile and percentile values are examples of rank statistics that can be calculated on a data sample with any distribution. They are used to quickly summarize how much of the data in the distribution is behind or in front of a given observed value. For example, half of the observations are behind and in front of the median of a distribution.

Note that quartiles are also calculated in the box and whisker plot, a nonparametric method to graphically summarize the distribution of a data sample.

Calculating the five-number summary involves finding the observations for each quartile as well as the minimum and maximum observed values from the data sample.

If there is no specific value in the ordered data sample for the quartile, such as if there are an even number of observations and we are trying to find the median, then we can calculate the mean of the two closest values, such as the two middle values.
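
For example, a minimal sketch of this rule for the median of an even-sized sample (using a tiny hypothetical sample):

```python
# with an even number of observations there is no single middle value,
# so the median is the mean of the two middle values of the ordered sample
data = [4, 1, 3, 2]      # hypothetical sample with an even count
ordered = sorted(data)   # [1, 2, 3, 4]
n = len(ordered)
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
print(median)
```

The standard library's statistics.median() applies the same rule.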

We can calculate arbitrary percentile values in Python using the percentile() NumPy function. We can use this function to calculate the 1st, 2nd (median), and 3rd quartile values. The function takes both an array of observations and a floating point value specifying the percentile to calculate in the range of 0 to 100. It can also take a list of percentile values to calculate multiple percentiles; for example:

```python
quartiles = percentile(data, [25, 50, 75])
```

By default, the function will calculate a linear interpolation (average) between observations if needed, such as in the case of calculating the median on a sample with an even number of values.

The NumPy functions min() and max() can be used to return the smallest and largest values in the data sample; for example:

```python
data_min, data_max = data.min(), data.max()
```

We can put all of this together.

The example below generates a data sample drawn from a uniform distribution between 0 and 1 and summarizes it using the five-number summary.

```python
# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand
# generate data sample
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)
```

Running the example generates the data sample and calculates the five-number summary to describe the sample distribution.

We can see that the spread of observations is close to our expectations, showing 0.27 for the 25th percentile, 0.53 for the 50th percentile, and 0.76 for the 75th percentile, close to the idealized values of 0.25, 0.50, and 0.75 respectively.

```
Min: 0.000
Q1: 0.277
Median: 0.532
Q3: 0.766
Max: 1.000
```

The five-number summary can be calculated for a data sample with any distribution.

This includes data that has a known distribution, such as a Gaussian or Gaussian-like distribution.

I would recommend always calculating the five-number summary, and only moving on to distribution-specific summaries, such as the mean and standard deviation for the Gaussian, in the case that you can identify the distribution to which the data belongs.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Describe three examples in a machine learning project where a five-number summary could be calculated.
- Generate a data sample with a Gaussian distribution and calculate the five-number summary.
- Write a function to calculate a 5-number summary for any data sample.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.

Specifically, you learned:

- Data summarization, such as calculating the mean and standard deviation, is only meaningful for the Gaussian distribution.
- The five-number summary can be used to describe a data sample with any distribution.
- How to calculate the five-number summary in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Statistical Sampling and Resampling appeared first on Machine Learning Mastery.

Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter, whereas data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and help to quantify the uncertainty of the estimate.

Both data sampling and data resampling are methods that are required in a predictive modeling problem.

In this tutorial, you will discover statistical sampling and statistical resampling methods for gathering and making best use of data.

After completing this tutorial, you will know:

- Sampling is an active process of gathering observations with the intent of estimating a population variable.
- Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of a population parameter.
- Resampling methods, in fact, make use of a nested statistical sampling method.

Let’s get started.

This tutorial is divided into 2 parts; they are:

- Statistical Sampling
- Statistical Resampling


Each row of data represents an observation about something in the world.

When working with data, we often do not have access to all possible observations. This could be for many reasons; for example:

- It may be difficult or expensive to make more observations.
- It may be challenging to gather all observations together.
- More observations are expected to be made in the future.

Observations made in a domain represent samples of some broader idealized and unknown population of all possible observations that could be made in the domain. This is a useful conceptualization as we can see the separation and relationship between observations and the idealized population.

We can also see that, even if we intend to use big data infrastructure on all available data, the data still represents a sample of observations from an idealized population.

Nevertheless, we may wish to estimate properties of the population. We do this by using samples of observations.

Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.

— Page 1, Sampling, Third Edition, 2012.

Statistical sampling is the process of selecting subsets of examples from a population with the objective of estimating properties of the population.

Sampling is an active process. There is a goal of estimating population properties and control over how the sampling is to occur. This control falls short of influencing the process that generates each observation, such as performing an experiment. As such, sampling as a field sits neatly between pure uncontrolled observation and controlled experimentation.

Sampling is usually distinguished from the closely related field of experimental design, in that in experiments one deliberately perturbs some part of the population in order to see what the effect of that action is. […] Sampling is also usually distinguished from observational studies, in which one has little or no control over how the observations on the population were obtained.

— Pages 1-2, Sampling, Third Edition, 2012.

There are many benefits to sampling compared to working with fuller or complete datasets, including reduced cost and greater speed.

Performing sampling requires that you carefully define your population and the method by which you will select (and possibly reject) observations to be a part of your data sample. This may very well be defined by the population parameters that you wish to estimate using the sample.

Some aspects to consider prior to collecting a data sample include:

- **Sample Goal**. The population property that you wish to estimate using the sample.
- **Population**. The scope or domain from which observations could theoretically be made.
- **Selection Criteria**. The methodology that will be used to accept or reject observations in your sample.
- **Sample Size**. The number of observations that will constitute the sample.

Some obvious questions […] are how best to obtain the sample and make the observations and, once the sample data are in hand, how best to use them to estimate the characteristics of the whole population. Obtaining the observations involves questions of sample size, how to select the sample, what observational methods to use, and what measurements to record.

— Page 1, Sampling, Third Edition, 2012.

Statistical sampling is a large field of study, but in applied machine learning, there may be three types of sampling that you are likely to use: simple random sampling, systematic sampling, and stratified sampling.

- **Simple Random Sampling**: Samples are drawn with a uniform probability from the domain.
- **Systematic Sampling**: Samples are drawn using a pre-specified pattern, such as at intervals.
- **Stratified Sampling**: Samples are drawn within pre-specified categories (i.e. strata).

Although these are the more common types of sampling that you may encounter, there are other techniques.
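
As a rough illustration, the three types of sampling can be sketched with Python's standard random module (the population, strata, and sample sizes here are all hypothetical):

```python
import random

random.seed(1)  # fix the seed so the draws are repeatable
population = list(range(100))  # a hypothetical population of 100 observations

# simple random sampling: each observation drawn with uniform probability
simple = random.sample(population, 10)

# systematic sampling: every k-th observation from a random start
k = 10
start = random.randrange(k)
systematic = population[start::k]

# stratified sampling: draw within pre-specified strata (here: even/odd)
strata = {'even': [x for x in population if x % 2 == 0],
          'odd':  [x for x in population if x % 2 == 1]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(len(simple), len(systematic), len(stratified))
```

Each scheme returns a sample of 10 observations, but the way the observations are selected differs.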

Sampling requires that we make a statistical inference about the population from a small set of observations.

We can generalize properties from the sample to the population. This process of estimation and generalization is much faster than working with all possible observations, but will contain errors. In many cases, we can quantify the uncertainty of our estimates and add error bars, such as confidence intervals.

There are many ways to introduce errors into your data sample.

Two main types of errors include selection bias and sampling error.

- **Selection Bias**. Caused when the method of drawing observations skews the sample in some way.
- **Sampling Error**. Caused due to the random nature of drawing observations skewing the sample in some way.

Other types of errors may be present, such as systematic errors in the way observations or measurements are made.

In these cases and more, the statistical properties of the sample may be different from what would be expected in the idealized population, which in turn may impact the properties of the population that are being estimated.

Simple methods, such as reviewing raw observations, summary statistics, and visualizations, can help expose simple errors, such as measurement corruption and the over- or underrepresentation of a class of observations.

Nevertheless, care must be taken both when sampling and when drawing conclusions about the population while sampling.

Once we have a data sample, it can be used to estimate the population parameter.

The problem is that we only have a single estimate of the population parameter, with little idea of the variability or uncertainty in the estimate.

One way to address this is by estimating the population parameter multiple times from our data sample. This is called resampling.

Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).

Resampling methods are very easy to use, requiring little mathematical knowledge. They are methods that are easy to understand and implement compared to specialized statistical methods that may require deep technical skill in order to select and interpret.

The resampling methods […] are easy to learn and easy to apply. They require no mathematics beyond introductory high-school algebra, and are applicable in an exceptionally broad range of subject areas.

— Page xiii, Resampling Methods: A Practical Guide to Data Analysis, 2005.

A downside of the methods is that they can be computationally very expensive, requiring tens, hundreds, or even thousands of resamples in order to develop a robust estimate of the population parameter.

The key idea is to resample from the original data — either directly or via a fitted model — to create replicate datasets, from which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation. Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods.

— Page 3, Bootstrap Methods and their Application, 1997.

Each new subsample from the original data sample is used to estimate the population parameter. The sample of estimated population parameters can then be considered with statistical tools in order to quantify the expected value and variance, providing measures of the uncertainty of the estimate.

Statistical sampling methods can be used in the selection of a subsample from the original sample.

A key difference is that the process must be repeated multiple times. The problem with this is that there will be some relationship between the subsamples, as some observations will be shared across them. This means that the subsamples and the estimated population parameters are not strictly independent and identically distributed. This has implications for statistical tests performed on the sample of estimated population parameters downstream, i.e. paired statistical tests may be required.

Two commonly used resampling methods that you may encounter are k-fold cross-validation and the bootstrap.

- **Bootstrap**. Samples are drawn from the dataset with replacement (allowing the same sample to appear more than once in the sample), where those instances not drawn into the data sample may be used for the test set.
- **k-fold Cross-Validation**. A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held-out test set, leaving the remaining groups as the training set.

The k-fold cross-validation method specifically lends itself to use in the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.

Generally, resampling techniques for estimating model performance operate similarly: a subset of samples are used to fit a model and the remaining samples are used to estimate the efficacy of the model. This process is repeated multiple times and the results are aggregated and summarized. The differences in techniques usually center around the method in which subsamples are chosen.

— Page 69, Applied Predictive Modeling, 2013.

The bootstrap method can be used for the same purpose, but is a more general and simpler method intended for estimating a population parameter.
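
The two procedures can be sketched in plain Python (a toy illustration on a hypothetical sample; in practice libraries such as scikit-learn provide tested implementations):

```python
import random

random.seed(1)
data = list(range(20))  # a small hypothetical data sample

# bootstrap: draw n observations with replacement; observations never
# drawn form the out-of-bag set, which can serve as a test set
boot = [random.choice(data) for _ in range(len(data))]
out_of_bag = [x for x in data if x not in boot]

# k-fold: shuffle, then partition into k non-overlapping groups
k = 4
shuffled = data[:]
random.shuffle(shuffled)
folds = [shuffled[i::k] for i in range(k)]

print(len(boot), len(out_of_bag), [len(f) for f in folds])
```

Note that the k folds cover every observation exactly once, whereas the bootstrap sample typically repeats some observations and misses others.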

This section lists some ideas for extending the tutorial that you may wish to explore.

- List two examples where statistical sampling is required in a machine learning project.
- List two examples when statistical resampling is required in a machine learning project.
- Find a paper that uses a resampling method that in turn uses a nested statistical sampling method (hint: k-fold cross validation and stratified sampling).

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Sampling, Third Edition, 2012.
- Sampling Techniques, 3rd Edition, 1977.
- Resampling Methods: A Practical Guide to Data Analysis, 2005.
- An Introduction to the Bootstrap, 1994.
- Bootstrap Methods and their Application, 1997.
- Applied Predictive Modeling, 2013.

- Sample (statistics) on Wikipedia
- Simple random sample on Wikipedia
- Systematic sampling on Wikipedia
- Stratified sampling on Wikipedia
- Resampling (statistics) on Wikipedia
- Bootstrapping (statistics) on Wikipedia
- Cross-validation (statistics) on Wikipedia

In this tutorial, you discovered statistical sampling and statistical resampling methods for gathering and making best use of data.

Specifically, you learned:

- Sampling is an active process of gathering observations with the intent of estimating a population variable.
- Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of a population parameter.
- Resampling methods, in fact, make use of a nested statistical sampling method.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Critical Values for Statistical Hypothesis Testing and How to Calculate Them in Python appeared first on Machine Learning Mastery.

Not all implementations of statistical tests return p-values. In some cases, you must use alternatives, such as critical values. In addition, critical values are used when estimating the expected intervals for observations from a population, such as in tolerance intervals.

In this tutorial, you will discover critical values, why they are important, how they are used, and how to calculate them in Python using SciPy.

After completing this tutorial, you will know:

- Examples of statistical hypothesis tests and their distributions from which critical values can be calculated and used.
- How exactly critical values are used on one-tail and two-tail statistical hypothesis tests.
- How to calculate critical values for the Gaussian, Student’s t, and Chi-Squared distributions.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Why Do We Need Critical Values?
- What Is a Critical Value?
- How to Use Critical Values
- How to Calculate Critical Values


Many statistical hypothesis tests return a p-value that is used to interpret the outcome of the test.

Some tests do not return a p-value, requiring an alternative method for interpreting the calculated test statistic directly.

A statistic calculated by a statistical hypothesis test can be interpreted using critical values from the distribution of the test statistic.

Some examples of statistical hypothesis tests and their distributions from which critical values can be calculated are as follows:

- **Z-Test**: Gaussian distribution.
- **Student t-Test**: Student’s t-distribution.
- **Chi-Squared Test**: Chi-Squared distribution.
- **ANOVA**: F-distribution.

Critical values are also used when defining intervals for expected (or unexpected) observations in distributions. Calculating and using critical values may be appropriate when quantifying the uncertainty of estimated statistics or intervals such as confidence intervals and tolerance intervals.

A critical value is defined in the context of the population distribution and a probability.

An observation from the population will have a value equal to or less than the critical value with the given probability.

We can express this mathematically as follows:

```
Pr[X <= critical value] = probability
```

Where *Pr* is the calculation of probability, *X* are observations from the population, *critical_value* is the calculated critical value, and *probability* is the chosen probability.

Critical values are calculated using a mathematical function where the probability is provided as an argument. For most common distributions, the value cannot be calculated analytically; instead it must be estimated using numerical methods. Historically it is common for tables of pre-calculated critical values to be provided in the appendices of statistics textbooks for reference purposes.

Critical values are used in statistical significance testing. The probability is often expressed as a significance, denoted as the lowercase Greek letter alpha (a), which is the inverted probability.

```
probability = 1 - alpha
```

Standard alpha values are used when calculating critical values, chosen for historical reasons and continually used for consistency reasons. These alpha values include:

- 1% (alpha=0.01)
- 5% (alpha=0.05)
- 10% (alpha=0.10)

Critical values provide an alternative and equivalent way to interpret statistical hypothesis tests to the p-value.

Calculated critical values are used as a threshold for interpreting the result of a statistical test.

The observation values in the population beyond the critical value are often called the “*critical region*” or the “*region of rejection*“.

Critical Value: A value appearing in tables for specified statistical tests indicating at what computed value the null hypothesis can be rejected (the computed statistic falls in the rejection region).

— Page 265, Handbook of Research Methods: A Guide for Practitioners and Students in the Social Sciences, 2003.

A statistical test may be one-tailed or two-tailed.

A one-tailed test has a single critical value, such as on the left or the right of the distribution.

Often, a one-tailed test has a critical value on the right of the distribution for non-symmetrical distributions (such as the Chi-Squared distribution).

The statistic is compared to the calculated critical value. If the statistic is less than the critical value, we fail to reject the null hypothesis of the statistical test; otherwise, the null hypothesis is rejected.

We can summarize this interpretation as follows:

- **Test Statistic < Critical Value**: Fail to reject the null hypothesis of the statistical test.
- **Test Statistic >= Critical Value**: Reject the null hypothesis of the statistical test.

A two-tailed test has two critical values, one on each side of the distribution, which is often assumed to be symmetrical (e.g. the Gaussian and Student’s t-distributions).

When using a two-tailed test, a significance level (or alpha) used in the calculation of the critical values must be divided by 2. The critical value will then use a portion of this alpha on each side of the distribution.

To make this concrete, consider an alpha of 5%. This would be split to give two alpha values of 2.5% on either side of the distribution with an acceptance area in the middle of the distribution of 95%.

We can refer to each critical value as the lower and upper critical values for the left and right of the distribution respectively. Test statistic values greater than or equal to the lower critical value and less than or equal to the upper critical value indicate a failure to reject the null hypothesis, whereas test statistic values less than the lower critical value or greater than the upper critical value indicate rejection of the null hypothesis for the test.

We can summarize this interpretation as follows:

- **Lower CR < Test Statistic < Upper CR**: Fail to reject the null hypothesis of the statistical test.
- **Test Statistic <= Lower CR OR Test Statistic >= Upper CR**: Reject the null hypothesis of the statistical test.

If the distribution of the test statistic is symmetric around a mean of zero, then we can shortcut the check by comparing the absolute (positive) value of the test statistic to the upper critical value.

- **|Test Statistic| < Upper Critical Value**: Fail to reject the null hypothesis of the statistical test.
- **|Test Statistic| >= Upper Critical Value**: Reject the null hypothesis of the statistical test.

Where *|Test Statistic|* is the absolute value of the calculated test statistic.
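As a brief sketch of this shortcut for a Gaussian-distributed statistic, using SciPy; the test statistic of 2.3 here is a made-up value for illustration:

```python
from scipy.stats import norm

# significance level for a two-tailed test
alpha = 0.05
# split alpha across both tails when calculating the critical value
upper_cr = norm.ppf(1.0 - alpha / 2.0)
print(upper_cr)  # about 1.96

statistic = 2.3  # hypothetical calculated test statistic
if abs(statistic) < upper_cr:
    print('Fail to reject the null hypothesis')
else:
    print('Reject the null hypothesis')
```

Because 2.3 exceeds the upper critical value of about 1.96, the null hypothesis would be rejected at the 5% level.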

Density functions return the probability of an observation in the distribution. Recall the definitions of the PDF and CDF as follows:

- **Probability Density Function (PDF)**: Returns the probability for an observation having a specific value from the distribution.
- **Cumulative Density Function (CDF)**: Returns the probability for an observation equal to or lesser than a specific value from the distribution.

In order to calculate a critical value, we require a function that, given a probability (or significance), will return the observation value from the distribution.

Specifically, we require the inverse of the cumulative density function: given a probability, it returns the observation value at or below which that proportion of observations from the distribution falls. This is called the percent point function (PPF), or more generally the quantile function.

**Percent Point Function (PPF)**: Returns the observation value at or below which the provided proportion (probability) of the distribution lies.

Specifically, a value from the distribution will equal or be less than the value returned from the PPF with the specified probability.

Let’s make this concrete with three distributions from which it is commonly required to calculate critical values. Namely, the Gaussian distribution, Student’s t-distribution, and the Chi-squared distribution.

We can calculate the percent point function in SciPy using the *ppf()* function on a given distribution. Note that the same value can also be obtained via the inverse survival function, *isf()*, by providing *1 - p* as the argument; this is mentioned as you may see this alternate approach used in third-party code.
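A quick check of the equivalence between the two functions on the standard Gaussian distribution:

```python
from scipy.stats import norm

p = 0.95
print(norm.ppf(p))        # about 1.6449
print(norm.isf(1.0 - p))  # the same critical value via the survival function
```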

The example below calculates the percent point function for 95% on the standard Gaussian distribution.

```python
# Gaussian Percent Point Function
from scipy.stats import norm
# define probability
p = 0.95
# retrieve value <= probability
value = norm.ppf(p)
print(value)
# confirm with cdf
p = norm.cdf(value)
print(p)
```

Running the example first prints the value of about 1.65 below which 95% of observations from the distribution fall. This value is then confirmed by retrieving the probability of the observation from the CDF, which returns 95%, as expected.

The value of about 1.65 is consistent with our expectation from the 68–95–99.7 rule: it is slightly less than the roughly 2 standard deviations that bound a 95% two-tailed interval, because here the 95% is one-tailed, lying entirely to the left of the value.

```
1.6448536269514722
0.95
```

The example below calculates the percent point function for 95% on the Student’s t-distribution with 10 degrees of freedom.

```python
# Student t-distribution Percent Point Function
from scipy.stats import t
# define probability
p = 0.95
df = 10
# retrieve value <= probability
value = t.ppf(p, df)
print(value)
# confirm with cdf
p = t.cdf(value, df)
print(p)
```

Running the example returns the value of about 1.812 or less that covers 95% of the observations from the chosen distribution. The probability of the value is then confirmed (with minor rounding error) via the CDF.

```
1.8124611228107335
0.949999999999923
```

The example below calculates the percent point function for 95% on the Chi-Squared distribution with 10 degrees of freedom.

```python
# Chi-Squared Percent Point Function
from scipy.stats import chi2
# define probability
p = 0.95
df = 10
# retrieve value <= probability
value = chi2.ppf(p, df)
print(value)
# confirm with cdf
p = chi2.cdf(value, df)
print(p)
```

Running the example first calculates the value of 18.3 or less that covers 95% of the observations from the distribution. The probability of this observation is confirmed by using it as input to the CDF.

```
18.307038053275143
0.95
```

This section provides more resources on the topic if you are looking to go deeper.

- Critical value on Wikipedia
- P-value on Wikipedia
- One- and two-tailed tests on Wikipedia
- Quantile function on Wikipedia
- 68–95–99.7 rule on Wikipedia

In this tutorial, you discovered critical values, why they are important, how they are used, and how to calculate them in Python using SciPy.

Specifically, you learned:

- Examples of statistical hypothesis tests and their distributions from which critical values can be calculated and used.
- How exactly critical values are used on one-tail and two-tail statistical hypothesis tests.
- How to calculate critical values for the Gaussian, Student’s t, and Chi-Squared distributions.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Critical Values for Statistical Hypothesis Testing and How to Calculate Them in Python appeared first on Machine Learning Mastery.

The distribution provides a parameterized mathematical function that can be used to calculate the probability for any individual observation from the sample space. This distribution describes the grouping or the density of the observations, called the probability density function. We can also calculate the likelihood of an observation having a value equal to or lesser than a given value. A summary of these relationships between observations is called a cumulative density function.

In this tutorial, you will discover the Gaussian and related distribution functions and how to calculate probability and cumulative density functions for each.

After completing this tutorial, you will know:

- A gentle introduction to standard distributions to summarize the relationship of observations.
- How to calculate and plot probability and density functions for the Gaussian distribution.
- The Student t and Chi-squared distributions related to the Gaussian distribution.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Distributions
- Gaussian Distribution
- Student’s t-Distribution
- Chi-Squared Distribution

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

From a practical perspective, we can think of a distribution as a function that describes the relationship between observations in a sample space.

For example, we may be interested in the age of humans, with individual ages representing observations in the domain, and ages 0 to 125 the extent of the sample space. The distribution is a mathematical function that describes the relationship of observations of different ages.

A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.

— Page 6, Statistics in Plain English, Third Edition, 2010.

Many data conform to well-known and well-understood mathematical functions, such as the Gaussian distribution. A function can fit the data with a modification of the parameters of the function, such as the mean and standard deviation in the case of the Gaussian.

Once a distribution function is known, it can be used as a shorthand for describing and calculating related quantities, such as likelihoods of observations, and plotting the relationship between observations in the domain.

Distributions are often described in terms of their density or density functions.

Density functions describe how the proportion of data, or the likelihood of observations, changes over the range of the distribution.

Two types of density functions are probability density functions and cumulative density functions.

- **Probability Density Function**: calculates the probability of observing a given value.
- **Cumulative Density Function**: calculates the probability of an observation equal to or less than a value.

A probability density function, or PDF, can be used to calculate the likelihood of a given observation in a distribution. It can also be used to summarize the likelihood of observations across the distribution’s sample space. Plots of the PDF show the familiar shape of a distribution, such as the bell-curve for the Gaussian distribution.

Distributions are often defined in terms of their probability density functions with their associated parameters.

A cumulative density function, or CDF, is a different way of thinking about the likelihood of observed values. Rather than calculating the likelihood of a given observation as with the PDF, the CDF calculates the cumulative likelihood for the observation and all prior observations in the sample space. It allows you to quickly understand and comment on how much of the distribution lies before and after a given value. A CDF is often plotted as a curve from 0 to 1 for the distribution.

Both PDFs and CDFs are continuous functions. The equivalent of a PDF for a discrete distribution is called a probability mass function, or PMF.

Next, let’s look at the Gaussian distribution and two other distributions related to the Gaussian that you will encounter when using statistical methods. We will look at each in turn in terms of their parameters, probability, and cumulative density functions.

The Gaussian distribution, named for Carl Friedrich Gauss, is the focus of much of the field of statistics.

Data from many fields of study surprisingly can be described using a Gaussian distribution, so much so that the distribution is often called the “*normal*” distribution because it is so common.

A Gaussian distribution can be described using two parameters:

- **mean**: Denoted by the Greek lowercase letter mu, it is the expected value of the distribution.
- **variance**: Denoted by the Greek lowercase letter sigma raised to the second power (because the units of the variable are squared), it describes the spread of observations from the mean.

It is common to use a normalized calculation of the variance called the standard deviation:

- **standard deviation**: Denoted by the Greek lowercase letter sigma, it describes the normalized spread of observations from the mean.

We can work with the Gaussian distribution via the norm SciPy module. The norm.pdf() function can be used to create a Gaussian probability density function with a given sample space, mean, and standard deviation.

The example below creates a Gaussian PDF with a sample space from -5 to 5, a mean of 0, and a standard deviation of 1. A Gaussian with these values for the mean and standard deviation is called the Standard Gaussian.

```python
# plot the gaussian pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
mean = 0.0
stdev = 1.0
# calculate the pdf
pdf = norm.pdf(sample_space, mean, stdev)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
```

Running the example creates a line plot showing the sample space on the x-axis and the likelihood of each value on the y-axis. The line plot shows the familiar bell shape of the Gaussian distribution.

The top of the bell shows the most likely value from the distribution, called the expected value or the mean, which in this case is zero, as we specified in creating the distribution.

The norm.cdf() function can be used to create a Gaussian cumulative density function.

The example below creates a Gaussian CDF for the same sample space.

```python
# plot the gaussian cdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
# calculate the cdf
cdf = norm.cdf(sample_space)
# plot
pyplot.plot(sample_space, cdf)
pyplot.show()
```

Running the example creates a plot showing an S-shape with the sample space on the x-axis and the cumulative probability on the y-axis.

We can see that a value of 2 covers close to 100% of the observations, with only a very thin tail of the distribution beyond that point.

We can also see that the mean value of zero shows 50% of the observations before and after that point.
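These two observations can be confirmed directly from the Gaussian CDF:

```python
from scipy.stats import norm

print(norm.cdf(0))  # 0.5, half the observations fall below the mean
print(norm.cdf(2))  # about 0.977, a value of 2 covers most of the distribution
```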

The Student’s t-distribution, or just t-distribution for short, is named for the pseudonym “Student” by William Sealy Gosset.

It is a distribution that arises when attempting to estimate the mean of a normal distribution with different sized samples. As such, it is a helpful shortcut when describing uncertainty or error related to estimating population statistics for data drawn from Gaussian distributions when the size of the sample must be taken into account.

Although you may not use the Student’s t-distribution directly, you may estimate values from the distribution required as parameters in other statistical methods, such as statistical significance tests.

The distribution can be described using a single parameter:

**number of degrees of freedom**: Denoted by the lowercase Greek letter nu (v).

Key to the use of the t-distribution is knowing the desired number of degrees of freedom.

The number of degrees of freedom describes the number of pieces of information used to describe a population quantity. For example, the mean has *n* degrees of freedom as all *n* observations in the sample are used to calculate the estimate of the population mean. A statistical quantity that makes use of another statistical quantity in its calculation must subtract 1 from the degrees of freedom, such as the use of the mean in the calculation of the sample variance.
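For example, the difference between *n* and *n - 1* degrees of freedom can be seen in NumPy's variance calculation; the small array here is an arbitrary illustration:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
# population-style variance divides by n
print(np.var(x))          # 5.0
# sample variance divides by n - 1, because the mean is estimated from the same data
print(np.var(x, ddof=1))  # about 6.667
```

The `ddof` (delta degrees of freedom) argument subtracts one degree of freedom for the estimated mean.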

Observations in a Student’s t-distribution are calculated from observations in a normal distribution in order to describe the interval for the population’s mean in the normal distribution. Observations are calculated as:

data = (mean(x) - mu) / (S / sqrt(n))

Where *mean(x)* is the mean of a sample *x* of *n* observations drawn from the Gaussian distribution, *mu* is the population mean, and *S* is the sample standard deviation. The resulting observations form the t-observation with (*n - 1*) degrees of freedom.

In practice, if you require a value from a t-distribution in the calculation of a statistic, then the number of degrees of freedom will likely be *n – 1*, where *n* is the size of your sample drawn from a Gaussian distribution.
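To make the *n - 1* relationship concrete, the construction above can be simulated; this is a sketch, with the population mean of 5 and standard deviation of 2 being arbitrary choices for illustration:

```python
# simulate the t construction from repeated small Gaussian samples
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n = 10
stats = []
for _ in range(50000):
    x = rng.normal(loc=5.0, scale=2.0, size=n)
    # (sample mean - population mean) / (sample std / sqrt(n))
    stats.append((x.mean() - 5.0) / (x.std(ddof=1) / np.sqrt(n)))
# the empirical 95th percentile matches the t-distribution's PPF
print(np.percentile(stats, 95))
print(t.ppf(0.95, n - 1))  # about 1.833
```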

Which specific distribution you use for a given problem depends on the size of your sample.

— Page 93, Statistics in Plain English, Third Edition, 2010.

SciPy provides tools for working with the t-distribution in the stats.t module. The *t.pdf()* function can be used to create a Student t-distribution with the specified degrees of freedom.

The example below creates a t-distribution using the sample space from -5 to 5 and (10,000 – 1) degrees of freedom.

```python
# plot the t-distribution pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
dof = len(sample_space) - 1
# calculate the pdf
pdf = t.pdf(sample_space, dof)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
```

Running the example creates and plots the t-distribution PDF.

We can see the familiar bell-shape to the distribution much like the normal. A key difference is the fatter tails in the distribution, highlighting the increased likelihood of observations in the tails compared to that of the Gaussian.

The *t.cdf()* function can be used to create the cumulative density function for the t-distribution. The example below creates the CDF over the same range as above.

```python
# plot the t-distribution cdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
dof = len(sample_space) - 1
# calculate the cdf
cdf = t.cdf(sample_space, dof)
# plot
pyplot.plot(sample_space, cdf)
pyplot.show()
```

Running the example, we see the familiar S-shaped curve as we see with the Gaussian distribution, although with slightly softer transitions from zero-probability to one-probability for the fatter tails.

The chi-squared distribution is denoted as the lowercase Greek letter chi (X) raised to the second power (X^2).

Like the Student’s t-distribution, the chi-squared distribution is also used in statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty. For example, the chi-squared distribution is used in the chi-squared statistical tests for independence. In fact, the chi-squared distribution is used in the derivation of the Student’s t-distribution.

The chi-squared distribution has one parameter:

- **degrees of freedom**: Denoted *k*.

An observation in a chi-squared distribution is calculated as the sum of *k* squared observations drawn from a Gaussian distribution.

chi = sum x[i]^2 for i=1 to k.

Where *chi* is an observation that has a chi-squared distribution, *x* are observations drawn from a Gaussian distribution, and *k* is the number of *x* observations, which is also the number of degrees of freedom for the chi-squared distribution.

Again, as with the Student’s t-distribution, data does not fit a chi-squared distribution; instead, observations are drawn from this distribution in the calculation of statistical methods for a sample of Gaussian data.
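This construction can be checked with a quick simulation; the seed and sample count here are arbitrary, and *k* is set to 20 to match the plots that follow:

```python
# build chi-squared observations as sums of k squared standard Gaussian draws
import numpy as np

rng = np.random.default_rng(1)
k = 20
normals = rng.normal(size=(100000, k))
chi = (normals ** 2).sum(axis=1)
# the mean of a chi-squared distribution equals its degrees of freedom
print(chi.mean())  # close to 20
```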

SciPy provides the stats.chi2 module for calculating statistics for the chi-squared distribution. The chi2.pdf() function can be used to calculate the chi-squared distribution for a sample space between 0 and 50 with 20 degrees of freedom. Recall that the sum of squared values must be positive, hence the need for a positive sample space.

```python
# plot the chi-squared pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import chi2
# define the distribution parameters
sample_space = arange(0, 50, 0.01)
dof = 20
# calculate the pdf
pdf = chi2.pdf(sample_space, dof)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
```

Running the example calculates the chi-squared PDF and presents it as a line plot.

With 20 degrees of freedom, we can see that the peak of the distribution sits just short of the value 20 on the sample space. This is intuitive: each squared observation from the standard Gaussian contributes 1 on average, so the expected value of the distribution equals the number of degrees of freedom, in this case 20, while the peak (the mode) sits slightly lower, at k - 2, or 18.

Although the distribution has a bell-like shape, the distribution is not symmetric.

The chi2.cdf() function can be used to calculate the cumulative density function over the same sample space.

```python
# plot the chi-squared cdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import chi2
# define the distribution parameters
sample_space = arange(0, 50, 0.01)
dof = 20
# calculate the cdf
cdf = chi2.cdf(sample_space, dof)
# plot
pyplot.plot(sample_space, cdf)
pyplot.show()
```

Running the example creates a plot of the cumulative density function for the chi-squared distribution.

The plot helps to see the bulk of the cumulative probability accumulating around the chi-squared value of 20, with the fat tail to the right of the distribution continuing well past the end of the plot.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Recreate the PDF and CDF plots for one distribution with a new sample space.
- Calculate and plot the PDF and CDF for the Cauchy and Laplace distributions.
- Look up and implement the equations for the PDF and CDF for one distribution from scratch.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Statistics in Plain English, Third Edition, 2010.

- Probability density function on Wikipedia
- Cumulative distribution function on Wikipedia
- Probability mass function on Wikipedia
- Normal distribution on Wikipedia
- Student’s t-distribution on Wikipedia
- Chi-squared distribution on Wikipedia

In this tutorial, you discovered the Gaussian and related distribution functions and how to calculate probability and cumulative density functions for each.

Specifically, you learned:

- A gentle introduction to standard distributions to summarize the relationship of observations.
- How to calculate and plot probability and density functions for the Gaussian distribution.

- The Student t and Chi-squared distributions related to the Gaussian distribution.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Statistical Data Distributions appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Data Visualization Methods in Python appeared first on Machine Learning Mastery.

Being able to quickly visualize your data samples for yourself and others is an important skill both in applied statistics and in applied machine learning.

In this tutorial, you will discover the five types of plots that you will need to know when visualizing data in Python and how to use them to better understand your own data.

After completing this tutorial, you will know:

- How to chart time series data with line plots and categorical quantities with bar charts.
- How to summarize data distributions with histograms and box plots.
- How to summarize the relationship between variables with scatter plots.

Let’s get started.

This tutorial is divided into 7 parts; they are:

- Data Visualization
- Introduction to Matplotlib
- Line Plot
- Bar Chart
- Histogram Plot
- Box and Whisker Plot
- Scatter Plot

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Data visualization is an important skill in applied statistics and machine learning.

Statistics does indeed focus on quantitative descriptions and estimations of data. Data visualization provides an important suite of tools for gaining a qualitative understanding.

This can be helpful when exploring and getting to know a dataset and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts that are more visceral to yourself and stakeholders than measures of association or significance.

Data visualization and exploratory data analysis are whole fields themselves, and I will recommend a deeper dive into some of the books mentioned at the end. In this tutorial, let’s look at basic charts and plots you can use to better understand your data.

There are five key plots that you need to know well for basic data visualization. They are:

- Line Plot
- Bar Chart
- Histogram Plot
- Box and Whisker Plot
- Scatter Plot

With a knowledge of these plots, you can quickly get a qualitative understanding of most data that you come across.

For the rest of this tutorial, we will take a closer look at each plot type.

There are many excellent plotting libraries in Python and I recommend exploring them in order to create presentable graphics.

For quick and dirty plots intended for your own use, I recommend using the matplotlib library. It is the foundation for many other plotting libraries and plotting support in higher-level libraries such as Pandas.

Matplotlib provides a context, one in which one or more plots can be drawn before the image is shown or saved to file. The context can be accessed via functions on *pyplot*. The context can be imported as follows:

```python
from matplotlib import pyplot
```

A common convention is to import this context and name it *plt*; for example:

```python
import matplotlib.pyplot as plt
```

We will not use this convention; instead, we will stick to the standard Python import convention.

Charts and plots are made by calling functions on this context; for example:

```python
pyplot.plot(...)
```

Elements such as axis, labels, legends, and so on can be accessed and configured on this context as separate function calls.
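For example, a title and axis labels can be configured on the context before showing the plot; the data and strings here are arbitrary illustrations:

```python
from matplotlib import pyplot
# draw a simple line plot on the context
pyplot.plot([1, 2, 3], [1, 4, 9])
# configure chart elements via separate function calls
pyplot.title('Example Plot')
pyplot.xlabel('x')
pyplot.ylabel('x squared')
pyplot.show()
```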

The drawings on the context can be shown in a new window by calling the show() function:

```python
# display the plot
pyplot.show()
```

Alternately, the drawings on the context can be saved to file, such as a PNG formatted image file. The savefig() function can be used to save images.

```python
pyplot.savefig('my_image.png')
```

This is the most basic crash course for using the matplotlib library.

For more detail, see the User Guide and the resources at the end of the tutorial.

A line plot is generally used to present observations collected at regular intervals.

The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.

A line plot can be created by calling the plot() function and passing the x-axis data for the regular interval, and y-axis for the observations.

```python
# create line plot
pyplot.plot(x, y)
```

Line plots are useful for presenting time series data as well as any sequence data where there is an ordering between observations.

The example below creates a sequence of 100 floating point values as the x-axis and a sine wave as a function of the x-axis as the observations on the y-axis. The results are plotted as a line plot.

```python
# example of a line plot
from numpy import sin
from matplotlib import pyplot
# consistent interval for x-axis
x = [x*0.1 for x in range(100)]
# function of x for y-axis
y = sin(x)
# create line plot
pyplot.plot(x, y)
# show line plot
pyplot.show()
```

Running the example creates a line plot showing the familiar sine wave pattern on the y-axis across the x-axis with a consistent interval between observations.

A bar chart is generally used to present relative quantities for multiple categories.

The x-axis represents the categories, which are spaced evenly. The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.

A bar chart can be created by calling the bar() function and passing the category names for the x-axis and the quantities for the y-axis.

```python
# create bar chart
pyplot.bar(x, y)
```

Bar charts can be useful for comparing multiple point quantities or estimations.

The example below creates a dataset with three categories, each defined with a string label. A single random integer value is drawn for the quantity in each category.

```python
# example of a bar chart
from random import seed
from random import randint
from matplotlib import pyplot
# seed the random number generator
seed(1)
# names for categories
x = ['red', 'green', 'blue']
# quantities for each category
y = [randint(0, 100), randint(0, 100), randint(0, 100)]
# create bar chart
pyplot.bar(x, y)
# show bar chart
pyplot.show()
```

Running the example creates the bar chart showing the category labels on the x-axis and the quantities on the y-axis.

A histogram plot is generally used to summarize the distribution of a data sample.

The x-axis represents discrete bins or intervals for the observations. For example, observations with values between 1 and 10 may be split into five bins; the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on.

The y-axis represents the frequency or count of the number of observations in the dataset that belong to each bin.

Essentially, a data sample is transformed into a bar chart where each category on the x-axis represents an interval of observation values.
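The binning can be previewed numerically with NumPy's *histogram()* function before plotting; the ten-value dataset here is made up for illustration:

```python
# count observations per bin directly with NumPy
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
counts, edges = np.histogram(data, bins=5)
print(counts)  # [2 2 2 2 2], two observations fall into each bin
print(edges)   # the six bin edges, spanning 1.0 to 10.0
```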

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. […] The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) …

— Page 11, Applied Multivariate Statistical Analysis, 2015.

A histogram plot can be created by calling the hist() function and passing in a list or array that represents the data sample.

```python
# create histogram plot
pyplot.hist(x)
```

Histograms are valuable for summarizing the distribution of data samples.

The example below creates a dataset of 1,000 random numbers drawn from a standard Gaussian distribution, then plots the dataset as a histogram.

```python
# example of a histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = randn(1000)
# create histogram plot
pyplot.hist(x)
# show histogram plot
pyplot.show()
```

Running the example, we can see that the shape of the bars shows the bell-shaped curve of the Gaussian distribution. We can see that the function automatically chose the number of bins, in this case splitting the values into groups by integer value.

Often, careful choice of the number of bins can help to better expose the shape of the data distribution. The number of bins can be specified by setting the “*bins*” argument; for example:

```python
# create histogram plot
pyplot.hist(x, bins=100)
```

A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample.

The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired.

The y-axis represents the observation values. A box is drawn to summarize the middle 50% of the dataset, starting at the observation at the 25th percentile and ending at the 75th percentile. The median, or 50th percentile, is drawn with a line. The difference between the 75th and 25th percentiles is called the interquartile range, or IQR. Lines called whiskers are drawn extending from both ends of the box, with a length of 1.5 * the IQR, to demonstrate the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn with small circles.
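These statistics can be computed directly. The sketch below (our own addition) calculates the quartiles, IQR, whisker limits, and outlier count for a contrived Gaussian sample, mirroring what boxplot() draws.

```python
# compute the statistics behind a box and whisker plot
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

# seed the random number generator
seed(1)
# contrived Gaussian sample
data = 5 * randn(100) + 50
# quartiles that define the box and the median line
q25, q50, q75 = percentile(data, [25, 50, 75])
iqr = q75 - q25
# whiskers extend 1.5 * IQR beyond the box
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
# observations beyond the whiskers are candidate outliers
outliers = [x for x in data if x < lower or x > upper]
print('IQR=%.3f, whiskers=[%.3f, %.3f], outliers=%d' % (iqr, lower, upper, len(outliers)))
```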

The boxplot is a graphical technique that displays the distribution of variables. It helps us see the location, skewness, spread, tail length and outlying points. […] The boxplot is a graphical representation of the Five Number Summary.

— Page 5, Applied Multivariate Statistical Analysis, 2015.

Boxplots can be drawn by calling the boxplot() function passing in the data sample as an array or list.

# create box and whisker plot
pyplot.boxplot(x)

Boxplots are useful for summarizing the distribution of a data sample as an alternative to the histogram. They can help you quickly get an idea of the range of common values (in the box) and sensible values (within the whiskers). Because we are not looking at the shape of the distribution explicitly, this method is often used when the data has an unknown or unusual distribution, such as non-Gaussian.

The example below creates three boxplots in one chart, each summarizing a data sample drawn from a slightly different Gaussian distribution. Each data sample is created as an array, and all three data sample arrays are added to a list that is passed to the plotting function.

# example of a box and whisker plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = [randn(1000), 5 * randn(1000), 10 * randn(1000)]
# create box and whisker plot
pyplot.boxplot(x)
# show the plot
pyplot.show()

Running the example creates a chart showing the three box and whisker plots. We can see that the same scale is used on the y-axis for each, making the first plot look squashed and the last plot look spread out.

In this case, we can see the black box for the middle 50% of the data, the orange line for the median, the lines for the whiskers summarizing the range of sensible data, and finally dots for the possible outliers.

A scatter plot (or ‘scatterplot’) is generally used to summarize the relationship between two paired data samples.

Paired data samples means that two measures were recorded for a given observation, such as the weight and height of a person.

The x-axis represents observation values for the first sample, and the y-axis represents the observation values for the second sample. Each point on the plot represents a single observation.

Scatterplots are bivariate or trivariate plots of variables against each other. They help us understand relationships among the variables of a dataset. A downward-sloping scatter indicates that as we increase the variable on the horizontal axis, the variable on the vertical axis decreases.

— Page 19, Applied Multivariate Statistical Analysis, 2015.

Scatter plots can be created by calling the scatter() function and passing the two data sample arrays.

# create scatter plot
pyplot.scatter(x, y)

Scatter plots are useful for showing the association or correlation between two variables. The correlation can be quantified and summarized, such as with a line of best fit, which can be drawn as a line plot on the same chart to make the relationship clearer.
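As a sketch of this idea (the choice of a degree-1 polyfit() fit is ours, not something shown in the tutorial), a line of best fit can be computed with NumPy and overlaid on the scatter plot.

```python
# scatter plot with an overlaid line of best fit
from numpy.random import seed
from numpy.random import randn
from numpy import polyfit, linspace
from matplotlib import pyplot

# seed the random number generator
seed(1)
# two related variables
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# fit a degree-1 polynomial (a straight line) to the paired samples
slope, intercept = polyfit(x, y, 1)
# draw the observations and the fitted line
pyplot.scatter(x, y)
xs = linspace(x.min(), x.max(), 100)
pyplot.plot(xs, slope * xs + intercept, color='red')
pyplot.show()
```

Because y was constructed as x plus noise, the fitted slope comes out close to 1.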

A dataset may have more than two measures (variables or columns) for a given observation. A scatter plot matrix is a chart containing scatter plots for each pair of variables in a dataset with more than two variables.
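Assuming pandas is available (it is not used elsewhere in this tutorial), a scatter plot matrix can be drawn with its scatter_matrix() helper; the three related variables below are contrived for illustration.

```python
# scatter plot matrix for a dataset with more than two variables
from numpy.random import seed
from numpy.random import randn
from pandas import DataFrame
from pandas.plotting import scatter_matrix
from matplotlib import pyplot

# seed the random number generator
seed(1)
# three related variables: y increases with x, z decreases with x
x = 20 * randn(1000) + 100
df = DataFrame({'x': x, 'y': x + 10 * randn(1000), 'z': -x + 10 * randn(1000)})
# one scatter plot per pair of variables, histograms on the diagonal
scatter_matrix(df)
pyplot.show()
```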

The example below creates two data samples that are related. The first is a sample of random numbers drawn from a standard Gaussian. The second is dependent upon the first by adding a second random Gaussian value to the value of the first measure.

# example of a scatter plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# first variable
x = 20 * randn(1000) + 100
# second variable
y = x + (10 * randn(1000) + 50)
# create scatter plot
pyplot.scatter(x, y)
# show the plot
pyplot.show()

Running the example creates the scatter plot, showing the positive relationship between the two variables.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Select one example and update it to use your own contrived dataset.
- Load a standard machine learning dataset and plot the variables.
- Write convenience functions to easily create plots for your data, including labels and legends.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Visualize Machine Learning Data in Python With Pandas
- Time Series Data Visualization with Python
- Data Visualization with the Caret R package

- The Visual Display of Quantitative Information, 2001.
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2017.
- Applied Multivariate Statistical Analysis, 2015.

- matplotlib library
- matplotlib User Guide
- matplotlib.pyplot() API
- matplotlib.pyplot.show() API
- matplotlib.pyplot.savefig() API
- matplotlib.pyplot.plot() API
- matplotlib.pyplot.bar() API
- matplotlib.pyplot.hist() API
- matplotlib.pyplot.boxplot() API
- matplotlib.pyplot.scatter() API

- Data visualization on Wikipedia
- Bar chart on Wikipedia
- Histogram on Wikipedia
- Box plot on Wikipedia
- Interquartile range on Wikipedia
- Scatter plot on Wikipedia

In this tutorial, you discovered a gentle introduction to data visualization in Python.

Specifically, you learned:

- How to chart time series data with line plots and categorical quantities with bar charts.
- How to summarize data distributions with histograms and boxplots.
- How to summarize the relationship between variables with scatter plots.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Data Visualization Methods in Python appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Estimation Statistics for Machine Learning appeared first on Machine Learning Mastery.

A group of methods referred to as “*new statistics*” are seeing increased use instead of or in addition to p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated values. This group of statistical methods is referred to as “*estimation statistics*“.

In this tutorial, you will discover a gentle introduction to estimation statistics as an alternate or complement to statistical hypothesis testing.

After completing this tutorial, you will know:

- Effect size methods involve quantifying the association or difference between samples.
- Interval estimate methods involve quantifying the uncertainty around point estimations.
- Meta-analyses involve quantifying the magnitude of an effect across multiple similar independent studies.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Problems with Hypothesis Testing
- Estimation Statistics
- Effect Size
- Interval Estimation
- Meta-Analysis

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Statistical hypothesis testing and the calculation of p-values is a popular way to present and interpret results.

Tests like the Student’s t-test can be used to describe whether or not two samples have the same distribution. They can help interpret whether the difference between two sample means is likely real or due to random chance.

Although they are widely used, they have some problems. For example:

- Calculated p-values are easily misused and misunderstood.
- With a large enough sample, there is always some statistically significant difference between samples, even if the difference is tiny and practically meaningless.

Interestingly, in the last few decades there has been a pushback against the use of p-values in the presentation of research. For example, in the 1990s, the journal Epidemiology banned the use of p-values. Many related areas in medicine and psychology have followed suit.

Although p-values may still be used, there is a push toward the presentation of results using estimation statistics.

Estimation statistics refers to methods that attempt to quantify a finding.

This might include quantifying the size of an effect or the amount of uncertainty for a specific outcome or result.

… ‘estimation statistics,’ a term describing the methods that focus on the estimation of effect sizes (point estimates) and their confidence intervals (precision estimates).

— Estimation statistics should replace significance testing, 2016.

Estimation statistics is a term that describes three main classes of methods; they are:

- **Effect Size**. Methods for quantifying the size of an effect given a treatment or intervention.
- **Interval Estimation**. Methods for quantifying the amount of uncertainty in a value.
- **Meta-Analysis**. Methods for quantifying the findings across multiple similar studies.

We will look at each of these groups of methods in more detail in the following sections.

Although they are not new, they are being called the “*new statistics*” given their increased use in research literature over statistical hypothesis testing.

The new statistics refer to estimation, meta-analysis, and other techniques that help researchers shift emphasis from [null hypothesis statistical tests]. The techniques are not new and are routinely used in some disciplines, but for the [null hypothesis statistical tests] disciplines, their use would be new and beneficial.

— Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2012.

The main reason for the shift from statistical hypothesis testing to estimation statistics is that the results are easier to analyze and interpret in the context of the domain or research question.

The quantified size of the effect and uncertainty allows claims to be made that are easier to understand and use. The results are more meaningful.

Knowing and thinking about the magnitude and precision of an effect is more useful to quantitative science than contemplating the probability of observing data of at least that extremity, assuming absolutely no effect.

— Estimation statistics should replace significance testing, 2016.

Where statistical hypothesis tests talk about whether the samples come from the same distribution or not, estimation statistics can describe the size and confidence of the difference. This allows you to comment on how different one method is from another.

Estimation thinking focuses on how big an effect is; knowing this is usually more valuable than knowing whether or not the effect is zero, which is the focus of dichotomous thinking. Estimation thinking prompts us to plan an experiment to address the question, “How much…?” or “To what extent…?” rather than only the dichotomous [null hypothesis statistical tests] question, “Is there an effect?”

— Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2012.

The effect size describes the magnitude of a treatment or difference between two samples.

A hypothesis test can comment on whether the difference between samples is the result of chance or is real, whereas an effect size puts a number on how much the samples differ.

Measuring the size of an effect is a big part of applied machine learning, and in fact, research in general.

I am sometimes asked, what do researchers do? The short answer is that we estimate the size of effects. No matter what phenomenon we have chosen to study we essentially spend our careers thinking up new and better ways to estimate effect magnitudes.

— Page 3, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

There are two main classes of techniques used to quantify the magnitude of effects; they are:

- **Association**. The degree to which two samples change together.
- **Difference**. The degree to which two samples are different.

For example, association effect sizes include calculations of correlation, such as the Pearson’s correlation coefficient, and the r^2 coefficient of determination. They may quantify the linear or monotonic way that observations in two samples change together.
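As a sketch of an association effect size (the contrived data is our own), the Pearson's correlation coefficient for two related samples can be calculated with the pearsonr() SciPy function, and r^2 follows by squaring it.

```python
# calculate an association effect size: Pearson's correlation
from numpy.random import seed
from numpy.random import randn
from scipy.stats import pearsonr

# seed the random number generator
seed(1)
# two related samples: y depends linearly on x plus noise
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)
# correlation coefficient and its p-value
corr, p = pearsonr(x, y)
print('Pearson correlation: %.3f' % corr)
print('r^2 coefficient of determination: %.3f' % corr**2)
```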

Difference effect size may include methods such as Cohen’s d statistic that provide a standardized measure for how the means of two populations differ. They seek a quantification for the magnitude of difference between observations in two samples.
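A minimal sketch of a difference effect size: Cohen's d using a pooled standard deviation across the two samples. The cohens_d() helper below is our own, not a library function, and the two samples are contrived so that their means differ by about half a standard deviation.

```python
# calculate a difference effect size: Cohen's d (sketch)
from numpy.random import seed
from numpy.random import randn
from numpy import mean, var
from math import sqrt

def cohens_d(d1, d2):
    # pooled (sample) standard deviation of the two samples
    n1, n2 = len(d1), len(d2)
    s1, s2 = var(d1, ddof=1), var(d2, ddof=1)
    s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # standardized difference between the means
    return (mean(d1) - mean(d2)) / s

# seed the random number generator
seed(1)
# two samples whose means differ by 5, with standard deviation 10
data1 = 10 * randn(10000) + 60
data2 = 10 * randn(10000) + 55
d = cohens_d(data1, data2)
print("Cohen's d: %.3f" % d)
```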

An effect can be the result of a treatment revealed in a comparison between groups (e.g., treated and untreated groups) or it can describe the degree of association between two related variables (e.g., treatment dosage and health).

— Page 4, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

Interval estimation refers to statistical methods for quantifying the uncertainty for an observation.

Intervals transform a point estimate into a range that provides more information about the estimate, such as its precision, making them easier to compare and interpret.

The point estimates are the dots, and the intervals indicate the uncertainty of those point estimates.

— Page 9, Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2012.

There are three main types of intervals that are commonly calculated. They are:

- **Tolerance Interval**: The bounds or coverage of a proportion of a distribution with a specific level of confidence.
- **Confidence Interval**: The bounds on the estimate of a population parameter.
- **Prediction Interval**: The bounds on a single observation.

A tolerance interval may be used to set expectations on observations in a population or help to identify outliers. A confidence interval can be used to interpret the range for a mean of a data sample that can become more precise as the sample size is increased. A prediction interval can be used to provide a range for a prediction or forecast from a model.

For example, when presenting the mean estimated skill of a model, a confidence interval can be used to provide bounds on the precision of the estimate. This could also be combined with p-values if models are being compared.
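A minimal sketch, assuming Gaussian data and the well-known 1.96 critical value for a 95% interval, of a confidence interval around an estimated mean (the data is contrived):

```python
# confidence interval for the mean of a sample (Gaussian assumption)
from numpy.random import seed
from numpy.random import randn
from numpy import mean, std
from math import sqrt

# seed the random number generator
seed(1)
# contrived sample with true mean 50 and standard deviation 5
data = 5 * randn(100) + 50
n = len(data)
sample_mean = mean(data)
# standard error of the mean
se = std(data, ddof=1) / sqrt(n)
# 1.96 is the Gaussian critical value for a 95% interval
lower, upper = sample_mean - 1.96 * se, sample_mean + 1.96 * se
print('mean=%.3f, 95%% CI [%.3f, %.3f]' % (sample_mean, lower, upper))
```

As the quote below notes, an interval like this conveys more useful information than a bare p-value, and the two can also be presented together.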

The confidence interval thus provides a range of possibilities for the population value, rather than an arbitrary dichotomy based solely on statistical significance. It conveys more useful information at the expense of precision of the P value. However, the actual P value is helpful in addition to the confidence interval, and preferably both should be presented. If one has to be excluded, however, it should be the P value.

— Confidence intervals rather than P values: estimation rather than hypothesis testing, 1986.

A meta-analysis refers to the weighted combination of the findings from multiple similar studies in order to quantify a broader cross-study effect.

Meta studies are useful when many small and similar studies have been performed with noisy and conflicting findings. Instead of taking the study conclusions at face value, statistical methods are used to combine multiple findings into a stronger finding than any single study.
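A minimal sketch of the standard fixed-effect, inverse-variance approach to such a combination; the study effect sizes and standard errors below are made up for illustration.

```python
# combine effect sizes across studies: fixed-effect, inverse-variance weighting
from math import sqrt

# (effect size, standard error) reported by three hypothetical studies
studies = [(0.42, 0.20), (0.55, 0.15), (0.31, 0.25)]
# more precise studies (smaller standard error) receive larger weights
weights = [1.0 / (se ** 2) for _, se in studies]
# weighted average effect size across the studies
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
# standard error of the pooled estimate
pooled_se = sqrt(1.0 / sum(weights))
print('pooled effect=%.3f, se=%.3f' % (pooled, pooled_se))
```

The pooled standard error is smaller than that of any single study, which is the sense in which the combined finding is stronger.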

… better known as meta-analysis, completely ignores the conclusions that others have drawn and looks instead at the effects that have been observed. The aim is to combine these independent observations into an average effect size and draw an overall conclusion regarding the direction and magnitude of real-world effects.

— Page 90, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

Although not often used in applied machine learning, it is useful to note meta-analyses as they form part of this thrust toward new statistical methods.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Describe three examples of how estimation statistics can be used in a machine learning project.
- Locate and summarize three criticisms against the use of statistical hypothesis testing.
- Search and locate three research papers that make use of interval estimates.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2012.
- Introduction to the New Statistics: Estimation, Open Science, and Beyond, 2016.
- The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, 2010.

- Estimation statistics should replace significance testing, 2016.
- Confidence intervals rather than P values: estimation rather than hypothesis testing, 1986.

- Estimation statistics on Wikipedia
- Effect size on Wikipedia
- Interval estimation on Wikipedia
- Meta-analysis on Wikipedia

In this tutorial, you discovered a gentle introduction to estimation statistics as an alternate or complement to statistical hypothesis testing.

Specifically, you learned:

- Effect size methods involve quantifying the association or difference between samples.
- Interval estimate methods involve quantifying the uncertainty around point estimations.
- Meta-analyses involve quantifying the magnitude of an effect across multiple similar independent studies.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Statistical Tolerance Intervals in Machine Learning appeared first on Machine Learning Mastery.

These bounds can be used to help identify anomalies and set expectations for what to expect. A bound on observations from a population is called a tolerance interval.

A tolerance interval is different from a prediction interval that quantifies the uncertainty for a single predicted value. It is also different from a confidence interval that quantifies the uncertainty of a population parameter such as a mean. Instead, a tolerance interval covers a proportion of the population distribution.

In this tutorial, you will discover statistical tolerance intervals and how to calculate a tolerance interval for Gaussian data.

After completing this tutorial, you will know:

- That statistical tolerance intervals provide bounds on observations from a population.
- That a tolerance interval requires that both a coverage proportion and confidence be specified.
- That the tolerance interval for a data sample with a Gaussian distribution can be easily calculated.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Bounds on Data
- What Are Statistical Tolerance Intervals?
- How to Calculate Tolerance Intervals
- Tolerance Interval for Gaussian Distribution

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

It is useful to put bounds on data.

For example, if you have a sample of data from a domain, knowing the upper and lower bound for normal values can be helpful for identifying anomalies or outliers in the data.

For a process or model that is making predictions, it can be helpful to know the expected range that sensible predictions may take.

Knowing the common range of values can help with setting expectations and detecting anomalies.

The range of common values for data is called a tolerance interval.

The tolerance interval is a bound on an estimate of the proportion of data in a population.

A statistical tolerance interval [contains] a specified proportion of the units from the sampled population or process.

— Page 3, Statistical Intervals: A Guide for Practitioners and Researchers, 2017.

The interval is limited by the sampling error and by the variance of the population distribution. Given the law of large numbers, as the sample size is increased, the probabilities will better match the underlying population distribution.

Below is an example of a stated tolerance interval:

*The range from x to y covers 95% of the data with a confidence of 99%.*

If the data is Gaussian, the interval can be expressed in the context of the mean value; for example:

*x +/- y covers 95% of the data with a confidence of 99%.*

We refer to these intervals as statistical tolerance intervals, to differentiate them from tolerance intervals in engineering that describe limits of acceptability, such as for a design or of a material. Generally, we will describe them as simply “tolerance intervals” for convenience.

A tolerance interval is defined in terms of two quantities:

- **Coverage**: The proportion of the population covered by the interval.
- **Confidence**: The probabilistic confidence that the interval covers the proportion of the population.

The tolerance interval is constructed from data using two coefficients, the coverage and the tolerance coefficient. The coverage is the proportion of the population (p) that the interval is supposed to contain. The tolerance coefficient is the degree of confidence with which the interval reaches the specified coverage. A tolerance interval with coverage of 95% and a tolerance coefficient of 90% will contain 95% of the population distribution with a confidence of 90%.

— Page 175, Statistics for Environmental Engineers, Second Edition, 2002.

The size of a tolerance interval is proportional to the size of the data sample from the population and the variance of the population.

There are two main methods for calculating tolerance intervals depending on the distribution of data: parametric and nonparametric methods.

- **Parametric Tolerance Interval**: Use knowledge of the population distribution in specifying both the coverage and confidence. Often used to refer to a Gaussian distribution.
- **Nonparametric Tolerance Interval**: Use rank statistics to estimate the coverage and confidence, often resulting in less precision (wider intervals) given the lack of information about the distribution.

Tolerance intervals are relatively straightforward to calculate for a sample of independent observations drawn from a Gaussian distribution. We will demonstrate this calculation in the next section.

In this section, we will work through an example of calculating the tolerance intervals on a data sample.

First, let’s define our data sample. We will create a sample of 100 observations drawn from a Gaussian distribution with a mean of 50 and a standard deviation of 5.

# generate dataset
data = 5 * randn(100) + 50

During the example, we will assume that we are unaware of the true population mean and standard deviation, and that these values must be estimated.

Because the population parameters have to be estimated, there is additional uncertainty. For example, for a 95% coverage, we could use 1.96 (or 2) standard deviations from the estimated mean as the tolerance interval. We must estimate the mean and standard deviation from the sample and take this uncertainty into account, therefore the calculation of the interval is slightly more complex.

Next, we must specify the number of degrees of freedom. This will be used in the calculation of critical values and in the calculation of the interval. Specifically, it is used in the calculation of the standard deviation.

Remember that the degrees of freedom are the number of values in the calculation that can vary. Here, we have 100 observations. We do not know the population mean, so it must be estimated from the sample; using this estimate in the standard deviation calculation costs one degree of freedom. This means our degrees of freedom will be (N – 1), or 99.

# specify degrees of freedom
n = len(data)
dof = n - 1

Next, we must specify the proportional coverage of the data. In this example, we are interested in the middle 95% of the data, i.e. a proportion of 0.95. We must shift this proportion so that it covers the middle 95%: that is, from the 2.5th percentile to the 97.5th percentile.

We know that the critical value for 95% is 1.96 given that we use it so often; nevertheless, we can calculate it directly in Python by passing the lower-tail percentage, 2.5% (0.025), to the inverse survival function. This can be calculated using the norm.isf() SciPy function.

# specify data coverage
prop = 0.95
prop_inv = (1.0 - prop) / 2.0
gauss_critical = norm.isf(prop_inv)

Next, we need to calculate the confidence of the coverage. We can do this by retrieving the critical value from the Chi Squared distribution for the given number of degrees of freedom and desired probability. We can use the chi2.isf() SciPy function.

# specify confidence
prob = 0.99
chi_critical = chi2.isf(q=prob, df=dof)

We now have all of the pieces to calculate the Gaussian tolerance interval. The calculation is as follows:

interval = sqrt((dof * (1 + (1/n)) * gauss_critical^2) / chi_critical)

Where *dof* is the number of degrees of freedom, *n* is the size of the data sample, *gauss_critical* is the critical value, such as 1.96 for 95% coverage of the population, and *chi_critical* is the Chi Squared critical value for the desired confidence and degrees of freedom.

# calculate the tolerance interval
interval = sqrt((dof * (1 + (1/n)) * gauss_critical**2) / chi_critical)

We can tie all of this together and calculate the Gaussian tolerance interval for our data sample.

The complete example is listed below.

# parametric tolerance interval
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import sqrt
from scipy.stats import chi2
from scipy.stats import norm
# seed the random number generator
seed(1)
# generate dataset
data = 5 * randn(100) + 50
# specify degrees of freedom
n = len(data)
dof = n - 1
# specify data coverage
prop = 0.95
prop_inv = (1.0 - prop) / 2.0
gauss_critical = norm.isf(prop_inv)
print('Gaussian critical value: %.3f (coverage=%d%%)' % (gauss_critical, prop*100))
# specify confidence
prob = 0.99
chi_critical = chi2.isf(q=prob, df=dof)
print('Chi-Squared critical value: %.3f (prob=%d%%, dof=%d)' % (chi_critical, prob*100, dof))
# tolerance
interval = sqrt((dof * (1 + (1/n)) * gauss_critical**2) / chi_critical)
print('Tolerance Interval: %.3f' % interval)
# summarize
data_mean = mean(data)
lower, upper = data_mean-interval, data_mean+interval
print('%.2f to %.2f covers %d%% of data with a confidence of %d%%' % (lower, upper, prop*100, prob*100))

Running the example first calculates and prints the relevant critical values for the Gaussian and Chi-Squared distributions. The tolerance interval is then printed and presented in context.

Gaussian critical value: 1.960 (coverage=95%)
Chi-Squared critical value: 69.230 (prob=99%, dof=99)
Tolerance Interval: 2.355
47.95 to 52.66 covers 95% of data with a confidence of 99%

It can also be helpful to demonstrate how the tolerance interval will decrease (become more precise) as the size of the sample is increased.

The example below demonstrates this by calculating the tolerance interval for different sample sizes for the same small contrived problem.

# plot tolerance interval vs sample size
from numpy.random import seed
from numpy.random import randn
from numpy import sqrt
from scipy.stats import chi2
from scipy.stats import norm
from matplotlib import pyplot
# seed the random number generator
seed(1)
# sample sizes
sizes = range(5,15)
for n in sizes:
    # generate dataset
    data = 5 * randn(n) + 50
    # calculate degrees of freedom
    dof = n - 1
    # specify data coverage
    prop = 0.95
    prop_inv = (1.0 - prop) / 2.0
    gauss_critical = norm.isf(prop_inv)
    # specify confidence
    prob = 0.99
    chi_critical = chi2.isf(q=prob, df=dof)
    # tolerance
    tol = sqrt((dof * (1 + (1/n)) * gauss_critical**2) / chi_critical)
    # plot
    pyplot.errorbar(n, 50, yerr=tol, color='blue', fmt='o')
# plot results
pyplot.show()

Running the example creates a plot showing the tolerance interval around the true population mean.

We can see that the interval becomes smaller (more precise) as the sample size is increased from 5 to 15 examples.

This section lists some ideas for extending the tutorial that you may wish to explore.

- List 3 cases where a tolerance interval could be used in a machine learning project.
- Locate a dataset with a Gaussian variable and calculate tolerance intervals for it.
- Research and describe one method for calculating a nonparametric tolerance interval.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2012.
- Statistical Intervals: A Guide for Practitioners and Researchers, 2017.

- Tolerance interval on Wikipedia
- 68–95–99.7 rule on Wikipedia
- Percentile on Wikipedia
- Tolerance intervals for a normal distribution

In this tutorial, you discovered statistical tolerance intervals and how to calculate a tolerance interval for Gaussian data.

Specifically, you learned:

- That statistical tolerance intervals provide bounds on observations from a population.
- That a tolerance interval requires that both a coverage proportion and confidence be specified.
- That the tolerance interval for a data sample with a Gaussian distribution can be easily calculated.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

