A Gentle Introduction to Degrees of Freedom in Machine Learning

By Jason Brownlee on August 19, 2020 in Statistics

Degrees of freedom is an important concept from statistics and engineering.

It is often employed to summarize the number of values used in the calculation of a statistic, such as a sample statistic or in a statistical hypothesis test.

In machine learning, the degrees of freedom may refer to the number of parameters in the model, such as the number of coefficients in a linear regression model or the number of weights in a deep learning neural network.

The concern is that a model with more degrees of freedom (model parameters) is expected to overfit the training dataset. This is the common understanding from statistics. This expectation can be overcome through the use of regularization techniques, such as regularized linear regression and the suite of regularization methods available for deep learning neural network models.

In this post, you will discover degrees of freedom in statistics and machine learning.

After reading this post, you will know:

Degrees of freedom generally represents the number of points of control of a system.
In statistics, degrees of freedom is the number of observations used to calculate a statistic.
In machine learning, degrees of freedom is the number of parameters of a model.

Let’s get started.

Overview

This tutorial is divided into three parts; they are:

Degrees of Freedom
Degrees of Freedom in Statistics
Degrees of Freedom in Machine Learning
1. Degrees of Freedom for a Linear Regression Model
2. Degrees of Freedom for Linear Regression Error
3. Total Degrees of Freedom for Linear Regression
4. Negative Degrees of Freedom
5. Degrees of Freedom and Overfitting

Degrees of Freedom

Degrees of freedom represent the number of points of control of a system, model, or calculation.

Each independent parameter that can change is a separate dimension in a d-dimensional space that defines the scope of values that may influence the system, where the specific observed or specified values are a single point in that space.

Mathematically, the degrees of freedom is often represented using the Greek letter nu, which looks like a lower-case “v”.

It may also be abbreviated as “d.o.f,” “dof,” “d.f.,” or simply “df.”

Degrees of freedom is a term from statistics and engineering and may be used in machine learning.

Degrees of Freedom in Statistics

In statistics, the degrees of freedom is the number of values used in the calculation of a statistic that are free to vary.

Degrees of freedom: Roughly, the minimum amount of data needed to calculate a statistic. More practically, it is a number, or numbers, used to approximate the number of observations in the data set for the purpose of determining statistical significance.

— Page 60, Statistics in Plain English, 3rd Edition, 2010.

It is calculated as the number of independent values used in the calculation of the statistic minus the number of statistics calculated.

degrees of freedom = number of independent values – number of statistics

For example, we may have 50 independent samples and we wish to calculate a statistic of the sample, like the mean. All 50 samples are used in the calculation and there is one statistic. Once the mean is fixed, only 49 of the values can change freely, because the last value is determined by the other 49 and the mean. The number of degrees of freedom, in this case, is calculated as:

degrees of freedom = number of independent values – number of statistics
degrees of freedom = 50 – 1
degrees of freedom = 49
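
One place this n – 1 shows up in practice is the sample variance, which divides by the degrees of freedom rather than by the number of values. The sketch below uses only Python's standard statistics module; the 50 data values are made up for illustration.

```python
import statistics

# Hypothetical sample of 50 values (made up for illustration).
data = [float(i % 7) for i in range(50)]

n = len(data)
n_statistics = 1  # the sample mean is estimated first
dof = n - n_statistics

mean = statistics.mean(data)

# Deviations from the mean always sum to (approximately) zero, so once 49 of
# them are known the 50th is fixed. This is why one degree of freedom is lost.
deviation_sum = sum(x - mean for x in data)

# The sample variance divides the sum of squares by the degrees of freedom.
sum_of_squares = sum((x - mean) ** 2 for x in data)
manual_variance = sum_of_squares / dof

print("degrees of freedom:", dof)
print("sum of deviations:", deviation_sum)
print("manual variance:", manual_variance)
print("statistics.variance:", statistics.variance(data))
```

The manual calculation and statistics.variance() should agree, because the library function also divides by n – 1.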

Degrees of freedom is often an important consideration in data distributions and statistical hypothesis tests. For example, it used to be common to have tables of statistical test critical values calculated for different common degrees of freedom (before calculating the statistic directly was easy and common).

So far, so good, but what about a model fit from data, such as in machine learning?

Degrees of Freedom in Machine Learning

In predictive modeling, the degrees of freedom often refers to the number of parameters in the model that are estimated from data.

It can also account for both the coefficients of the model and the data used to calculate the error of the model.

The best case for understanding this is with a linear regression model.

Degrees of Freedom for a Linear Regression Model

Consider a linear regression model for a dataset that has two input variables.

We will require one coefficient in the model for each of the input variables, i.e. the model will have two parameters.

This model looks as follows, where x1 and x2 are the input variables and beta1 and beta2 are the model parameters.

yhat = x1 * beta1 + x2 * beta2

This linear regression model has two degrees of freedom because there are two parameters in the model that must be estimated from a training dataset. Adding one more column to the data (one more input variable) would add one more degree of freedom for the model.

model degrees of freedom = number of parameters estimated from data
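
To make this concrete, the sketch below fits the two-parameter model to synthetic data with NumPy and counts the estimated coefficients. The data, the random seed, and the "true" coefficient values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: 100 rows and two input variables (x1, x2).
X = rng.normal(size=(100, 2))
true_beta = np.array([0.5, -2.0])  # made-up coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Least squares estimate of beta1 and beta2 (no intercept, matching yhat above).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# One degree of freedom per parameter estimated from the data.
model_dof = beta.size

print("estimated coefficients:", beta)
print("model degrees of freedom:", model_dof)
```

Adding a third column to X would give a third coefficient and therefore a third degree of freedom.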

It is common to describe the complexity of a model fit from data based on the number of parameters that were fit.

For example, the complexity of a linear regression model with two parameters is equal to the degrees of freedom, which in this case is 2. We often prefer lower complexity models over higher complexity models. Simpler models often generalize better.

The degrees of freedom are an accounting of how many parameters are estimated by the model and, by extension, a measure of complexity for linear regression models.

— Page 71, Applied Predictive Modeling, 2013.

It’s not over yet.

Degrees of Freedom for Linear Regression Error

The number of training examples matters and impacts the overall degrees of freedom for the regression model.

Consider that the coefficients of the linear regression model are fit using a training dataset with 100 rows or examples.

The model is fit by minimizing the error between the model predictions and the expected output values. The total error of the model has one degree of freedom for each example in the training dataset minus the number of parameters estimated from the data.

In this case, the model error has 100 – 2, or 98, degrees of freedom: 100 observations minus the 2 parameters of the model.

model error degrees of freedom = number of observations – number of parameters
model error degrees of freedom = 100 – 2
model error degrees of freedom = 98
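
A common use of the error degrees of freedom is in estimating the noise variance from the residuals: under the usual linear regression assumptions, the residual sum of squares is divided by the error degrees of freedom rather than by the number of observations. The sketch below shows this with NumPy on synthetic data; the seed, coefficients, and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

n_obs, n_params = 100, 2

# Synthetic data with a known noise standard deviation of 0.5 (variance 0.25).
X = rng.normal(size=(n_obs, n_params))
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.5, size=n_obs)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

error_dof = n_obs - n_params
rss = float(residuals @ residuals)  # residual sum of squares

# Dividing by the error degrees of freedom corrects for the two fitted parameters.
residual_variance = rss / error_dof
naive_variance = rss / n_obs

print("model error degrees of freedom:", error_dof)
print("residual variance (RSS / dof):", residual_variance)
print("naive variance (RSS / n):", naive_variance)
```

The naive estimate is always a little smaller, because the fitted coefficients have already absorbed some of the variation in the data.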

It is often good practice to report the error of a linear model, like linear regression, including the degrees of freedom of the error.

At the very least, the number of observations in the training data can be included so that the model error degrees of freedom can be determined.

Total Degrees of Freedom for Linear Regression

The total degrees of freedom for the linear regression model is taken as the sum of the model degrees of freedom plus the model error degrees of freedom.

linear regression degrees of freedom = model degrees of freedom + model error degrees of freedom
linear regression degrees of freedom = 2 + 98
linear regression degrees of freedom = 100

Generally, the total degrees of freedom is equal to the number of rows of training data used to fit the model.

Consider a dataset with 100 rows of data as before, but now we have 70 input variables.

This means that the model has 70 coefficients or parameters fit from the data. The model error would therefore have 100 – 70, or 30, degrees of freedom.

The total degrees of freedom for the model is still equal to the number of rows, or 70 + 30.
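
The accounting so far can be collected into a small helper. This is a minimal sketch; the function name regression_degrees_of_freedom is hypothetical and simply mirrors the formulas used in this post.

```python
def regression_degrees_of_freedom(n_rows, n_params):
    """Return (model, error, total) degrees of freedom for a linear regression."""
    model_dof = n_params            # one per parameter estimated from data
    error_dof = n_rows - n_params   # observations minus parameters
    total_dof = model_dof + error_dof
    return model_dof, error_dof, total_dof


# The two worked examples: 100 rows with 2 inputs, then 100 rows with 70 inputs.
for n_rows, n_params in [(100, 2), (100, 70)]:
    model_dof, error_dof, total_dof = regression_degrees_of_freedom(n_rows, n_params)
    print(f"rows={n_rows} params={n_params}: "
          f"model={model_dof} error={error_dof} total={total_dof}")
```

In both cases the total comes back to the number of rows, because the parameter count is added and then subtracted.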

Negative Degrees of Freedom

What happens when we have more columns than rows of data?

For example, we may have 100 rows of data and 10,000 variables, such as gene markers for 100 patients.

A linear regression model would therefore have 10,000 parameters, meaning the model would have 10,000 degrees of freedom.

We can calculate the model error degrees of freedom as follows:

model error degrees of freedom = number of observations – number of parameters
model error degrees of freedom = 100 – 10,000
model error degrees of freedom = -9,900

Uh oh.

And we can calculate the total degrees of freedom as follows:

linear regression degrees of freedom = model degrees of freedom + model error degrees of freedom
linear regression degrees of freedom = 10,000 + -9,900
linear regression degrees of freedom = 100

The model has 100 total degrees of freedom, but the model error has a negative number of degrees of freedom.

A negative value is a valid result of this calculation.

It suggests that we have more statistics than we have values that can change. In this case, we have more parameters in the model than we have rows of data or observations to train the model.

This is the so-called p >> n problem: having many more predictors (p) than samples (n).
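
A quick check with NumPy makes this situation concrete. The sketch below uses a small random matrix as a stand-in for the gene marker example (10 rows and 40 columns are arbitrary choices) and compares the error degrees of freedom from the formula above with the rank of the data matrix, which can never exceed the number of rows.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small stand-in for "100 rows and 10,000 variables": more columns than rows.
n_rows, n_cols = 10, 40
X = rng.normal(size=(n_rows, n_cols))

# Error degrees of freedom from the formula: observations minus parameters.
naive_error_dof = n_rows - n_cols

# The rank counts the independent directions of information in the data.
rank = np.linalg.matrix_rank(X)

print("model error degrees of freedom:", naive_error_dof)
print("rank of the data matrix:", rank)
print("number of parameters:", n_cols)
```

Because the rank is capped by the number of rows, the data cannot pin down all of the coefficients uniquely, which is what the negative value is signaling.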

Degrees of Freedom and Overfitting

The problem is that when we have more parameters than observations, there is a risk of overfitting the training dataset.

This is intuitive if we think of each coefficient in the model as a point of control. If we have more points of control in the model than we have observations, we can, in theory, configure the model to predict the training dataset correctly and exactly. Learning the details of the training dataset at the expense of performing well on new data is the definition of overfitting.
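
We can see this effect in a small experiment. In the sketch below the targets are pure random noise, so there is nothing real to learn, yet a linear model with more coefficients than training rows still reproduces the training targets almost exactly. The sizes and seed are arbitrary assumptions; np.linalg.lstsq returns the minimum-norm solution when the system is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, n_features = 20, 200, 50  # more features than training rows

# Inputs and targets are independent noise: there is no true relationship.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Underdetermined least squares: the minimum-norm exact solution is returned.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ beta - y_train) ** 2)
test_mse = np.mean((X_test @ beta - y_test) ** 2)

print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

The training error is effectively zero while the error on new data is not, which is the overfitting pattern described above.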

This is the general concern that statisticians have about deep learning neural network models.

That is, deep learning models often have many more parameters (model weights, sometimes billions of them) than samples and, based on our understanding of linear models, are expected to overfit.

Nevertheless, through careful selection of model architectures and regularization techniques, they can be prevented from overfitting and maintain low generalization error.
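
The effect of regularization is easiest to see on a linear model rather than a deep network. For ridge regression, a quantity often called the effective degrees of freedom is the trace of the matrix that maps training targets to fitted values, and it shrinks as the penalty grows even though the number of coefficients stays the same. The sketch below computes it from the singular values of a random data matrix; the function name ridge_effective_dof, the data, and the penalty values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_cols = 20, 50  # more coefficients than rows
X = rng.normal(size=(n_rows, n_cols))


def ridge_effective_dof(X, alpha):
    """Trace of X (X^T X + alpha I)^-1 X^T, computed from singular values."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + alpha)))


# The model always has 50 coefficients, but the effective count depends on alpha.
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    dof = ridge_effective_dof(X, alpha)
    print(f"alpha={alpha:>8}: effective degrees of freedom = {dof:.2f}")
```

With a tiny penalty the value is close to the number of rows, and with a very large penalty it approaches zero, so the parameter count alone does not describe how flexible the fitted model is.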

Further, in deep models, the effective degrees of freedom may be decoupled from the number of parameters in the model.

We showed that for simple classification models, degrees of freedom is equal to the number of parameters in the model. In deep networks, the degrees of freedom is generally much less than the number of parameters in the model, and deeper networks tend to have less degrees of freedom.

— Degrees of Freedom in Deep Neural Networks, 2016.

As such, there is a growing trend among statisticians and machine learning practitioners to move away from degrees of freedom as both a proxy for model complexity and a predictor of overfitting.

To most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. […] We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly.

— Effective Degrees Of Freedom: A Flawed Metaphor, 2013.

Summary

In this post, you discovered degrees of freedom in statistics and machine learning.

Specifically, you learned:

Degrees of freedom generally represents the number of points of control of a system.
In statistics, degrees of freedom is the number of observations used to calculate a statistic.
In machine learning, degrees of freedom is the number of parameters of a model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

14 Responses to A Gentle Introduction to Degrees of Freedom in Machine Learning

Juan Jesus April 24, 2020 at 5:28 pm #

Hi Jason, congratulations for your website, I think that it’s really useful for scientists, developers and simply hobbyists. I would like to ask you if you had a dataset which contains only images, could you calculate the degrees of freedom from it? Thanks in advance.

- Jason Brownlee April 25, 2020 at 6:42 am #
  
  Thanks.
  
  What do you mean by degrees of freedom for a dataset or data?
  
  - KVS Setty May 27, 2020 at 12:32 pm #
    
    Degrees of freedom is related to the model and not to dataset or data
    
    - Jason Brownlee May 27, 2020 at 1:32 pm #
      
      Thanks.
      
Iñigo April 26, 2020 at 6:43 pm #

Hello Jason, I have the idea that if you have a system of equations with fewer equations than variables to calculate, you have infinitely many possible results. And therefore, there is no solution possible.

In this article I understood that if you have more parameters to estimate than rows of data, you will overfit. But you will still be able to calculate the parameters.

As I imagine, you are talking here about reusing the same data many times and therefore actually having enough data to train, but overfitting because of this. Am I right?

Keep up the good work!

- Jason Brownlee April 27, 2020 at 5:32 am #
  
  Maybe.
  
Pranab Narayan Jha April 29, 2020 at 4:24 pm #

Thank you for the post, Jason. I enjoy your emails and posts. Keep up the good work.

- Jason Brownlee April 30, 2020 at 6:33 am #
  
  You’re very welcome!
  
SiHun Lee September 29, 2020 at 3:23 am #

Hello Dr. Jason,
I am wondering what would be the best way to perform interpolation/prediction for multi-variate dataset on multiple parameters including time.

I initially thought of using LSTM and embedding other parameters in the ‘features’ along with variables but had no luck.

The nature of multi-variables according to parameters are quite nonlinear except for time.

Any suggestions on the model I should be using?

- Jason Brownlee September 29, 2020 at 5:43 am #
  
  The best way is to test a suite of techniques and discover what works best for your specific dataset:
  https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
  
Gicela February 24, 2022 at 5:02 am #

Hello Jason, Thank you for the post.

Please, could you help me? I need to fit metamodels. I am using the root mean square error of approximation, RMSEA (ε) = sqrt((X^2 - df) / (df(N - 1))), because values of ε smaller than 0.05 are typically considered to indicate good fit (Browne & Cudeck, 1993). Browne and Cudeck (1993) further recommended interpreting values ranging from 0.05 to 0.08 as fair model fit, and values greater than 0.10 as poor fit. MacCallum et al. (1996) suggested that values in the range from 0.08 to 0.10 indicate mediocre fit.
The RMSEA formula requires the degrees of freedom (df). How do I calculate df in this case?
Thanks.

- James Carmichael February 24, 2022 at 12:46 pm #
  
  Hi Gicela…Please see my email response to your question.
  
Reeta Sahu July 17, 2023 at 2:44 pm #

Hi! I think the degrees of freedom of the mean should be the number of observations, which in the above example would be 50. That is because no other statistic is required to calculate the mean. Another reason could be that if you have one data point, you can say the average is that single point.

- James Carmichael July 18, 2023 at 7:36 am #
  
  Hi Reeta…The following resource presents a more detailed mathematical description:
  
  https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/degrees-of-freedom/
  
