 # Linear Discriminant Analysis for Machine Learning

Last Updated on August 15, 2020

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems.

If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique.

In this post you will discover the Linear Discriminant Analysis (LDA) algorithm for classification predictive modeling problems. After reading this post you will know:

• The limitations of logistic regression and the need for linear discriminant analysis.
• The representation of the model that is learned from data and can be saved to file.
• How the model is estimated from your data.
• How to make predictions from a learned LDA model.
• How to prepare your data to get the most from the LDA model.

This post is intended for developers interested in applied machine learning, how the models work and how to use them well. As such no background in statistics or linear algebra is required, although it does help if you know about the mean and variance of a distribution.

LDA is a simple model in both preparation and application. There are some interesting statistics behind how the model is set up and how the prediction equation is derived, but they are not covered in this post.


Let’s get started.

## Limitations of Logistic Regression

Logistic regression is a simple and powerful linear classification algorithm. It also has limitations that point to the need for alternate linear classification algorithms.

• Two-Class Problems. Logistic regression is intended for two-class or binary classification problems. It can be extended for multi-class classification, but is rarely used for this purpose.
• Unstable With Well Separated Classes. Logistic regression can become unstable when the classes are well separated.
• Unstable With Few Examples. Logistic regression can become unstable when there are few examples from which to estimate the parameters.

Linear Discriminant Analysis does address each of these points and is the go-to linear method for multi-class classification problems. Even with binary-classification problems, it is a good idea to try both logistic regression and linear discriminant analysis.

## Representation of LDA Models

The representation of LDA is straightforward.

It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is the mean and the variance of the variable for each class. For multiple variables, these are the same properties calculated over the multivariate Gaussian, namely the means and the covariance matrix.

These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. These are the model values that you would save to file for your model.
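To make the idea of "model values saved to file" concrete, here is a minimal sketch for the single-variable case. The dictionary layout, the key names and every number in it are made up for illustration, and the class priors are included only because the prediction equation later in the post needs them.

```python
import json

# Hypothetical learned values for a two-class, single-input LDA model.
# All numbers here are invented for illustration.
model = {
    "classes": ["0", "1"],
    "means": {"0": 1.2, "1": 3.4},   # one mean per class
    "variance": 0.5,                  # one variance shared by all classes
    "priors": {"0": 0.5, "1": 0.5},   # base rate of each class
}

# Serialize to a JSON string (this is what you would write to disk) ...
serialized = json.dumps(model, indent=2)

# ... and read it back to recover exactly the same model values.
restored = json.loads(serialized)
print(restored["means"])
```

The point is only that a learned LDA model is a handful of numbers; any storage format would do.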

Let’s look at how these parameters are estimated.


## Learning LDA Models

LDA makes some simplifying assumptions about your data:

1. That your data is Gaussian, that each variable is shaped like a bell curve when plotted.
2. That each attribute has the same variance in every class, that values of each variable vary around the class mean by the same amount on average.

With these assumptions, the LDA model estimates the mean and variance from your data for each class. It is easy to think about this in the univariate (single input variable) case with two classes.

The mean (mu) value of each input (x) for each class (k) can be estimated in the normal way by dividing the sum of values by the total number of values.

muk = 1/nk * sum(x)

Where muk is the mean value of x for the class k, nk is the number of instances with class k. The variance is calculated across all classes as the average squared difference of each value from the mean.

sigma^2 = 1 / (n-K) * sum((x - mu)^2)

Where sigma^2 is the variance across all inputs (x), n is the number of instances, K is the number of classes and mu is the mean of the class that each x belongs to.
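The two estimates above can be sketched in a few lines of plain Python. This is an illustrative sketch for the single-variable case only; the function name `fit_lda_1d` and the toy data are assumptions, not part of any library.

```python
from collections import defaultdict

def fit_lda_1d(xs, ys):
    """Estimate per-class means and the shared variance for one input variable."""
    # Group the input values by class label.
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)

    n = len(xs)       # number of instances
    K = len(groups)   # number of classes

    # muk = 1/nk * sum(x), using only the instances in class k
    means = {k: sum(values) / len(values) for k, values in groups.items()}

    # sigma^2 = 1/(n-K) * sum((x - mu)^2), where each x is compared
    # with the mean of its own class
    squared_error = sum((x - means[y]) ** 2 for x, y in zip(xs, ys))
    variance = squared_error / (n - K)
    return means, variance

# Toy data: three instances in each of two classes.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

means, variance = fit_lda_1d(xs, ys)
print(means)     # {0: 2.0, 1: 7.0}
print(variance)  # 1.0
```

Here the squared differences sum to 4 and n-K is 6-2 = 4, which gives a shared variance of 1.0.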

## Making Predictions with LDA

LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class and a prediction is made.

The model uses Bayes’ Theorem to estimate the probabilities. Briefly, Bayes’ Theorem can be used to estimate the probability of the output class (k) given the input (x) using the probability of each class and the probability of the data belonging to each class:

P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

Where PIk refers to the base probability of each class (k) observed in your training data (e.g. 0.5 for a 50-50 split in a two class problem). In Bayes’ Theorem this is called the prior probability.

PIk = nk/n

The f(x) above is the estimated probability of x belonging to the class. A Gaussian distribution function is used for f(x). Plugging the Gaussian into the above equation and simplifying we end up with the equation below. This is called a discriminant function, and the class with the largest value is the output classification (y):

Dk(x) = x * (muk/sigma^2) - (muk^2/(2*sigma^2)) + ln(PIk)

Dk(x) is the discriminant function for class k given input x. The values muk, sigma^2 and PIk are all estimated from your data.
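As a rough illustration, the sketch below evaluates both the Bayes' Theorem expression and the discriminant function for a single input variable and shows that they pick the same class. The function names and the parameter values are assumptions chosen for the example (two classes with means 2 and 7, a shared variance of 1 and equal priors).

```python
import math

def discriminant(x, mu, var, prior):
    # Dk(x) = x * (muk / sigma^2) - muk^2 / (2 * sigma^2) + ln(PIk)
    return x * (mu / var) - mu ** 2 / (2 * var) + math.log(prior)

def gaussian_pdf(x, mu, var):
    # Gaussian density used for fk(x)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, means, var, priors):
    # P(Y=k|X=x) = PIk * fk(x) / sum over all classes l of PIl * fl(x)
    joint = {k: priors[k] * gaussian_pdf(x, means[k], var) for k in means}
    total = sum(joint.values())
    return {k: value / total for k, value in joint.items()}

def predict(x, means, var, priors):
    # The class with the largest discriminant value is the prediction.
    scores = {k: discriminant(x, means[k], var, priors[k]) for k in means}
    return max(scores, key=scores.get)

# Assumed model values for illustration.
means = {0: 2.0, 1: 7.0}
var = 1.0
priors = {0: 0.5, 1: 0.5}

for x in (3.0, 5.5):
    probs = posterior(x, means, var, priors)
    print(x, predict(x, means, var, priors), probs)
```

With equal priors and a shared variance the decision boundary sits halfway between the two means (4.5 here), so 3.0 is assigned to class 0 and 5.5 to class 1.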

## How to Prepare Data for LDA

This section lists some suggestions you may consider when preparing your data for use with LDA.

• Classification Problems. This might go without saying, but LDA is intended for classification problems where the output variable is categorical. LDA supports both binary and multi-class classification.
• Gaussian Distribution. The standard implementation of the model assumes a Gaussian distribution of the input variables. Consider reviewing the univariate distributions of each attribute and using transforms to make them more Gaussian-looking (e.g. log and root for exponential distributions and Box-Cox for skewed distributions).
• Remove Outliers. Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA such as the mean and the standard deviation.
• Same Variance. LDA assumes that each input variable has the same variance in every class. It is also almost always a good idea to standardize your data before using LDA so that each variable has a mean of 0 and a standard deviation of 1.
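The standardization suggestion can be sketched as follows for one input variable. This is a minimal example under a few assumptions: the helper names are made up, the data is a toy sample, and the population standard deviation is used (the sample version would work just as well). The statistics are calculated on the training data only and then reused for the test data.

```python
import statistics

def fit_standardizer(train_values):
    # Learn the mean and standard deviation from the training data only.
    mean = statistics.mean(train_values)
    stdev = statistics.pstdev(train_values)  # population standard deviation
    return mean, stdev

def standardize(values, mean, stdev):
    # Rescale so the training data has mean 0 and standard deviation 1.
    return [(v - mean) / stdev for v in values]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mean, stdev = fit_standardizer(train)
train_z = standardize(train, mean, stdev)
test_z = standardize(test, mean, stdev)  # reuses the training statistics

print(mean, stdev)  # 5.0 2.0
print(test_z)       # [0.0, 3.0]
```

Reusing the training mean and standard deviation keeps information from the test data out of the data preparation step.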

## Extensions to LDA

Linear Discriminant Analysis is a simple and effective method for classification. Because it is simple and so well understood, there are many extensions and variations to the method. Some popular extensions include:

• Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
• Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used, such as splines.
• Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
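To give a feel for the first extension, here is a small sketch of the QDA idea for a single input variable, where each class keeps its own variance. The discriminant used is the log of the prior times the Gaussian density with constant terms dropped; the function names and the parameter values are assumptions chosen to make the effect visible.

```python
import math

def qda_discriminant(x, mu, var, prior):
    # Log of PIk * Gaussian density, dropping terms shared by all classes.
    # Unlike LDA, var is specific to the class.
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(prior)

# Assumed per-class values: class 1 is much more spread out than class 0.
means = {0: 0.0, 1: 3.0}
variances = {0: 1.0, 1: 9.0}
priors = {0: 0.5, 1: 0.5}

def predict(x):
    scores = {
        k: qda_discriminant(x, means[k], variances[k], priors[k])
        for k in means
    }
    return max(scores, key=scores.get)

for x in (-6.0, 0.0, 4.0):
    print(x, predict(x))
```

Class 1 is predicted on both sides of class 0 in this example, which a single linear boundary cannot produce. That is the practical difference per-class variances make.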

The original development was called the Linear Discriminant or Fisher’s Discriminant Analysis. The multi-class version was referred to as Multiple Discriminant Analysis. These are all simply referred to as Linear Discriminant Analysis now.

This section provides some additional resources if you are looking to go deeper. I have to credit the book An Introduction to Statistical Learning: with Applications in R. Some of the description and the notation in this post was taken from this text, and it’s excellent.

## Summary

In this post you discovered Linear Discriminant Analysis for classification predictive modeling problems. You learned:

• The model representation for LDA and what is actually distinct about a learned model.
• How the parameters of the LDA model can be estimated from training data.
• How the model can be used to make predictions on new data.
• How to prepare your data to get the most from the method.


### 50 Responses to Linear Discriminant Analysis for Machine Learning

1. Shaksham Kapoor June 6, 2017 at 6:32 pm #

I’m not able to understand these equations :

P(Y=x|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

PIk = nk/n ……. I know what Baye’s theorm is but what does fk(x), PII and fl(x) represent ?

2. Pia Laine August 9, 2017 at 1:58 am #

Hi Shaksham,

this is probably late but if you use the notation:

P(Y=y|X=x) = P(X=x|Y=y) * P(Y=y) / P(X=x),

you can see that fk(x) stands for P(X=x|Y=y). I think in the denominator we are just summing up the stuff across classes. The original notation seems a bit confusing to me also since in some context I’ve seen the Greek letter ‘pi’ to be used for the posterior probability itself.

3. Pia Laine August 9, 2017 at 1:59 am #

Hi Jason,

I’m having hard time understanding the term n-K in the variance equation:

sigma^2 = 1 / (n-K) * sum((x – mu)^2)

Could you clarify?

Thanks!

• Jason Brownlee August 9, 2017 at 6:43 am #

“n is the number of instances, K is the number of classes”.

• Pia Laine August 9, 2017 at 3:40 pm #

Yep, thanks, I noticed that but ‘minus K’ just didn’t seem intuitive to me. 🙂

• Aleksa Mihajlovic May 24, 2020 at 6:01 am #

N-k, we take out k degrees of freedom, because we have k classes (unless I am wrong)

4. Madeleine October 12, 2017 at 8:56 pm #

Dear Jason,

I have 2 questions regarding the case of p predictors > 1.

1. How do we estimate mu for each K class when we have more than one predictor?
2. How do we estimate the pxp common covariance matrix for all K groups?

Thank you very much in advance,
I hope my question is clear enough

best,

• Jason Brownlee October 13, 2017 at 5:47 am #

Great question. I’d recommend a good textbook, perhaps start with: An Introduction to Statistical Learning.

5. Aniket Saxena January 9, 2018 at 2:19 pm #

Hi Jason,

Can you please refer me some stuff from where I can learn flexible and regularized discriminant analysis?

Best,
Aniket

• Jason Brownlee January 9, 2018 at 3:20 pm #

Perhaps a good textbook?

6. om May 29, 2018 at 8:02 am #

He is talking about different use of LDA.

Jason do you want to comment ?

• Jason Brownlee May 29, 2018 at 2:50 pm #

7. statAstrologer October 12, 2018 at 6:42 pm #

Hello sir , I wanted to know why do we call this classification technique as “analysis” ?

• Jason Brownlee October 13, 2018 at 6:08 am #

It is just an old name for the method.

8. SIMM October 24, 2018 at 1:38 pm #

In fisher linear discriminant, how can we classify a new sample?

• Jason Brownlee October 24, 2018 at 2:49 pm #

See the section titled “Making Predictions with LDA”.

9. manef November 4, 2018 at 12:21 am #

Hi sure, can you tell me the different between the fisher’s discriminant analyse FDA and the linear discriminant analysis ?
Thnaks

• Jason Brownlee November 4, 2018 at 6:27 am #

Not off hand, sorry.

10. Hemanga November 26, 2018 at 4:26 pm #

what is the value of label in linear discriminent analysis (LDA) coding shoul be taken

• Jason Brownlee November 27, 2018 at 6:31 am #

Sorry, I don’t follow, perhaps you can elaborate or rephrase your question?

11. Vishal July 6, 2019 at 4:40 am #

Hi,
As written in your articles, Unstable With Well Separated Classes. Logistic regression can become unstable when the classes are well separated.

Can you please tell me what does well separated classed mean?

• Jason Brownlee July 6, 2019 at 8:44 am #

Linear separation refers to the ability to separate instances by class using a line or hyperplane in input feature space.

12. Yihan Ma July 15, 2019 at 7:38 pm #

1. Based on my understanding, for classification, training data and testing data should be separated. When reducing the dimension by LDA, I should combine training data and testing data together to reduce dimension, or just reduce training data dimension, and use eigenvector W to map testing data to lower dimension?

2. For standardized you mentioned in this article, I should standardize the whole data together, or just standardized the training data, and use the same scale mapping the testing data?

• Jason Brownlee July 16, 2019 at 8:15 am #

Ideally, any data preparation is calculated using the training data only, then applied/used to prepare train and test data, e.g. calculating mean/stdev/etc.

13. Yihan Ma July 16, 2019 at 5:37 pm #

So you mean I should just use training data to acquire eigenvector W, and mapping test data by W to reduce dimension?

And for the standardized dataset, I also should separate considering the training and test dataset.

• Jason Brownlee July 17, 2019 at 8:20 am #

I believe so.

14. Astarag Mohapatra September 5, 2019 at 10:45 pm #

Why Logistic Regression is unstable with well-separated classes?

15. Amol September 16, 2019 at 2:16 pm #

i do have a question regarding our discriminant function calculation :-

Dk(x) = x * (muk/siga^2) – (muk^2/(2*sigma^2)) + ln(PIk)

for e.g. if i have a 2 class problem and have 5 rows / instances of data which does comprise of 4 features.
I would need to find Dk(x) with 2 classes for all 5 rows that provides me with 2 values of discriminant function for each row. For a single feature in row i can calculate discriminant but how should i proceed if i have more than 1 feature.
What value of x is passed in case of multi feature data to calculate discriminant function value across 2 classes.

• Jason Brownlee September 17, 2019 at 6:22 am #

Good question, perhaps reference the description in “An Introduction to Statistical Learning with Applications in R”
https://amzn.to/34Onv5J

16. Sam December 15, 2019 at 6:52 pm #

Hello Sir, Can we use LDA when our independent variables are categorical ?

Thanks,
Sam

• Jason Brownlee December 16, 2019 at 6:15 am #

No, LDA assumes numerical input variables.

17. Allan May 6, 2020 at 8:36 am #

Hi, is that correct that LDA/FDA can only generate 2 output?

• Jason Brownlee May 6, 2020 at 1:36 pm #

No, LDA is inherently multi-class.

• Allan May 6, 2020 at 7:18 pm #

How about in the context of dimensional reduction method?

Suppose I have 100 features, I want to reduce to 5 features. Is LDA/FDA only generate 2 output? (my supervisor said this)

• Jason Brownlee May 7, 2020 at 6:46 am #

Great question – but no difference.

I have a tutorial written and scheduled on exactly this topic due to be published in a few days. Keep an eye on the blog.

• Allan May 7, 2020 at 7:37 pm #

Is this the topic you mean https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/
Let me know if not correct, and any update.
(I am not good in searching a new layout)

• Jason Brownlee May 8, 2020 at 6:29 am #

No, I have a tutorial on LDA for dimensionality reduction scheduled for next week.

• Allan May 13, 2020 at 5:49 am #

• Allan May 13, 2020 at 5:54 am #
• Jason Brownlee May 13, 2020 at 6:46 am #

Yes.

18. Emma Wileman May 7, 2020 at 8:44 am #

Can I know that in the context of dimensionality reduction using LDA/FDA. LDA/FDA can start with “n” dimensions and end with k dimensions, where “k” less than “n”.  or  The output is “c-1” where “c” is the number of classes and the dimensionality of the data is n with “n>c”.

• Jason Brownlee May 7, 2020 at 11:50 am #

Yes, you can use LDA for dimensionality reduction and the number of resulting dimensions can be chosen as a parameter, less than the number of classes.

• Emma Wileman May 7, 2020 at 6:38 pm #

Thanks. Is that correct: The output of LDA is “c-1” where “c” is the number of classes and the dimensionality of the data is n with “n > c”.

Let say my original dataset has 2 classes, the output will be 1 dimensionality ( 2 – 1 =1 ), likewise, if my original dataset has 5 classes, the output will be 4 dimensionality.

• Jason Brownlee May 8, 2020 at 6:29 am #
• Emma Wileman May 8, 2020 at 6:59 am #

It said: In LDA, the number of linear discriminants is at most c−1 where c is the number of class labels, since the in-between scatter matrix SB is the sum of c matrices with rank 1 or less.

I think is correct: Let say my original dataset has 2 classes, the output will be 1 dimensionality ( 2 – 1 =1 ), likewise, if my original dataset has 5 classes, the output will be 4 dimensionality.

Can we have a discussion on this? I just want to make sure I am not confused.

19. Md Azharuddin May 30, 2020 at 10:19 pm #

Hello Jason, I wanted to know what if there are multiple features in my dataset (X1, X2, X3,…) then how am I supposed to calculate the discriminate, as the discrminate function expects a single ‘X’?

20. Ben Specter July 1, 2020 at 4:47 pm #

Hi Jason, how to apply the LDA algorithm in online learning? For example, I want to be able to train the model with some initial training data and then update it with new data points

• Jason Brownlee July 2, 2020 at 6:15 am #

Perhaps custom code and you can incrementally re-estimate the coefficients of the model as new data comes in?

Or just refit the model each time a new block of data with known targets becomes available.