A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.
The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to hang onto and learns exactly what those instances should look like.
In this post you will discover the Learning Vector Quantization algorithm. After reading this post you will know:
- The representation used by the LVQ algorithm that you actually save to a file.
- The procedure that you can use to make predictions with a learned LVQ model.
- How to learn an LVQ model from training data.
- The data preparation to use to get the best performance from the LVQ algorithm.
- Where to look for more information on LVQ.
This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.
If you have any questions on LVQ, leave a comment and I will do my best to answer.
Kick-start your project with my new book Master Machine Learning Algorithms, including step-by-step tutorials and the Excel Spreadsheet files for all examples.
Let’s get started.
LVQ Model Representation
The representation for LVQ is a collection of codebook vectors.
LVQ was developed and is best understood as a classification algorithm. It supports both binary (two-class) and multi-class classification problems.
A codebook vector is a list of numbers that has the same input and output attributes as your training data. For example, if your problem is binary classification with classes 0 and 1, and the inputs are width, length, and height, then a codebook vector would comprise all four attributes: width, length, height, and class.
The model representation is a fixed pool of codebook vectors, learned from the training data. They look like training instances, but the values of each attribute have been adapted based on the learning procedure.
In the language of neural networks, each codebook vector may be called a neuron, each attribute on a codebook vector is called a weight and the collection of codebook vectors is called a network.
Making Predictions with an LVQ Model
Predictions are made using the LVQ codebook vectors in the same way as K-Nearest Neighbors.
Predictions are made for a new instance (x) by searching through all codebook vectors for the K most similar instances and summarizing the output variable for those K instances. For classification this is the mode (or most common) class value.
Typically predictions are made with K=1, and the single most similar codebook vector is called the Best Matching Unit (BMU).
To determine which codebook vectors are most similar to a new input, a distance measure is used. For real-valued input variables, the most popular distance measure is Euclidean distance. Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across each attribute j.
EuclideanDistance(x, xi) = sqrt( sum( (xj - xij)^2 ) )
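To make the distance calculation and the K=1 prediction concrete, here is a minimal Python sketch. The function names and the data layout (each row a list of input values with the class label in the last position) are illustrative assumptions, not part of the original post.

```python
from math import sqrt

def euclidean_distance(row1, row2):
    # Sum the squared differences over the input attributes,
    # skipping the class label stored in the last position of row1.
    distance = 0.0
    for j in range(len(row1) - 1):
        distance += (row1[j] - row2[j]) ** 2
    return sqrt(distance)

def predict(codebooks, new_row):
    # K=1 prediction: return the class label of the Best Matching Unit.
    bmu = min(codebooks, key=lambda c: euclidean_distance(c, new_row))
    return bmu[-1]
```

For example, predict([[2.0, 3.0, 0], [7.0, 8.0, 1]], [6.5, 8.2, 1]) returns 1, the class of the nearer codebook vector.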
Learning an LVQ Model From Data
The LVQ algorithm learns the codebook vectors from the training data.
You must choose the number of codebook vectors to use, such as 20 or 40. You can find the best number of codebook vectors to use by testing different configurations on your training dataset.
The learning algorithm starts with a pool of random codebook vectors. These could be randomly selected instances from the training data, or randomly generated vectors with the same scale as the training data. Codebook vectors have the same number of input attributes as the training data. They also have an output class variable.
The instances in the training dataset are processed one at a time. For a given training instance, the most similar codebook vector is selected from the pool.
If the codebook vector has the same output as the training instance, the codebook vector is moved closer to the training instance. If it does not match, it is moved further away. The amount that the vector is moved is controlled by an algorithm parameter called the learning_rate.
For example, if the classes match, the input variable (x) of a codebook vector is moved closer to the training input value (t) by an amount controlled by the learning_rate, as follows:
x = x + learning_rate * (t - x)
The opposite case, moving the input variables of a codebook vector away from a training instance, is calculated as:
x = x - learning_rate * (t - x)
This would be repeated for each input variable.
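As a sketch of this update step (reusing the euclidean_distance helper from the earlier sketch; the list-based representation is an assumption for illustration):

```python
def update_codebooks(codebooks, train_row, learning_rate):
    # Select the most similar codebook vector (the winner).
    bmu = min(codebooks, key=lambda c: euclidean_distance(c, train_row))
    # Move each input attribute of the winner toward the training
    # instance if the class labels match, away from it otherwise.
    for j in range(len(train_row) - 1):
        error = train_row[j] - bmu[j]
        if bmu[-1] == train_row[-1]:
            bmu[j] += learning_rate * error
        else:
            bmu[j] -= learning_rate * error
```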
Because one codebook vector is selected for modification for each training instance, the algorithm is referred to as a winner-take-all algorithm or a type of competitive learning.
This process is repeated for each instance in the training dataset. One iteration of the training dataset is called an epoch. The process is completed for a number of epochs that you must choose (max_epoch), such as 200.
You must also choose an initial learning rate (such as alpha=0.3). The learning rate is decreased with each epoch, starting at the large value you specify at epoch 1, which makes the largest changes to the codebook vectors, and finishing with a small value near zero on the last epoch, which makes only very minor changes.
The learning rate for each epoch is calculated as:
learning_rate = alpha * (1 - (epoch / max_epoch))
Where learning_rate is the learning rate for the current epoch (0 to max_epoch-1), alpha is the initial learning rate specified at the start of the training run, and max_epoch is the total number of epochs to run the algorithm, also specified at the start of the run.
The intuition for the learning process is that the pool of codebook vectors is a compression of the training dataset to the points that best characterize the separation of the classes.
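Putting the pieces together, one plausible training loop (a sketch under the assumptions above, not a definitive implementation) initializes the pool from random training instances, decays the learning rate linearly each epoch, and applies the winner-take-all update to every training instance:

```python
import random

def train_lvq(train, n_codebooks, alpha, max_epoch):
    # Initialize the pool with copies of randomly selected training rows.
    codebooks = [list(row) for row in random.sample(train, n_codebooks)]
    for epoch in range(max_epoch):
        # Linear decay of the learning rate from alpha toward zero.
        learning_rate = alpha * (1.0 - epoch / max_epoch)
        for row in train:
            update_codebooks(codebooks, row, learning_rate)
    return codebooks
```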
Data Preparation for LVQ
Generally, it is a good idea to prepare data for LVQ in the same way as you would prepare it for K-Nearest Neighbors.
- Classification: LVQ is a classification algorithm that works for both binary (two-class) and multi-class classification problems. The technique has also been adapted for regression.
- Multiple Passes: A good technique with LVQ is to perform multiple passes of the training dataset over the codebook vectors (e.g. multiple training runs): a first run with a higher learning rate to settle the pool of codebook vectors, and a second run with a small learning rate to fine-tune the vectors.
- Multiple Best Matches: Extensions of LVQ select multiple best matching units to modify during learning, such as one of the same class and one of a different class, which are drawn toward and away from a training sample respectively. Other extensions use a custom learning rate for each codebook vector. These extensions can improve the learning process.
- Normalize Inputs: Traditionally, inputs are normalized (rescaled) to values between 0 and 1. This is to prevent one attribute from dominating the distance measure. If the input data is normalized, then the initial values for the codebook vectors can be selected as random values between 0 and 1 (a minimal rescaling sketch follows this list).
- Feature Selection: Feature selection that reduces the dimensionality of the input variables can improve the accuracy of the method. LVQ suffers from the same curse of dimensionality when making predictions as K-Nearest Neighbors.
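For the normalization point above, here is a minimal min-max rescaling sketch (the in-place, list-of-lists layout is an assumption for illustration):

```python
def normalize_dataset(dataset):
    # Rescale each input column to [0, 1] using its min and max;
    # the class label in the last column is left untouched.
    # Assumes every input column has some spread (max > min).
    n_inputs = len(dataset[0]) - 1
    for j in range(n_inputs):
        column = [row[j] for row in dataset]
        col_min, col_max = min(column), max(column)
        for row in dataset:
            row[j] = (row[j] - col_min) / (col_max - col_min)
```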
LVQ Tutorial in Python
For a step-by-step tutorial on implementing LVQ from scratch in Python, see the post: https://machinelearningmastery.com/implement-learning-vector-quantization-scratch-python/
Further Reading
The technique was developed by Kohonen, who wrote the seminal book on LVQ and its sister method, the Self-Organizing Map, titled Self-Organizing Maps.
I highly recommend this book if you are interested in LVQ.
- Learning Vector Quantization on Wikipedia.
- Learning Vector Quantization chapter from my book Nature Inspired Algorithms.
- LVQ_PAK, the official software implementation of LVQ by Kohonen.
- LVQ as plug-in for WEKA (that I created years ago).
Summary
In this post you discovered the LVQ algorithm. You learned:
- The representation for LVQ is a small pool of codebook vectors, smaller than the training dataset.
- The codebook vectors are used to make predictions using the same technique as K-Nearest Neighbors.
- The codebook vectors are learned from the training dataset by moving them closer when they are a good match and further away when they are a bad match.
- The codebook vectors are a compression of the training data to best separate the classes.
- Data preparation traditionally involves normalizing the input values to the range between 0 and 1.
Do you have any questions about this post or the LVQ algorithm? Leave a comment and ask your question and I will do my best to answer it.
Comments
Hi,
nice and readable post. If I am correct, then the update rule should use ‘learning_rate’ instead of the initial value ‘alpha’. In the current form ‘learning_rate’ is decreased but never applied. Or replace ‘learning_rate’ by ‘alpha’ and introduce something like ‘alpha0’ in the update formula of the learning rate.
Please delete or do not publish this post.
Best
aloo
Alpha is a fixed parameter and learning rate is what is used to update the weights.
The update rule of codebook vector x used the fixed parameter alpha rather than the adaptive learning_rate parameter.
Thanks aloo masala, I have updated the example.
You are welcome. Now you can delete my comments.
Kohonen has recently self-published a book on this topic: “MATLAB Implementations and Applications of the Self-Organizing Map”.
Although its examples use MATLAB, I think this may be a good option for self-study, since it’s free!
http://docs.unigrafia.fi/publications/kohonen_teuvo/
Cool, I had not seen this. Thanks Tom.
I’ve got a network trained with weights initialised (this is in Java). What’s the best way to feed in a new dataset/vector and find the best matching dataset for it?
Thanks,
Declan
Sorry, I don’t understand your question Declan. Perhaps you could restate it?
I have a network with training sets that have a vector and a name (classification name) in them. I’m trying to take a new vector, that has no name, and determine what it’s closest to.
I’m trying to make a prediction using a new dataset to predict what it’s most similar to.
You might find this post useful:
https://machinelearningmastery.com/implement-learning-vector-quantization-scratch-python/
I took a look at the post and have tried to implement something similar to your first test. It trains the vectors correctly, but I’m not sure how I can test a new dataset that hasn’t been classified against it?
Do people really use LVQ and locally weighted learning in practice, in 2017?
I like to throw LVQ into the mix when spot-checking algorithms on a problem and go with it if it produces good results or results similar to kNN.
Best Machine Learning Classification Algorithms using MATLAB tutorial
https://www.simpliv.com/developmenttool/machine-learning-classification-algorithms-using-matlab
Thanks for the link.
Hi there! Let’s assume that we are encoding to a collection of K codebook vectors (for example K=512). We train the system on our data and then want to use it.
Is it correct that the total number of possible outputs of the network is K? I.e., if I train the network on 100,000 128x128x3 images with K=512, after training ANY 128x128x3 image will get mapped to one of 512 outputs?
Not sure I follow. The output of the model is a summary of the BMUs.
What happens if the number of prototypes (codebook vectors) per class is equal to the number of data points we have?
In that case only use LVQ if it outperforms kNN. It may.
Hii,
Thanks for your incredible site.
Like you say in your post, with KNN you “need to hang on to your entire training dataset”.
I want to use LVQ, but can I use LVQ as an unsupervised algorithm (or for one-class classification)?
I have a dataset with normal behavior, and I want to predict when a new item is abnormal.
Thanks,
MAK
Perhaps try it.
Hii,
Can you give me a hint…
If all my codebook vectors are from the same class, how will LVQ work?
According to your explanation, LVQ can only be binary/multi-class.
10x
I do not have an example, sorry. Perhaps you can use a threshold on the distance between an example and the codebook vectors, as you would in a KNN model for one-class prediction.
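To make that suggestion concrete, here is one possible sketch of the threshold idea (the function name and threshold are assumptions; euclidean_distance is the helper from the earlier sketch):

```python
def is_abnormal(codebooks, new_row, threshold):
    # new_row contains input attributes only (no class label).
    # Flag the instance as abnormal if even the nearest codebook
    # vector is further away than the chosen distance threshold.
    nearest = min(euclidean_distance(c, new_row) for c in codebooks)
    return nearest > threshold
```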
Thank you so much.
I have a question:
After updating the codebook (i.e., x = x + learning_rate * (t - x)), some elements in the codebook may not belong to the training data.
How do we define the classes for those elements?
Classes are assigned to new instances by finding the BMU and using its class.
Thank you so much for the kindly response.
For each element in the updated codebook, if its class is determined by the class of its BMU, it is possible that the class of the element changes along the iterations, right?
If so, the final codebook may be class-imbalanced.
Generally no. You can try it if you like.
Hi Jason, thanks a lot for your great, readable summary! I have three quick follow-up questions and would be glad to hear your opinion on them. I am also happy about any pointers to other references if they help answer the questions.
1. Can LVQ output a probability instead of a class? E.g. if I have two classes (coded as 0 vs. 1), could LVQ output probabilities like 0.30 or 0.60 instead of just 0s and 1s? I am asking because I tried using an LVQ implementation in R but it seems like it can just provide the class itself as output with the lvqtest function (https://www.rdocumentation.org/packages/class/versions/7.3-14/topics/lvq3). So, I am wondering if this is just because of the implementation in this R package or if LVQ cannot provide probabilities in general.
2. Would you say that LVQ is a suitable algorithm for datasets with >= 100 predictor variables and >= 500,000 observations? I have the feeling that this is not the case, as I got stuck with LVQ in R (using the aforementioned implementation) with a dataset of this size. Also, it seems like LVQ still has similarities to KNN, which is definitely not suitable for datasets of this size.
3. You said that the input values should be normalised to a range between 0 and 1. Is it enough to just normalise the values to get a mean of 0 and a sd of 1? (Basically, I am asking if it is enough to stop after step 3 of the description by Stephen Joy in the first answer of this forum post: https://www.researchgate.net/post/How_do_i_normalize_data_from_0_to_1_range). If the LVQ implementation that I am using does not specify that values in the range between 0 and 1 are required, this should be okay right?
Thanks a lot in advance
Not directly or natively; you may be able to use methods from SVMs to convert the output into a probability.
LVQ like kNN will suffer from the curse of dimensionality. Lots of features may be problematic.
Standardizing is fine.
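For completeness, a minimal standardization sketch (z-scores; the function name and column layout are assumptions for illustration):

```python
def standardize_column(values):
    # Rescale a column of values to zero mean and unit standard deviation.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```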
Hi!
First thanks a lot for your nice article and very useful site in general 🙂 !
I’m implementing SOM (for clustering/visualisation) and LVQ (a sort of SOM classifier) in Python.
And I wonder if there exists any method to select/choose the number of neurons at the beginning.
For example, if we take the Iris dataset, we have three classes of different flowers.
So, if I’m understanding well (I’ve read lots of blogs and articles, including many of Kohonen’s), for an LVQ model we choose a number of neurons, let’s say 10, and give each a class. And the LVQ model is one-dimensional (the grid of neurons is flat, unlike in SOM, where it could be 2-D, 3-D…).
So for example, we give the first three neurons the class setosa, then versicolor to, let’s say, the 4th and 5th neurons, and finally the rest is virginica.
Then we initialise the vectors of the neurons, by randomness, or by taking, let’s say, a Gaussian with the mean and standard deviation of the class the neuron was attributed to.
Then we train the model.
Am I kind of correct? Because it seems weird to me that we need to attribute the class to the neurons BEFORE the training…
Some articles say that clustering models, like k-means, DBSCAN or even SOM, could be applied before, but I didn’t really understand how.
One person on a forum even says that to classify with a SOM model, we could create as many SOMs as we have classes, so with the Iris dataset, three SOMs. Then train them separately, each on only one class. So there would be one SOM for setosa, one for virginica…
Then to classify a new instance, we look at the BMU across all the SOMs and predict the class of the best one. So it’s kind of a Best Matching Map and no longer a Best Matching Unit… I tried it; results were similar to random forest on Iris, but with a bigger and more complicated dataset, they clearly were not…
I will put my “work”, implementations of SOM and LVQ, on GitHub, because I didn’t find any really good implementation yet… (though weirdly, for neural gas and growing neural gas, there is more stuff…)
But I would have liked to get some of your useful advice before, if you have time 😉
Thanks again!
I recommend testing many configurations and seeing what works best for your specific dataset; we cannot know the best configuration beforehand.
Generally, if kNN works well, LVQ works well. I’ve not had good experiences with SOM for supervised learning.
Also, this may be useful:
http://cleveralgorithms.com/nature-inspired/neural/som.html
Isn’t LVQ a self-organizing map? Here a data point is chosen at random from the training points for each class and that point is moved to the center of its class. Is my observation right?
No, the nodes are not connected as they are in a SOM.
Hi, I have a question. In the subtopic “Making Predictions with an LVQ Model”, what is meant by K? And why do we usually take K as 1? A response is highly appreciated!
K is the number of best matching units for a new example.
Can LVQ be extended for clustering approaches as well?
You mean kohonen map/SOM? It’s a different but related method.
How are target values used while training LVQ?
How are codebook vectors obtained?
What is the difference between LVQ and VQ? Which one of them is unsupervised and which one is supervised?
I think it is best to consider VQ as the result of kNN, while LVQ is how you get the VQ.
Hello.
What would be the best way to remove weight vectors that no longer win battles?
Is there a general rule for the starting point of selecting the number of codebooks? E.g. Based on the dimensionality of the data?
And is there a generalization for the trade off between tuning the number of codebooks and the learning rate?
Hi Aaron, the following resource may be of interest to you:
https://machinelearningmastery.com/implement-learning-vector-quantization-scratch-python/
Specifically for binary classification^^