Gentle Introduction to Vector Norms in Machine Learning

Calculating the length or magnitude of vectors is often required either directly as a regularization method in machine learning, or as part of broader vector or matrix operations.

In this tutorial, you will discover the different ways to calculate vector lengths or magnitudes, called the vector norm.

After completing this tutorial, you will know:

  • The L1 norm that is calculated as the sum of the absolute values of the vector.
  • The L2 norm that is calculated as the square root of the sum of the squared vector values.
  • The max norm that is calculated as the maximum absolute value of the vector.

Kick-start your project with my new book Linear Algebra for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Mar/2018: Fixed typo in max norm equation.
  • Update Sept/2018: Fixed typo related to the size of the vectors defined.
Photo by Cosimo, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Vector Norm
  2. Vector L1 Norm
  3. Vector L2 Norm
  4. Vector Max Norm

Need help with Linear Algebra for Machine Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Vector Norm

Calculating the size or length of a vector is often required either directly or as part of a broader vector or vector-matrix operation.

The length of the vector is referred to as the vector norm or the vector’s magnitude.

The length of a vector is a nonnegative number that describes the extent of the vector in space, and is sometimes referred to as the vector’s magnitude or the norm.

— Page 112, No Bullshit Guide To Linear Algebra, 2017

The length of the vector is always a positive number, except for a vector of all zero values, which has a length of zero. It is calculated using some measure that summarizes the distance of the vector from the origin of the vector space. For example, the origin of a vector space for a vector with 3 elements is (0, 0, 0).

A notation is used to represent the vector norm in broader calculations, and each type of vector norm has its own unique notation.

We will take a look at a few common vector norm calculations used in machine learning.

Vector L1 Norm

The length of a vector can be calculated using the L1 norm, where the 1 is a superscript of the L, e.g. L^1.

The notation for the L1 norm of a vector is ||v||1, where 1 is a subscript. This length is sometimes called the taxicab norm or the Manhattan norm.

The L1 norm is calculated as the sum of the absolute vector values, where the absolute value of a scalar uses the notation |a1|. In effect, the norm is a calculation of the Manhattan distance from the origin of the vector space.

The L1 norm of a vector can be calculated in NumPy using the norm() function with a parameter to specify the norm order, in this case 1.

First, a 1×3 vector is defined, then the L1 norm of the vector is calculated.
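A minimal sketch of this calculation, using the vector [1, 2, 3] for illustration:

```python
# calculate the L1 norm of a vector
from numpy import array
from numpy.linalg import norm

# define a 1x3 vector
a = array([1, 2, 3])
print(a)
# calculate the L1 norm: |1| + |2| + |3| = 6
l1 = norm(a, 1)
print(l1)
```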

Running the example first prints the defined vector and then the vector’s L1 norm.

The L1 norm is often used when fitting machine learning algorithms as a regularization method, e.g. a method to keep the coefficients of the model small, and in turn, the model less complex.

Vector L2 Norm

The length of a vector can be calculated using the L2 norm, where the 2 is a superscript of the L, e.g. L^2.

The notation for the L2 norm of a vector is ||v||2 where 2 is a subscript.

The L2 norm calculates the distance of the vector coordinate from the origin of the vector space. As such, it is also known as the Euclidean norm as it is calculated as the Euclidean distance from the origin. The result is a positive distance value.

The L2 norm is calculated as the square root of the sum of the squared vector values.

The L2 norm of a vector can be calculated in NumPy using the norm() function with default parameters.

First, a 1×3 vector is defined, then the L2 norm of the vector is calculated.
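A minimal sketch, again using the vector [1, 2, 3]; with no order parameter, NumPy's norm() function defaults to the L2 norm:

```python
# calculate the L2 norm of a vector
from numpy import array
from numpy.linalg import norm

# define a 1x3 vector
a = array([1, 2, 3])
print(a)
# calculate the L2 norm: sqrt(1^2 + 2^2 + 3^2) = sqrt(14)
l2 = norm(a)
print(l2)
```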

Running the example first prints the defined vector and then the vector’s L2 norm.

Like the L1 norm, the L2 norm is often used when fitting machine learning algorithms as a regularization method, e.g. a method to keep the coefficients of the model small and, in turn, the model less complex.

By far, the L2 norm is more commonly used than other vector norms in machine learning.

Vector Max Norm

The length of a vector can be calculated using the maximum norm, also called max norm.

The max norm of a vector is referred to as L^inf, where inf is a superscript that can be represented with the infinity symbol. The notation for the max norm is ||x||inf, where inf is a subscript.

The max norm is calculated as the maximum absolute value of the vector, hence the name.

The max norm of a vector can be calculated in NumPy using the norm() function with the order parameter set to inf.

First, a 1×3 vector is defined, then the max norm of the vector is calculated.
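A minimal sketch, once more using the vector [1, 2, 3]:

```python
# calculate the max norm of a vector
from numpy import inf, array
from numpy.linalg import norm

# define a 1x3 vector
a = array([1, 2, 3])
print(a)
# calculate the max norm: max(|1|, |2|, |3|) = 3
maxnorm = norm(a, inf)
print(maxnorm)
```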

Running the example first prints the defined vector and then the vector’s max norm.

The max norm is also used as a regularization method in machine learning, such as on neural network weights, called max norm regularization.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Create 5 examples of each operation using your own data.
  • Implement each vector norm manually for vectors defined as lists.
  • Search machine learning papers and find 1 example of each operation being used.

If you explore any of these extensions, I’d love to know.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

API

Articles

Summary

In this tutorial, you discovered the different ways to calculate vector lengths or magnitudes, called the vector norm.

Specifically, you learned:

  • The L1 norm that is calculated as the sum of the absolute values of the vector.
  • The L2 norm that is calculated as the square root of the sum of the squared vector values.
  • The max norm that is calculated as the maximum absolute value of the vector.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Linear Algebra for Machine Learning!

Linear Algebra for Machine Learning

Develop a working understanding of linear algebra

...by writing lines of code in Python

Discover how in my new Ebook:
Linear Algebra for Machine Learning

It provides self-study tutorials on topics like:
Vector Norms, Matrix Multiplication, Tensors, Eigendecomposition, SVD, PCA and much more...

Finally Understand the Mathematics of Data

Skip the Academics. Just Results.

See What's Inside

49 Responses to Gentle Introduction to Vector Norms in Machine Learning

  1. Hari February 13, 2018 at 11:27 pm #

    Hi Jason,

    I have a question: why are they called L1 and L2? Are there any more norms, like L3, L4, etc.?

    If so why are we only using L1/L2 norm in machine learning?

    Is this any way related to why we use squares of errors instead of taking absolute value of errors to minimize while optimizing?

    • Jason Brownlee February 14, 2018 at 8:22 am #

      I don’t know about the reasons for the names off the top of my head, sorry.

      Yes, there are nice mathematical properties for mse.

    • Daniel February 15, 2020 at 10:12 am #

      Hi Hari,

      The 0,1 and 2 norms are just the most used cases, but there is an infinite number.

      Formally, the l_p norm is defined as ||x||_p = (sum_i |x_i|^p)^(1/p), where p ∈ ℝ
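      This general formula can be sketched directly in plain Python (lp_norm is an illustrative name, not from the article):

```python
# general L^p norm: the p-th root of the sum of absolute element values
# raised to the p-th power
def lp_norm(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

# p=1 gives the L1 norm, p=2 gives the L2 norm
print(lp_norm([1, 2, 3], 1))  # 6.0
print(lp_norm([1, 2, 3], 2))  # sqrt(14)
```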

    • Erica June 5, 2020 at 10:35 am #

      L2 norm is named because you compute the sum of squares of the elements in your vector/matrix/tensor. L3 is the sum of cubes of individual elements, and so on and so forth. L1 is the sum of the absolute-value of the individual elements. They all are manifestations of L_p norm (which is computed from summing the individual elements each raised to the p-th power), as Daniel mentioned.

    • Jojocs July 27, 2021 at 11:09 pm #

      I think this can be more detailed like providing actual formula.

      like how
      L1 is actually the summation {|x1|^p + |x2|^p + |x3|^p … + |xn|^p}^(1/p) when p = 1.

  2. Russell Bigley February 16, 2018 at 3:49 am #

    just a couple of suggestions for clarity.

    While writing about the L1 norm, this line doesn’t seem necessary
    “The L2 norm of a vector can be calculated in NumPy using the norm() function with a parameter to specify the norm order, in this case 1.”

    Also, even though it's not something I would do while programming in the real world, the 'l' in l1, l2 might be better represented with capital letters L1, L2 in the Python programming examples.

  3. Russell Bigley February 16, 2018 at 8:56 am #

    The calculation for max norm isn’t explained.

    Is it taking the vector points [1, 0, 0], [0, 2, 0], and [0, 0, 3] and finding the largest vector of the sparse vectors?

  4. Jeza May 10, 2018 at 9:01 pm #

    Thanks for your explanation,
    My question is how to calculate quasi-norm such as L(0.5)

  5. udaya July 17, 2018 at 7:20 pm #

    Different ways of finding the vector norm (the length or magnitude of the vector) are L1, L2, and L inf. Shouldn't the norms of the same vector be the same?

    • Jason Brownlee July 18, 2018 at 6:32 am #

      No, there are many ways of calculating the length.

      • udaya July 19, 2018 at 7:34 pm #

        So how can we find the components of a vector from its magnitude and direction? Normally we use euclidean function in that case. I am confused.

  6. udaya July 24, 2018 at 10:36 pm #

    I got my confusion cleared. Thank you

  7. Saurabh Sharma August 10, 2018 at 12:37 am #

    Just wondering! why do we need to convert vectors to unit norm in ML? what is the reason behind this? Also, I was looking at an example of preprocessing in a stock movement data-set and the author used preprocessing.Normalizer(norm='l2'). Any particular reason behind this? Does it have anything to do with the sparsity of the data? Sorry for too many questions.

    • Jason Brownlee August 10, 2018 at 6:19 am #

      We do this to keep the values in the vector small when learning (optimizing) a machine learning model, which in turn reduces the complexity of the model and results in a better model (better generalization).

  8. tim September 8, 2018 at 1:55 am #

    The text says ‘a 3×3 vector is defined’ but your code is defining a 1×3 vector: [1,2,3]. Can you correct your text?

  9. Chris September 30, 2018 at 6:45 am #

    Awesome article. Love this site.

  10. Efstathios Chatzikyriakidis December 1, 2018 at 2:34 pm #

    How can I calculate the L1 and L2 norms for 3D matrixes?

    e.g:

    input_shape = (10, 20, 3)

    a = np.ones(input_shape) * 2
    b = np.ones(input_shape) * 4

    x = a - b

    l1_norm_of_x = ????
    l2_norm_of_x = ????

  11. LikeToStay AnonyMous January 14, 2019 at 4:32 am #

    Is there any thumb rule to decide which distance metric to use for a problem ?

    • Jason Brownlee January 14, 2019 at 5:32 am #

      Yes, I have seen some. Mostly it comes down to your preferred outcome – e.g. what you want to capture/handle/promote in the measure.

  12. Mohammed Sabry January 23, 2019 at 1:49 am #

    I read that the L1 norm is better than L2 at capturing small changes in a model's coefficients, and that L2 increases very slowly near the origin, and I didn't understand why?

    • Jason Brownlee January 23, 2019 at 8:48 am #

      Perhaps ask the person that made this statement to you to see exactly what they meant?

    • Paul Gavrikov January 20, 2022 at 10:58 pm #

      Because for any positive x < 1 you will see x^2 (L2) < x

  13. Manas March 5, 2019 at 5:47 pm #

    I’ve clearly understood the norms but want to know the behind-the-scenes use of them in machine learning and neural networks. Can you please explain how they are used in normalization (in depth)?
    Thank you in advance.

  14. Ana April 18, 2019 at 2:40 pm #

    Hi Jason,

    I was wondering, is the L2 norm like the hypotenuse?
    And are you using MATLAB for the operation windows you are posting on this page?

  15. John February 7, 2020 at 9:16 pm #

    my solution to the exercise above. Great article as always

  16. Jack June 9, 2020 at 9:55 pm #

    Hello I have a sparse matrix with me of size 4*9 after applying Fit and Transform function ( I am newbie in ML), now I need to implement L2 norm on above matrix but when I try to use your method it doesn’t work as desired, the output is (for top row without L2 norm)
    (0, 3) 1
    (0, 6) 1
    (0, 8) 1
    (0, 2) 1
    but it should be like (0, 8) 0.38408524091481483
    (0, 6) 0.38408524091481483
    (0, 3) 0.38408524091481483
    (0, 2) 0.5802858236844359

    What wrong am I doing here? and how should I solve this problem for my matrix?
    Below is the dense matrix for reference:
    [[0 1 1 1 0 0 1 0 1]
    [0 2 0 1 0 1 1 0 1]
    [1 0 0 1 1 0 1 1 1]
    [0 1 1 1 0 0 1 0 1]]

  17. abid July 21, 2021 at 3:04 pm #

    Hello sir,

    I would like to know whether someone can use the vector max norm in a deep hashing loss function? Some researchers have used the L2 norm in their loss functions. Thanks

  18. Erfan January 3, 2022 at 5:51 pm #

    ||W|| = 1.
    what does it mean????

  19. Kartik February 13, 2022 at 7:12 pm #

    Do the vectors need to be unit vector to use L1/L2 norm?
    If yes then why is it so?

  20. Ben February 28, 2022 at 3:43 pm #

    Hi Jason, love your blog! I've begun playing around with ML in C++. Regarding L1 and L2 normalisation, are these values just scaled (alpha and beta) and applied in the gradient descent phase of the algorithm? I've tried code as below but it only converges on a solution when alpha and beta equal 0.0. If I do have the norms in the right place, what size do alpha and beta typically take? Cheers, Ben.

    W[i][j] -= learning_rate * dW[i][j] - alpha*L1_norm - beta*L2_norm;

  21. Mathias March 4, 2022 at 9:21 am #

    This website is pure gold when you’re trying to learn about neural networks, thank you guys for really helping me out!

    • James Carmichael March 4, 2022 at 2:23 pm #

      Great feedback Mathias!

  22. Vaishali June 28, 2022 at 4:11 pm #

    Hi Jason. I understood what the L1 norm and L2 norm are using this article. I want to know: what is the L2,1-norm?

  23. spike November 19, 2022 at 7:02 am #

    Although this is a question that is unrelated to this article, I would appreciate it if you could answer it. when to use (;) in describing a specific probability in the context of mixture models? thank you.
