In a previous post, we introduced the method of Lagrange multipliers to find local minima or local maxima of a function with equality constraints. The same method can be applied to those with inequality constraints as well.

In this tutorial, you will discover the method of Lagrange multipliers applied to find the local minimum or maximum of a function when inequality constraints are present, optionally together with equality constraints.

After completing this tutorial, you will know

- How to find points of local maximum or minimum of a function with equality constraints
- Method of Lagrange multipliers with equality constraints

Let’s get started.

## Prerequisites

For this tutorial, we assume that you already have reviewed:

- Derivative of functions
- Function of several variables, partial derivatives and gradient vectors
- A gentle introduction to optimization
- Gradient descent

as well as

You can review these concepts by clicking on the links above.

## Constrained Optimization and Lagrangians

Extending from our previous post, a constrained optimization problem can be generally considered as

$$

\begin{aligned}

\min && f(X) \\

\textrm{subject to} && g(X) &= 0 \\

&& h(X) &\ge 0 \\

&& k(X) &\le 0

\end{aligned}

$$

where $X$ is a scalar or vector values. Here, $g(X)=0$ is the equality constraint, and $h(X)\ge 0$, $k(X)\le 0$ are inequality constraints. Note that we always use $\ge$ and $\le$ rather than $\gt$ and $\lt$ in optimization problems because the former defined a **closed set** in mathematics from where we should look for the value of $X$. These can be many constraints of each type in an optimization problem.

The equality constraints are easy to handle but the inequality constraints are not. Therefore, one way to make it easier to tackle is to convert the inequalities into equalities, by introducing **slack variables**:

$$

\begin{aligned}

\min && f(X) \\

\textrm{subject to} && g(X) &= 0 \\

&& h(X) – s^2 &= 0 \\

&& k(X) + t^2 &= 0

\end{aligned}

$$

When something is negative, adding a certain positive quantity into it will make it equal to zero, and vice versa. That quantity is the slack variable; the $s^2$ and $t^2$ above are examples. We deliberately put $s^2$ and $t^2$ terms there to denote that they must not be negative.

With the slack variables introduced, we can use the Lagrange multipliers approach to solve it, in which the Lagrangian is defined as:

$$

L(X, \lambda, \theta, \phi) = f(X) – \lambda g(X) – \theta (h(X)-s^2) + \phi (k(X)+t^2)

$$

It is useful to know that, for the optimal solution $X^*$ to the problem, the inequality constraints are either having the equality holds (which the slack variable is zero), or not. For those inequality constraints with their equality hold are called the active constraints. Otherwise, the inactive constraints. In this sense, you can consider that the equality constraints are always active.

## The Complementary Slackness Condition

The reason we need to know whether a constraint is active or not is because of the Krush-Kuhn-Tucker (KKT) conditions. Precisely, the KKT conditions describe what happens when $X^*$ is the optimal solution to a constrained optimization problem:

- The gradient of the Lagrangian function is zero
- All constraints are satisfied
- The inequality constraints satisfied complementary slackness condition

The most important of them is the complementary slackness condition. While we learned that optimization problem with equality constraint can be solved using Lagrange multiplier which the gradient of the Lagrangian is zero at the optimal solution, the complementary slackness condition extends this to the case of inequality constraint by saying that at the optimal solution $X^*$, either the Lagrange multiplier is zero or the corresponding inequality constraint is active.

The use of complementary slackness condition is to help us explore different cases in solving the optimization problem. It is the best to be explained with an example.

## Example 1: Mean-variance portfolio optimization

This is an example from finance. If we have 1 dollar and were to engage in two different investments, in which their return is modeled as a bi-variate Gaussian distribution. How much should we invest in each to minimize the overall variance in return?

This optimization problem, also known as Markowitz mean-variance portfolio optimization, is formulated as:

$$

\begin{aligned}

\min && f(w_1, w_2) &= w_1^2\sigma_1^2+w_2^2\sigma_2^2+2w_1w_2\sigma_{12} \\

\textrm{subject to} && w_1+w_2 &= 1 \\

&& w_1 &\ge 0 \\

&& w_1 &\le 1

\end{aligned}

$$

which the last two are to bound the weight of each investment to between 0 and 1 dollar. Let’s assume $\sigma_1^2=0.25$, $\sigma_2^2=0.10$, $\sigma_{12} = 0.15$ Then the Lagrangian function is defined as:

$$

\begin{aligned}

L(w_1,w_2,\lambda,\theta,\phi) =& 0.25w_1^2+0.1w_2^2+0.3w_1w_2 \\

&- \lambda(w_1+w_2-1) \\

&- \theta(w_1-s^2) – \phi(w_1-1+t^2)

\end{aligned}

$$

and we have the gradients:

$$

\begin{aligned}

\frac{\partial L}{\partial w_1} &= 0.5w_1+0.3w_2-\lambda-\theta-\phi \\

\frac{\partial L}{\partial w_2} &= 0.2w_2+0.3w_1-\lambda \\

\frac{\partial L}{\partial\lambda} &= 1-w_1-w_2 \\

\frac{\partial L}{\partial\theta} &= s^2-w_1 \\

\frac{\partial L}{\partial\phi} &= 1-w_1-t^2

\end{aligned}

$$

From this point onward, the complementary slackness condition have to be considered. We have two slack variables $s$ and $t$ and the corresponding Lagrange multipliers are $\theta$ and $\phi$. We now have to consider whether a slack variable is zero (which the corresponding inequality constraint is active) or the Lagrange multiplier is zero (the constraint is inactive). There are four possible cases:

- $\theta=\phi=0$ and $s^2>0$, $t^2>0$
- $\theta\ne 0$ but $\phi=0$, and $s^2=0$, $t^2>0$
- $\theta=0$ but $\phi\ne 0$, and $s^2>0$, $t^2=0$
- $\theta\ne 0$ and $\phi\ne 0$, and $s^2=t^2=0$

For case 1, using $\partial L/\partial\lambda=0$, $\partial L/\partial w_1=0$ and $\partial L/\partial w_2=0$ we get

$$

\begin{align}

w_2 &= 1-w_1 \\

0.5w_1 + 0.3w_2 &= \lambda \\

0.3w_1 + 0.2w_2 &= \lambda

\end{align}

$$

which we get $w_1=-1$, $w_2=2$, $\lambda=0.1$. But with $\partial L/\partial\theta=0$, we get $s^2=-1$, which we cannot find a solution ($s^2$ cannot be negative). Thus this case is infeasible.

For case 2, with $\partial L/\partial\theta=0$ we get $w_1=0$. Hence from $\partial L/\partial\lambda=0$, we know $w_2=1$. And with $\partial L/\partial w_2=0$, we found $\lambda=0.2$ and from $\partial L/\partial w_1$ we get $\phi=0.1$. In this case, the objective function is 0.1

For case 3, with $\partial L/\partial\phi=0$ we get $w_1=1$. Hence from $\partial L/\partial\lambda=0$, we know $w_2=0$. And with $\partial L/\partial w_2=0$, we get $\lambda=0.3$ and from $\partial L/\partial w_1$ we get $\theta=0.2$. In this case, the objective function is 0.25

For case 4, we get $w_1=0$ from $\partial L/\partial\theta=0$ but $w_1=1$ from $\partial L/\partial\phi=0$. Hence this case is infeasible.

Comparing the objective function from case 2 and case 3, we see that the value from case 2 is lower. Hence that is taken as our solution to the optimization problem, with the optimal solution attained at $w_1=0$, $w_2=1$.

As an exercise, you can retry the above with $\sigma_{12}=-0.15$. The solution would be 0.0038 attained when $w_1=\frac{5}{13}$, with the two inequality constraints inactive.

### Want to Get Started With Calculus for Machine Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Example 2: Water-filling algorithm

This is an example from communication engineering. If we have a channel (say, a wireless bandwidth) in which the noise power is $N$ and the signal power is $S$, the channel capacity (in terms of bits per second) is proportional to $\log_2(1+S/N)$. If we have $k$ similar channels, each has its own noise and signal level, the total capacity of all channels is the sum $\sum_i \log_2(1+S_i/N_i)$.

Assume we are using a battery that can give only 1 watt of power and this power have to distribute to the $k$ channels (denoted as $p_1,\cdots,p_k$). Each channel may have different attenuation so at the end, the signal power is discounted by a gain $g_i$ for each channel. Then the maximum total capacity we can achieve by using these $k$ channels is formulated as an optimization problem

$$

\begin{aligned}

\max && f(p_1,\cdots,p_k) &= \sum_{i=1}^k \log_2\left(1+\frac{g_ip_i}{n_i}\right) \\

\textrm{subject to} && \sum_{i=1}^k p_i &= 1 \\

&& p_1,\cdots,p_k &\ge 0 \\

\end{aligned}

$$

For convenience of differentiation, we notice $\log_2x=\log x/\log 2$ and $\log(1+g_ip_i/n_i)=\log(n_i+g_ip_i)-\log(n_i)$, hence the objective function can be replaced with

$$

f(p_1,\cdots,p_k) = \sum_{i=1}^k \log(n_i+g_ip_i)

$$

Assume we have $k=3$ channels, each has noise level of 1.0, 0.9, 1.0 respectively, and the channel gain is 0.9, 0.8, 0.7, then the optimization problem is

$$

\begin{aligned}

\max && f(p_1,p_2,p_k) &= \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3)\\

\textrm{subject to} && p_1+p_2+p_3 &= 1 \\

&& p_1,p_2,p_3 &\ge 0

\end{aligned}

$$

We have three inequality constraints here. The Lagrangian function is defined as

$$

\begin{aligned}

& L(p_1,p_2,p_3,\lambda,\theta_1,\theta_2,\theta_3) \\

=\ & \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3) \\

& – \lambda(p_1+p_2+p_3-1) \\

& – \theta_1(p_1-s_1^2) – \theta_2(p_2-s_2^2) – \theta_3(p_3-s_3^2)

\end{aligned}

$$

The gradient is therefore

$$

\begin{aligned}

\frac{\partial L}{\partial p_1} & = \frac{0.9}{1+0.9p_1}-\lambda-\theta_1 \\

\frac{\partial L}{\partial p_2} & = \frac{0.8}{0.9+0.8p_2}-\lambda-\theta_2 \\

\frac{\partial L}{\partial p_3} & = \frac{0.7}{1+0.7p_3}-\lambda-\theta_3 \\

\frac{\partial L}{\partial\lambda} &= 1-p_1-p_2-p_3 \\

\frac{\partial L}{\partial\theta_1} &= s_1^2-p_1 \\

\frac{\partial L}{\partial\theta_2} &= s_2^2-p_2 \\

\frac{\partial L}{\partial\theta_3} &= s_3^2-p_3 \\

\end{aligned}

$$

But now we have 3 slack variables and we have to consider 8 cases:

- $\theta_1=\theta_2=\theta_3=0$, hence none of $s_1^2,s_2^2,s_3^2$ are zero
- $\theta_1=\theta_2=0$ but $\theta_3\ne 0$, hence only $s_3^2=0$
- $\theta_1=\theta_3=0$ but $\theta_2\ne 0$, hence only $s_2^2=0$
- $\theta_2=\theta_3=0$ but $\theta_1\ne 0$, hence only $s_1^2=0$
- $\theta_1=0$ but $\theta_2,\theta_3$ non-zero, hence only $s_2^2=s_3^2=0$
- $\theta_2=0$ but $\theta_1,\theta_3$ non-zero, hence only $s_1^2=s_3^2=0$
- $\theta_3=0$ but $\theta_1,\theta_2$ non-zero, hence only $s_1^2=s_2^2=0$
- all of $\theta_1,\theta_2,\theta_3$ are non-zero, hence $s_1^2=s_2^2=s_3^2=0$

Immediately we can tell case 8 is infeasible since from $\partial L/\partial\theta_i=0$ we can make $p_1=p_2=p_3=0$ but it cannot make $\partial L/\partial\lambda=0$.

For case 1, we have

$$

\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\frac{0.7}{1+0.7p_3}=\lambda

$$

from $\partial L/\partial p_1=\partial L/\partial p_2=\partial L/\partial p_3=0$. Together with $p_3=1-p_1-p_2$ from $\partial L/\partial\lambda=0$, we found the solution to be $p_1=0.444$, $p_2=0.430$, $p_3=0.126$, and the objective function $f(p_1,p_2,p_3)=0.639$.

For case 2, we have $p_3=0$ from $\partial L/\partial\theta_3=0$. Further, using $p_2=1-p_1$ from $\partial L/\partial\lambda=0$, and

$$

\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\lambda

$$

from $\partial L/\partial p_1=\partial L/\partial p_2=0$, we can solve for $p_1=0.507$ and $p_2=0.493$. The objective function $f(p_1,p_2,p_3)=0.634$.

Similarly in case 3, $p_2=0$ and we solved $p_1=0.659$ and $p_3=0.341$, with the objective function $f(p_1,p_2,p_3)=0.574$.

In case 4, we have $p_1=0$, $p_2=0.652$, $p_3=0.348$, and the objective function $f(p_1,p_2,p_3)=0.570$.

Case 5 we have $p_2=p_3=0$ and hence $p_3=1$. Thus we have the objective function $f(p_1,p_2,p_3)=0.0.536$.

Similarly in case 6 and case 7, we have $p_2=1$ and $p_1=1$ respectively. The objective function attained 0.531 and 0.425 respectively.

Comparing all these cases, we found that the maximum value that the objective function attained is in case 1. Hence the solution to this optimization problem is

$p_1=0.444$, $p_2=0.430$, $p_3=0.126$, with $f(p_1,p_2,p_3)=0.639$.

## Extensions and Further Reading

While in the above example, we introduced the slack variables into the Lagrangian function, some books may prefer not to add the slack variables but to limit the Lagrange multipliers for inequality constraints as positive. In that case you may see the Lagrangian function written as

$$

L(X, \lambda, \theta, \phi) = f(X) – \lambda g(X) – \theta h(X) + \phi k(X)

$$

but requires $\theta\ge 0;\phi\ge 0$.

The Lagrangian function is also useful to apply to primal-dual approach for finding the maximum or minimum. This is particularly helpful if the objectives or constraints are non-linear, which the solution may not be easily found.

Some books that covers this topic are:

- Convex Optimization by Stephen Boyd and Lieven Vandenberghe, 2004
- Chapter 4 of Deep Learning by Ian Goodfellow et al, 2016

## Summary

In this tutorial, you discovered how the method of Lagrange multipliers can be applied to inequality constraints. Specifically, you learned:

- Lagrange multipliers and the Lagrange function in presence of inequality constraints
- How to use KKT conditions to solve an optimization problem when inequality constraints are given

I have a question on the four listed cases in Example 1.

Shouldn’t case 2 be s^2 > 0 and t^2 = 0?

Shouldn’t case 3 be s^2 = 0 and t^2 > 0?

Since either the slack variable OR the Lagrange multiplier should be 0, instead of both/neither of them?

Complementary slackness condition means either the multiplier or the slack variable is zero. Both of them zero are also OK but that happens by chance. But you’re right for the s^2 and t^2. I corrected these. Thank you.

In your Lagrangian function some time you used: f(x) + \lambda g(x) and other times f(x) -\lambda g(x)

is there any rule to follow?

Hi Learner…The following may be of interest to you.

https://jonathan-hui.medium.com/machine-learning-lagrange-multiplier-dual-decomposition-4afe66158c9

This is a fantastically helpful article thank you. The English could do with some polishing, if you want I am happy to edit it for you.

Thank you for your feedback, support and recommendations Georges!

Nice tutorial. I love it it refreshes me. Thank you

Thank you Femi for your support and kind feedback! We greatly appreciate it!

I am a bit concerned about settings where the optimal solution doesn’t have an active constraint, but the solution to the unconstrained optimization does not respect the constraint.

For example, imagine the function equal to e^-(x-1)^2+2*e^-(x+1)^2. If I wanted to optimize this function under the constraint x>=0, the solution is x=1. Yet, using the approach described, if I remove the constraint, I get the unfeasible solution x=-1, if I make it active, I get the suboptimal x=0.

Am I misunderstanding something?

Nevermind, I just realized, you have to check for all saddle points for the inactive constraint, so you would find both x=-1 and x=1, you then eliminate x=-1 because it’s unfeasible.

Thank you for you following up on your original question with an update Yliaster! Let us know if we can further clarify any of the concepts.