#### Optimization for Machine Learning Crash Course.

Find function optima with Python in 7 days.

All machine learning models involve optimization. As practitioners, we optimize for the most suitable hyperparameters or the best subset of features. A decision tree algorithm optimizes for the split. A neural network optimizes for the weights. Most likely, we use computational algorithms to optimize.

There are many ways to optimize numerically. SciPy has a number of functions handy for this. We can also try to implement the optimization algorithms on our own.

In this crash course, you will discover how you can get started and confidently run algorithms to optimize a function with Python in seven days.

This is a big and important post. You might want to bookmark it.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

## Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Perhaps you have built some models and done some projects end-to-end, or modified existing example code from popular tools to solve your own problem.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You have heard about gradient descent, simulated annealing, BFGS, or some other optimization algorithms and want to deepen your understanding.

You do NOT need to be:

- A math wiz!
- A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can effectively and competently apply function optimization algorithms.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

## Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with optimization in Python:

- **Lesson 01**: Why optimize?
- **Lesson 02**: Grid search
- **Lesson 03**: Optimization algorithms in SciPy
- **Lesson 04**: BFGS algorithm
- **Lesson 05**: Hill-climbing algorithm
- **Lesson 06**: Simulated annealing
- **Lesson 07**: Gradient descent

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions, and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help with and about the algorithms and the best-of-breed tools in Python. (**Hint**: *I have all of the answers on this blog; use the search box*.)

**Post your results in the comments**; I’ll cheer you on!

Hang in there; don’t give up.

## Lesson 01: Why optimize?

In this lesson, you will discover why and when we want to do optimization.

Machine learning differs from other kinds of software projects in that it is less obvious how we should write the program. A toy example in programming is to write a for loop that prints the numbers from 1 to 100. You know exactly that you need a variable to count and that the loop should run for 100 iterations. A toy example in machine learning is to use a neural network for regression, but you have no idea exactly how many iterations you need to train the model. You might set the number too low or too high, and there is no rule that tells you the right number. Hence many people consider machine learning models a **black box**. The consequence is that, while the model has many variables that we can tune (the hyperparameters, for example), we do not know what the correct values should be until we test them.

In this lesson, you will discover why machine learning practitioners should study optimization to improve their skills and capabilities. In mathematics, optimization is also called function optimization; it aims to locate the maximum or minimum value of a certain **function**. Different methods apply depending on the nature of the function.

Machine learning is about developing predictive models. To judge whether one model is better than another, we use evaluation metrics that measure a model’s performance on a particular data set. In this sense, if we consider the parameters that created the model as the input, the inner algorithm of the model and the data set concerned as constants, and the metric evaluated from the model as the output, then we have constructed a function.

Take a decision tree as an example. We know it is a binary tree because every intermediate node asks a yes-no question. This is constant and we cannot change it. But how deep the tree should be is a hyperparameter that we can control. Which features, and how many of them, we allow the decision tree to use is another. A different value for these hyperparameters will change the decision tree model, which in turn gives a different metric, such as the average accuracy from k-fold cross-validation in classification problems. We then have a function that takes the hyperparameters as input and gives the accuracy as output.
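To make the idea concrete, here is a minimal sketch of a function that maps a hyperparameter to a metric. It is only an illustration and not part of the lesson's code: instead of the depth of a decision tree, it uses the degree of a polynomial fitted with NumPy as a stand-in hyperparameter, and the data set and the name `validation_error` are made up for this example.

```python
import numpy as np

# hypothetical data set: noisy samples of a cubic curve (fixed seed for repeatability)
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = x**3 - 0.5 * x + rng.normal(scale=0.05, size=x.size)

# hold out every third sample for validation
val_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

def validation_error(degree):
    """Hyperparameter in, metric out: the data and the algorithm stay constant."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_val)
    return float(np.mean((pred - y_val) ** 2))

# evaluate the function at a few hyperparameter values
errors = {degree: validation_error(degree) for degree in range(1, 7)}
for degree, err in errors.items():
    print('degree=%d validation MSE=%.5f' % (degree, err))
best_degree = min(errors, key=errors.get)
print('Best degree: %d' % best_degree)
```

Nothing here requires `validation_error` to have a formula we can write down. We can only call it and look at the output, which is exactly the situation the optimization algorithms in the following lessons are designed for.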

From the perspective of the decision tree library, once you have provided the hyperparameters and the training data, it can consider them as constants and treat the selection of features and the split thresholds at every node as the input. The metric is still the output here because the decision tree library shares the same goal of making the best prediction. Therefore, the library also has a function defined, but a different one from the function mentioned above.

The **function** here does not mean you need to explicitly define a function in the programming language. A conceptual one suffices. What we want to do next is to manipulate the input and check the output until we find that the best output is achieved. In the case of machine learning, the best can mean

- Highest accuracy, or precision, or recall
- Largest AUC of ROC
- Greatest F1 score in classification or R^{2} score in regression
- Least error, or log-loss

or something else along these lines. We can manipulate the input by random methods such as sampling or random perturbation. We can also assume the function has certain properties and try out a sequence of inputs to exploit these properties. Of course, we can also check every possible input; once we have exhausted the possibilities, we will know the best answer.

These are the basics of why we want to do optimization, what it is about, and how we can do it. You may not notice it, but training a machine learning model is doing optimization. You may also explicitly perform optimization to select features or fine-tune hyperparameters. As you can see, optimization is useful in machine learning.

### Your Task

For this lesson, you must find a machine learning model and list three examples where optimization might be used or might help in training and using the model. These may be related to some of the reasons above, or they may be your own personal motivations.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to perform grid search on an arbitrary function.

## Lesson 02: Grid search

In this lesson, you will discover grid search for optimization.

Let’s start with this function:

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

This is a function with two-dimensional input (*x*, *y*) and one-dimensional output. What can we do to find the minimum of this function? In other words, for what *x* and *y* do we get the least *f* (*x*, *y*)?

Without looking at what *f* (*x*, *y*) is, we can first assume that *x* and *y* lie in some bounded region, say, from -5 to +5. Then we can check every combination of *x* and *y* in this range. If we record the value of *f* (*x*, *y*) and keep track of the least one seen so far, then we will have found the minimum once we have exhausted the region. In Python code, it looks like this:

```python
from numpy import arange, inf

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain
sample = list()
step = 0.1
for x in arange(r_min, r_max+step, step):
    for y in arange(r_min, r_max+step, step):
        sample.append([x,y])
# evaluate the sample
best_eval = inf
best_x, best_y = None, None
for x,y in sample:
    eval = objective(x,y)
    if eval < best_eval:
        best_x = x
        best_y = y
        best_eval = eval
# summarize best solution
print('Best: f(%.5f,%.5f) = %.5f' % (best_x, best_y, best_eval))
```

This code scans from the lower bound of the range, -5, to the upper bound, +5, in steps of 0.1. The range is the same for both *x* and *y*. This creates a large number of (*x*, *y*) samples, formed from the combinations of *x* and *y* over the range. If we plot their coordinates on graph paper, they form a grid, hence the name grid search.

With the grid of samples, we then evaluate the objective function *f* (*x*, *y*) for every sample (*x*, *y*). We keep track of the values and remember the least one seen so far. Once we have exhausted the samples on the grid, we report the least value found as the result of the optimization.
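The same procedure can be written for any number of dimensions. The sketch below is one possible way to do that using `itertools.product` from the standard library; the helper `grid_search` and the objective `shifted_bowl` are names invented for this illustration, and `linspace` is used in place of `arange` so that the number of grid values per dimension is explicit.

```python
from itertools import product
import numpy as np

def grid_search(objective, axes):
    """Evaluate objective at every combination of the per-dimension grid values."""
    best_point, best_eval = None, np.inf
    n_evals = 0
    for point in product(*axes):
        value = objective(*point)
        n_evals += 1
        if value < best_eval:
            best_point, best_eval = point, value
    return best_point, best_eval, n_evals

# hypothetical objective with its minimum at (1, -2)
def shifted_bowl(x, y):
    return (x - 1.0)**2 + (y + 2.0)**2

# 101 evenly spaced values per dimension, i.e. a spacing of 0.1 on [-5, 5]
axis = np.linspace(-5.0, 5.0, 101)
best_point, best_eval, n_evals = grid_search(shifted_bowl, [axis, axis])
print('Best: f(%.5f,%.5f) = %.5f after %d evaluations'
      % (best_point[0], best_point[1], best_eval, n_evals))
# the same spacing in three dimensions would need 101**3 evaluations
print('3D grid size: %d' % len(axis)**3)
```

The evaluation count is the number of grid values per dimension raised to the number of dimensions, so the cost of grid search grows very quickly as inputs are added.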

### Your Task

For this lesson, you should look up how to use the numpy.meshgrid() function and rewrite the example code. Then you can try replacing the objective function with *f* (*x*, *y*, *z*) = (*x* – *y* + 1)^{2} + *z*^{2}, which is a function with 3D input.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will learn how to use SciPy to optimize a function.

## Lesson 03: Optimization algorithms in SciPy

In this lesson, you will discover how you can make use of SciPy to optimize your function.

There are a lot of optimization algorithms in the literature. Each has its strengths and weaknesses, and each is good for a different kind of situation. Reusing the same function we introduced in the previous lesson,

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

we can make use of some predefined algorithms in SciPy to find its minimum. Probably the easiest is the Nelder-Mead algorithm. This algorithm is based on a series of rules that determine how to explore the surface of the function. Without going into detail, we can simply call SciPy and apply the Nelder-Mead algorithm to find a function’s minimum:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```

In the code above, we need to write our function with a single vector argument. Hence the function effectively becomes

*f* (*x*[0], *x*[1]) = (*x*[0])^{2} + (*x*[1])^{2}

The Nelder-Mead algorithm needs a starting point. We choose a random point in the range of -5 to +5 for that (rand(2) is NumPy’s way of generating a random pair of coordinates between 0 and 1). The function minimize() returns an OptimizeResult object, which contains information about the result that is accessible via keys. The “message” key provides a human-readable message about the success or failure of the search, and the “nfev” key gives the number of function evaluations performed in the course of the optimization. The most important one is the “x” key, which specifies the input values that attained the minimum.
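As a small illustration of reading the result, the sketch below repeats the search from a fixed starting point (an arbitrary choice, used only so that the run is repeatable) and prints a few more fields. SciPy's OptimizeResult is a dictionary subclass, so the same fields can be read either as keys or as attributes.

```python
from scipy.optimize import minimize

def objective(x):
    return x[0]**2.0 + x[1]**2.0

# fixed starting point (an arbitrary choice) so the run is repeatable
result = minimize(objective, [3.0, -4.0], method='nelder-mead')

# keys and attributes are interchangeable on an OptimizeResult
print('Success      : %s' % result.success)
print('Message      : %s' % result['message'])
print('Evaluations  : %d' % result.nfev)
print('Iterations   : %d' % result.nit)
print('Minimum value: %.8f' % result.fun)
print('Solution     : %s' % result.x)
```

Checking `success` before trusting `x` is a reasonable habit, since the search can also stop because it ran out of iterations.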

The Nelder-Mead algorithm works well for **convex functions**, whose shape is smooth and like a basin. For more complex functions, the algorithm may get stuck at a **local optimum** and fail to find the real global optimum.

### Your Task

For this lesson, you should replace the objective function in the example code above with the following:

```python
from numpy import e, pi, cos, sqrt, exp

def objective(v):
    x, y = v
    return ( -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))
             - exp(0.5 * (cos(2*pi*x)+cos(2*pi*y))) + e + 20 )
```

This defines the Ackley function. The global minimum is at v=[0,0]. However, Nelder-Mead most likely cannot find it because this function has many local minima. Try running your code a few times and observe the output. You should get a different output each time you run the program.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will learn how to use the same SciPy function to apply a different optimization algorithm.

## Lesson 04: BFGS algorithm

In this lesson, you will discover how you can make use of SciPy to apply the BFGS algorithm to optimize your function.

As we have seen in the previous lesson, we can make use of the minimize() function from scipy.optimize to optimize a function using the Nelder-Mead algorithm. This is a simple “pattern search” algorithm that does not need to know the derivatives of a function.

A first-order derivative is the result of differentiating the objective function once. Similarly, the second-order derivative is the result of differentiating the first-order derivative one more time. If we have the second-order derivative of the objective function, we can apply Newton’s method to find its optimum.

There is another class of optimization algorithms that approximate the second-order derivative from the first-order derivative and use the approximation to optimize the objective function. They are called **quasi-Newton methods**. BFGS is the most famous one in this class.
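As a brief illustration of Newton’s method, the sketch below minimizes a one-dimensional function. The function *f* (*x*) = e^{x} – 2*x* is chosen only for this example (it does not appear elsewhere in the course); its first-order derivative is e^{x} – 2, its second-order derivative is e^{x}, and its minimum is at *x* = ln 2.

```python
from math import exp, log

# hypothetical one-dimensional objective with its minimum at x = ln(2)
def objective(x):
    return exp(x) - 2.0 * x

def first_derivative(x):
    return exp(x) - 2.0

def second_derivative(x):
    return exp(x)

x = 0.0  # arbitrary starting point
for i in range(8):
    # Newton step: divide the first-order derivative by the second-order derivative
    x = x - first_derivative(x) / second_derivative(x)
    print('>%d x = %.10f, f(x) = %.10f' % (i, x, objective(x)))
print('Expected minimum at ln(2) = %.10f' % log(2.0))
```

Each step needs both derivatives at the current point, which is why methods that can work from the first-order derivative alone are attractive when the second-order derivative is hard to obtain.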

Revisiting the same objective function that we used in previous lessons,

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

we can tell that the first-order derivative is:

∇*f* = [2*x*, 2*y*]

This is a vector of two components, because the function *f* (*x*, *y*) receives a vector value of two components (*x*, *y*) and returns a scalar value.
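A hand-derived gradient is easy to get wrong, so it can be worth comparing it against a numerical estimate. The following is a minimal sketch of such a check using central differences; the helper `numerical_gradient`, the test point, and the spacing `h` are choices made for this illustration.

```python
import numpy as np

def objective(x):
    return x[0]**2.0 + x[1]**2.0

def derivative(x):
    return np.asarray([x[0] * 2, x[1] * 2])

def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of the gradient, one component at a time."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for d in range(x.size):
        step = np.zeros_like(x)
        step[d] = h
        grad[d] = (func(x + step) - func(x - step)) / (2.0 * h)
    return grad

point = np.asarray([3.0, -4.0])  # arbitrary test point
analytic = derivative(point)
numeric = numerical_gradient(objective, point)
print('Analytic gradient : %s' % analytic)
print('Numerical gradient: %s' % numeric)
print('Max difference    : %.2e' % np.max(np.abs(analytic - numeric)))
```

If the two vectors disagree by much more than rounding error, the analytic derivative most likely contains a mistake.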

If we create a new function for the first-order derivative, we can call SciPy and apply the BFGS algorithm:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```

The first-order derivative of the objective function is provided to the minimize() function with the “jac” argument. The argument is named after the **Jacobian matrix**, which is what we call the first-order derivative of a function that takes a vector and returns a vector. The BFGS algorithm uses the first-order derivative to approximate the inverse of the **Hessian matrix** (i.e., the matrix of second-order derivatives of a function that takes a vector) and uses that approximation to find the optimum.

Besides BFGS, there is also L-BFGS-B. It is a version of the former that uses less memory (the “L”) and restricts the domain to a bounded region (the “B”). To use this variant, we simply replace the name of the method:
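To see the approximation at work, the sketch below runs BFGS from a fixed starting point (an arbitrary choice) and prints the `hess_inv` field of the result, which holds the final estimate of the inverse Hessian. For this objective the exact Hessian is [[2, 0], [0, 2]], so the exact inverse is [[0.5, 0], [0, 0.5]].

```python
from scipy.optimize import minimize

def objective(x):
    return x[0]**2.0 + x[1]**2.0

def derivative(x):
    return [x[0] * 2, x[1] * 2]

# fixed starting point (an arbitrary choice) so the run is repeatable
result = minimize(objective, [3.0, -4.0], method='BFGS', jac=derivative)
print('Solution: %s' % result.x)
# BFGS reports its final estimate of the inverse Hessian
print('Estimated inverse Hessian:\n%s' % result.hess_inv)
# exact inverse Hessian of x^2 + y^2, for comparison
print('Exact inverse Hessian:\n%s' % [[0.5, 0.0], [0.0, 0.5]])
```

The estimate is built up from the steps the search actually took, so it need not match the exact matrix in every direction, particularly when the search converges in only a few iterations.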

```python
...
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
```
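The bounded region is passed through the `bounds` argument of minimize(), as one (lower, upper) pair per component. The following sketch uses bounds chosen only for illustration: they deliberately exclude the unconstrained minimum at [0, 0], so the search has to settle on the edge of the allowed region.

```python
from scipy.optimize import minimize

def objective(x):
    return x[0]**2.0 + x[1]**2.0

def derivative(x):
    return [x[0] * 2, x[1] * 2]

# hypothetical bounds: x[0] must stay in [1, 5], x[1] in [-5, 5]
bounds = [(1.0, 5.0), (-5.0, 5.0)]
# fixed starting point inside the bounds (an arbitrary choice)
result = minimize(objective, [3.0, 2.0], method='L-BFGS-B', jac=derivative, bounds=bounds)
print('Status  : %s' % result['message'])
print('Solution: f(%s) = %.5f' % (result['x'], result['fun']))
```

With these bounds the best reachable point is [1, 0], where the objective equals 1, rather than the origin.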

### Your Task

For this lesson, you should create a function with many more parameters (i.e., the vector argument to the function has many more than two components) and observe the performance of BFGS and L-BFGS-B. Do you notice a difference in speed? How different are the results from these two methods? What happens if your function is not convex but has many local optima?

Post your answer in the comments below. I would love to see what you come up with.

## Lesson 05: Hill-climbing algorithm

In this lesson, you will discover how to implement the hill-climbing algorithm and use it to optimize your function.

The idea of hill-climbing is to start from a point on the objective function. Then we move the point a bit in a random direction. If the move finds a better solution, we keep the new position. Otherwise, we stay with the old one. After enough iterations, we should be close enough to the optimum of the objective function. The process is so named because it is like climbing a hill: we keep going up (or down) in any direction whenever we can.

In Python, we can write the above hill-climbing algorithm for minimization as a function:

```python
from numpy.random import rand, randn

def in_bounds(point, bounds):
    # enumerate all dimensions of the point
    for d in range(len(bounds)):
        # check if out of bounds for this dimension
        if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
            return False
    return True

def hillclimbing(objective, bounds, n_iterations, step_size):
    # generate an initial point
    solution = None
    while solution is None or not in_bounds(solution, bounds):
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]
```

This function allows any objective function to be passed as long as it takes a vector and returns a scalar value. The “bounds” argument should be a NumPy array of dimension *n*×2, where *n* is the size of the vector that the objective function expects. It gives the lower and upper bounds of the range in which we should look for the minimum. For example, we can set up the bounds as follows for an objective function that expects two-dimensional vectors (like the one in the previous lesson), with each component of the vector between -5 and +5:

```python
from numpy import asarray
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
```

This “hillclimbing” function randomly picks an initial point within the bounds, then tests the objective function iteratively. Whenever it finds a point where the objective function yields a lower value, it remembers that solution, and the next point to test is generated from its neighborhood.
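The expression used to pick the initial point relies on NumPy broadcasting across the columns of the bounds array. The sketch below isolates that one step so its effect can be inspected. It is an illustration only: it draws many points at once, uses a seeded generator from `numpy.random.default_rng` in place of `rand` so that the run is repeatable, and gives the two dimensions different ranges on purpose.

```python
import numpy as np

# lower and upper bound for each of two dimensions (different ranges on purpose)
bounds = np.asarray([[-5.0, 5.0], [0.0, 2.0]])
lower, upper = bounds[:, 0], bounds[:, 1]

rng = np.random.default_rng(0)
# uniform values in [0, 1) are stretched and shifted into each dimension's range
points = lower + rng.random((1000, len(bounds))) * (upper - lower)

print('Smallest sampled values: %s' % points.min(axis=0))
print('Largest sampled values : %s' % points.max(axis=0))
print('All inside bounds      : %s' % bool(np.all((points >= lower) & (points <= upper))))
```

Each column of `points` stays inside its own range, which is why the same expression works for any number of dimensions.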

### Your Task

For this lesson, you should provide your own objective function (for example, by copying the one from the previous lesson), set “n_iterations” and “step_size”, and apply the “hillclimbing” function to find the minimum. Observe how the algorithm finds a solution. Try different values of “step_size” and compare the number of iterations needed to get close to the final solution.

Post your answer in the comments below. I would love to see what you come up with.

## Lesson 06: Simulated annealing

In this lesson, you will discover how simulated annealing works and how to use it.

For non-convex functions, the algorithms you learned in previous lessons may easily be trapped at local optima and fail to find the global optimum. The reason is the greedy nature of these algorithms: whenever a better solution is found, they will not let it go. Hence, if an even better solution exists but is not in the proximity, the algorithm will fail to find it.

Simulated annealing tries to improve on this behavior by striking a balance between *exploration* and *exploitation*. At the beginning, when the algorithm does not know much about the function to optimize, it prefers to explore other solutions rather than stay with the best solution found. At a later stage, as more solutions have been explored and the chance of finding even better ones diminishes, the algorithm prefers to remain in the neighborhood of the best solution it has found.

The following is the implementation of simulated annealing as a Python function:

```python
from numpy import exp
from numpy.random import randn, rand

def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
    # generate an initial point
    best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    best_eval = objective(best)
    # current working solution
    curr, curr_eval = best, best_eval
    # run the algorithm
    for i in range(n_iterations):
        # take a step
        candidate = curr + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check for new best solution
        if candidate_eval < best_eval:
            # store new best point
            best, best_eval = candidate, candidate_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, best, best_eval))
        # difference between candidate and current point evaluation
        diff = candidate_eval - curr_eval
        # calculate temperature for current epoch
        t = temp / float(i + 1)
        # calculate metropolis acceptance criterion
        metropolis = exp(-diff / t)
        # check if we should keep the new point
        if diff < 0 or rand() < metropolis:
            # store the new current point
            curr, curr_eval = candidate, candidate_eval
    return [best, best_eval]
```

Similar to the hill-climbing algorithm in the previous lesson, the function starts with a random initial point. Also as in the previous lesson, the algorithm loops for the number of iterations given by “n_iterations”. In each iteration, a random point in the neighborhood of the current point is picked and the objective function is evaluated on it. The best solution found so far is remembered in the variables “best” and “best_eval”. The difference from the hill-climbing algorithm is that the current point “curr” in each iteration is not necessarily the best solution. Whether we move to the neighboring point or stay depends on a probability that is related to the number of iterations done so far and to how much improvement the neighboring point offers. Because of this stochastic nature, we have a chance of escaping a local minimum to reach a better solution. Finally, regardless of where we end up, we always return the best solution found across all iterations of the simulated annealing algorithm.

In fact, most of the hyperparameter tuning and feature selection problems encountered in machine learning are not convex. Hence simulated annealing should be more suitable than hill-climbing for these optimization problems.
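The acceptance probability can be worked out by hand for a few iterations. The sketch below applies the same cooling schedule and acceptance rule as the function above to a candidate that is worse than the current point by a fixed amount; the initial temperature of 10 and the difference of 1.0 are values picked only for this illustration.

```python
from math import exp

temp = 10.0   # initial temperature (hypothetical value)
diff = 1.0    # the candidate is worse than the current point by this much

probabilities = []
for i in [0, 9, 99]:
    # same cooling schedule and acceptance rule as in simulated_annealing()
    t = temp / float(i + 1)
    metropolis = exp(-diff / t)
    probabilities.append(metropolis)
    print('iteration %3d: temperature = %6.3f, acceptance probability = %.5f' % (i, t, metropolis))
```

With these values the worse candidate is accepted about 90% of the time in the first iteration, about 37% of the time in the tenth, and almost never by the hundredth. This is how the algorithm shifts from exploration to exploitation as it proceeds.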

### Your Task

For this lesson, you should repeat the exercise you did in the previous lesson with the simulated annealing code above. Try it with the objective function *f* (*x*, *y*) = *x*^{2} + *y*^{2}, which is a convex one. Does simulated annealing or hill climbing take fewer iterations? Replace the objective function with the Ackley function introduced in Lesson 03. Is the minimum found by simulated annealing or by hill climbing smaller?

Post your answer in the comments below. I would love to see what you come up with.

## Lesson 07: Gradient descent

In this lesson, you will discover how you can implement the gradient descent algorithm.

The gradient descent algorithm is *the* algorithm used to train a neural network. Although there are many variants, all of them are based on the **gradient**, or first-order derivative, of the function. The idea lies in the physical meaning of the gradient of a function. If the function takes a vector and returns a scalar value, the gradient of the function at any point tells you the **direction** in which the function increases the fastest. Hence, if we aim to find the minimum of the function, the direction we should explore is the exact opposite of the gradient.
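A single step in each direction is enough to see this. The sketch below uses the objective and derivative from Lesson 04; the starting point and the step size of 0.1 are arbitrary choices for the illustration.

```python
import numpy as np

def objective(x):
    return x[0]**2.0 + x[1]**2.0

def derivative(x):
    return np.asarray([x[0] * 2, x[1] * 2])

point = np.asarray([3.0, -4.0])   # arbitrary starting point
step_size = 0.1                   # arbitrary small step size
gradient = derivative(point)

against = point - step_size * gradient   # move opposite to the gradient
along = point + step_size * gradient     # move in the gradient direction

print('f at start           : %.2f' % objective(point))
print('f after step against : %.2f' % objective(against))
print('f after step along   : %.2f' % objective(along))
```

Starting from a value of 25, the step against the gradient lowers the objective to 16, while the step along the gradient raises it to 36.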

In mathematical terms, if we are looking for the minimum of *f* (*x*), where *x* is a vector, and the gradient of *f* (*x*) is denoted by ∇*f* (*x*) (which is also a vector), then we know

*x*_{new} = *x* – *α* × ∇*f* (*x*)

will be closer to the minimum than *x*, provided the step size *α* is small enough. Now let’s try to implement this in Python. Reusing the sample objective function and its derivative from Lesson 04, this is the gradient descent algorithm and its use to find the minimum of the objective function:

```python
from numpy import asarray
from numpy import arange
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return asarray([x[0]*2, x[1]*2])

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run the gradient descent
    for i in range(n_iter):
        # calculate gradient
        gradient = derivative(solution)
        # take a step
        solution = solution - step_size * gradient
        # evaluate candidate point
        solution_eval = objective(solution)
        # report progress
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 40
# define the step size
step_size = 0.1
# perform the gradient descent search
solution, solution_eval = gradient_descent(objective, derivative, bounds, n_iter, step_size)
print("Solution: f(%s) = %.5f" % (solution, solution_eval))
```

This algorithm depends not only on the objective function but also on its derivative. Hence it may not be suitable for all kinds of problems. The algorithm is also sensitive to the step size: a step size that is too large relative to the objective function may cause gradient descent to fail to converge. If this happens, we will see that the progress is not moving toward lower values.

There are several variations to make gradient descent algorithm more robust, for example:

- Add a **momentum** term to the process, so that the move follows not only the gradient but also, in part, the average of the gradients from previous iterations.
- Make the step sizes different for each component of the vector *x*
- Make the step size adaptive to the progress
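As one possible reading of the last item, the sketch below halves the step size whenever a step would fail to lower the objective. This simple halving rule is only one of many adaptive schemes and is used here as an assumption for illustration; the function name `gradient_descent_backtracking`, the fixed starting point, and the initial step size are likewise invented for this example.

```python
import numpy as np

def objective(x):
    return x[0]**2.0 + x[1]**2.0

def derivative(x):
    return np.asarray([x[0] * 2, x[1] * 2])

def gradient_descent_backtracking(objective, derivative, start, n_iter, step_size):
    solution = np.asarray(start, dtype=float)
    solution_eval = objective(solution)
    for i in range(n_iter):
        gradient = derivative(solution)
        candidate = solution - step_size * gradient
        # shrink the step until the move no longer raises the objective
        while objective(candidate) > solution_eval and step_size > 1e-12:
            step_size *= 0.5
            candidate = solution - step_size * gradient
        solution, solution_eval = candidate, objective(candidate)
        print('>%d step=%.4f f(%s) = %.5f' % (i, step_size, solution, solution_eval))
    return solution, solution_eval, step_size

# the initial step size of 1.5 is deliberately too large for this objective
solution, solution_eval, final_step = gradient_descent_backtracking(
    objective, derivative, [3.0, -4.0], 20, 1.5)
print('Solution: f(%s) = %.5g with final step size %.4f' % (solution, solution_eval, final_step))
```

The first iteration rejects the oversized step, settles on a smaller one, and the remaining iterations then make steady progress toward the minimum.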

### Your Task

For this lesson, you should run the example program above with different values of “step_size” and “n_iter” and observe the difference in the progress of the algorithm. At what “step_size” does the above program fail to converge? Then try adding a new parameter *β* to the gradient_descent() function as the *momentum weight*, so that the update rule becomes

*x*_{new} = *x* – *α* × ∇*f* (*x*) – *β* × *g*

where *g* is the average of ∇*f* (*x*) over, for example, the five previous iterations. Do you see any improvement in the optimization? Is this a suitable example for using momentum?

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson.

## The End!

(*Look How Far You Have Come*)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of optimization in applied machine learning.
- How to do grid search to optimize by exhausting all possible solutions.
- How to use SciPy to optimize your own function.
- How to implement the hill-climbing algorithm for optimization.
- How to use the simulated annealing algorithm for optimization.
- What gradient descent is, how to use it, and some variations of this algorithm.

## Summary

**How did you do with the mini-course?**

Did you enjoy this crash course?

**Do you have any questions? Were there any sticking points?**

Let me know. Leave a comment below.

Task for lesson #1:

1. KNN – we want to make the difference between Y predicted to the real Y as small as possible.

2. K – means – we want the groups to be split in a similar way to the real splits in the data.

3. Linear Regression – just like the KNN, and also we want the explainable part will be the biggest possible without too many explanatory variables.

Thanks for the knowledge you spread, you are fantastic

Good answer. Keep on!

Hi Jason! Thanks for a great tutorial! I think part of the reason you were put on this earth is to help others enjoy mastering data science. Thank God for JBL. I will post my answer to Lesson 01 here:

Find a machine learning model: Bag of words.

List three examples that optimization might be used or might help in training and using the model:

1. Picking the best number of words to include in the vocabulary. Not too few, not too many.

2. UsingTF-IDF or not? Which yields a higher score (precision/recall)?

3. Should we use stemming or not? Which yields higher score?

I am not sure that 2 and 3 are what you are looking for… Thanks for any feedback!

Those make sense! Thanks for sharing. Keep on!

LESSON 02

```python
from numpy import arange, inf

# objective function
def objective(x,y,z):
    return (x - y + 1)**2.0 + z**2.0
    # return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain sample = list()
sample = []
step = 0.1
for x in arange(r_min, r_max+step, step):
    for y in arange(r_min, r_max+step, step):
        for z in arange(r_min, r_max+step, step):
            sample.append([x,y,z])
# evaluate the sample
best_eval = inf
best_x, best_y, best_z = None, None, None
for x,y,z in sample:
    eval = objective(x,y,z)
```

if eval >> Best: func(-5, -4, 0) = 0

Thanks for sharing. You can replace the three arange() if you use meshgrid(). But good job!

Lesson-1: Un-Supervised algos are my favorites…K-Means…Expectation Maximization…both require optimizing the clusters to the point where there are no inter-mingled outliers amongst the density distributions that were assumed. Les for K-Means since its around adjusting centroids but most around likelihood maximization in EM. Thx Jason for pushing us beyond our boundaries :)))

Good job. Thanks for sharing.

Hello everyone!

Thank you Adrian for the great course!

is there an error in Your task for Lesson 3? What does C stand for in objective function:

return ( -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))

– exp(0.5 * (cos(2 * pi C *x)+cos(2*pi*y))) + e + 20 )

shouldn’t it be

return ( -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))

– exp(0.5 * (cos(2*pi*x)+cos(2*pi*y))) + e + 20 ) ?

Yes, indeed. Good catch.

hello

i want to solve f(x) = x^3-500x^2+700x+100 with SA algorithm. could any one help me?

Did you try to feed it into the sample code?

Day 1 task:

One of starting point of machine learning model is linear regression. There are couple of methods to optimise. These could be

-Ordinary Least Squares

-Batch Gradient Descent

-Stochastic Gradient Descent

-Mini Batch Gradient Descent

Day 2 task:

Code:

```python
from numpy import meshgrid, inf

# objective function
def objective(x,y,z):
    return (x-y+1)**2.0+z**2.0

# define range for input
r_min, r_max=-5.0, 5.0
# generate a grid sample from the domain sample = list()
sample = []
step = 0.1
for x in meshgrid(r_min, r_max+step, step):
    for y in meshgrid(r_min, r_max+step, step):
        for z in meshgrid(r_min, r_max+step, step):
            sample.append([x,y,z])
# evaluate the sample
best_eval = inf
best_x, best_y = None, None
for x,y,z in sample:
    eval = objective(x,y,z)
    if eval < best_eval:
        best_x = x
        best_y = y
        best_z = z
        best_eval = eval
# summarize best solution
print('Best: f(%.5f,%.5f,%.5f) = %.5f' % (best_x, best_y, best_z, best_eval))
```

Output:

Best: f(-5.00000,-5.00000,0.10000) = 1.01000
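A note on the Day 2 code above: meshgrid() is not a drop-in replacement for arange(). Called with three scalars, it appears to yield just the values -5.0, 5.1 and 0.1 in each loop rather than a grid, which would explain a best value of 1.01 instead of 0. The sketch below shows one way meshgrid() could be combined with arange() for the same objective. The use of argmin() and unravel_index() is an illustrative choice, not something from the lesson.

```python
import numpy as np

# objective function from the Day 2 task
def objective(x, y, z):
    return (x - y + 1) ** 2.0 + z ** 2.0

# define range for input
r_min, r_max, step = -5.0, 5.0, 0.1
# one axis of grid values, reused for x, y and z
axis = np.arange(r_min, r_max + step, step)
# indexing='ij' makes values[i, j, k] correspond to (axis[i], axis[j], axis[k])
xx, yy, zz = np.meshgrid(axis, axis, axis, indexing='ij')
# evaluate the whole grid in one vectorized call
values = objective(xx, yy, zz)
# locate the smallest value and convert the flat index back to (i, j, k)
i, j, k = np.unravel_index(np.argmin(values), values.shape)
best_x, best_y, best_z, best_eval = axis[i], axis[j], axis[k], values[i, j, k]
# summarize best solution
print('Best: f(%.5f,%.5f,%.5f) = %.5f' % (best_x, best_y, best_z, best_eval))
```

Any point with x - y + 1 = 0 and z = 0 is a minimum of this objective, so which of the tied grid points gets reported depends on the iteration order.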

Day 3 task:

Code:

from scipy.optimize import minimize
from numpy.random import rand
from numpy import e, pi, cos, sqrt, exp

# objective function
def objective(v):
    x, y = v
    return ( -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))
        - exp(0.5 * (cos(2*pi*x)+cos(2*pi*y))) + e + 20 )

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Output (different every time when executed):

Status : Optimization terminated successfully.

Total Evaluations: 51

Solution: f([-1.97441491 -1.97444489]) = 6.55965

Day 4 task:

There is some difference in performance between the two methods, BFGS and L-BFGS-B. We got 4 and 3 total evaluations respectively, which suggests that L-BFGS-B converged slightly faster on this problem. The solutions do not differ significantly. For non-convex functions these methods do not work well, since they can stop at a local minimum.
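The comparison described above could be run along these lines. This is a sketch under assumptions: the objective x^2 + y^2 and its gradient are stand-ins chosen for illustration, and the seed is arbitrary, so the evaluation counts need not match the 4 and 3 reported in the comment.

```python
from numpy import asarray
from numpy.random import default_rng
from scipy.optimize import minimize

# assumed convex test objective
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# gradient of the objective
def derivative(x):
    return asarray([2.0 * x[0], 2.0 * x[1]])

# define range for input
r_min, r_max = -5.0, 5.0
# same random starting point for both methods (seed is an arbitrary choice)
rng = default_rng(1)
pt = r_min + rng.random(2) * (r_max - r_min)

# run both searches from the same point and report their evaluation counts
results = {}
for method in ('BFGS', 'L-BFGS-B'):
    results[method] = minimize(objective, pt, method=method, jac=derivative)
    r = results[method]
    print('%s: evaluations=%d, f(%s) = %.5f' % (method, r.nfev, r.x, r.fun))
```

On a simple quadratic like this, both methods should reach the minimum in a handful of evaluations, so a difference of one evaluation is probably not a strong signal about speed in general.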

Task 01:

Since my field is related to reinforcement learning, the main optimization tasks would be:

– Finding optimal policy.

– Exploration/Exploitation trade-off.

– Optimizers for function approximation.

Task 02:

Here is the code:

import numpy as np
from numpy import arange, inf

def objective(x,y):
    return x**2.0+y**2.0

r_min = -5
r_max = 5
step = 0.1
x = arange(r_min, r_max+step, step)
y = arange(r_min, r_max+step, step)
# indexing='ij' so that eval[i,j] corresponds to (x[i], y[j])
xx,yy = np.meshgrid(x,y, sparse = True, indexing = 'ij')
eval = objective(xx,yy)
[a,b] = np.shape(eval)
best_value = inf
best_x, best_y = None, None
for i in range(a):
    for j in range(b):
        if eval[i,j] < best_value:
            best_x = x[i]
            best_y = y[j]
            best_value = eval[i,j]
print('Best: f(%.5f,%.5f) = %.5f' % (best_x, best_y, best_value))

Task 02: (Please remove previous entry)

Here is the code:

import numpy as np
from numpy import arange, inf

def objective(x,y,z):
    return (x+1-y)**2.0 + z**2.0 #x**2.0+y**2.0

r_min = -5
r_max = 5
step = 0.1
x = arange(r_min, r_max+step, step)
y = arange(r_min, r_max+step, step)
z = arange(r_min, r_max+step, step)
# indexing='ij' so that eval[i,j,k] corresponds to (x[i], y[j], z[k])
xx,yy,zz = np.meshgrid(x,y,z, sparse = True, indexing = 'ij')
eval = objective(xx,yy,zz)
[a,b,c] = np.shape(eval)
best_value = inf
best_x, best_y, best_z = None, None, None
for i in range(a):
    for j in range(b):
        for k in range(c):
            if eval[i,j,k] < best_value:
                best_x = x[i]
                best_y = y[j]
                best_z = z[k]
                best_value = eval[i,j,k]
print('Best: f(%.5f,%.5f,%.5f) = %.5f' % (best_x, best_y, best_z, best_value))

Task 03:

The result changes every time indeed. Out of pure luck, I got this most optimal result:

Status : Optimization terminated successfully.

Total Evaluations: 74

Solution: f([-3.55101317e-05 5.56025188e-05]) = 0.00019

It was fun haha.

Question: How come we used x,y instead of x[0], x[1] this time?
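On the x, y question above: the two styles should be interchangeable. Writing x, y = v unpacks the two elements of the input vector into separate names, which is the same as reading v[0] and v[1]. A small sketch, with function names made up for illustration:

```python
from numpy import asarray

# indexed style: read the components with x[0] and x[1]
def objective_indexed(x):
    return x[0]**2.0 + x[1]**2.0

# unpacked style: name the components first, then use them
def objective_unpacked(v):
    x, y = v  # x is v[0], y is v[1]
    return x**2.0 + y**2.0

point = asarray([1.5, -2.0])
print(objective_indexed(point), objective_unpacked(point))
```

Unpacking tends to read better when the formula is long, as with the Ackley function, while indexing avoids the extra line for short formulas.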

from numpy import arange, inf

# objective function
def objective(x, y, z):
    return (x-y+1)**2.0 + z**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain
sample = list()
step = 0.1
for x in arange(r_min, r_max+step, step):
    for y in arange(r_min, r_max+step, step):
        for z in arange(r_min, r_max+step, step):
            sample.append([x,y,z])
# evaluate the sample
best_eval = inf
best_x, best_y, best_z = None, None, None
for x,y,z in sample:
    eval = objective(x,y,z)
    if eval < best_eval:
        best_x = x
        best_y = y
        best_z = z
        best_eval = eval
# summarize best solution
print('Best: f(%.5f,%.5f,%.5f) = %.5f' % (best_x, best_y, best_z, best_eval))

Best: f(-5.00000,-4.00000,-0.00000) = 0.00000

Hi,

I love your teaching method and I already recommended it to whoever is looking for a great reference. Thanks.

I have a question about optimization and I would appreciate your thoughts.

Can I use optimization in a setting where the function includes a machine learning prediction?

f = sum(pred(x1,x2,…,MLmodel)) ,

with a range for x1,x2,… and a start point x1,x2,…, with the goal of maximizing that sum?

Is it a reasonable approach, and the best one, for such a problem?

And can I introduce extra conditions like sum(X1) <=n1, …?

Hi Shima…Yes you could proceed in that manner. I would highly recommend that you investigate Bayesian Optimization.

Thank you so much for your fast response. I will look into your post on Bayesian Optimization.
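To make the setup in this exchange concrete, here is a minimal sketch of maximizing a sum of predictions under a sum constraint with SciPy. Everything specific in it is an assumption: model_predict is a hypothetical smooth stand-in for a trained model, and the limit n1 = 3, the bounds [0, 5] and the start point are made up. Maximizing is done by minimizing the negated sum, and the condition sum(x) <= n1 is written as the inequality n1 - sum(x) >= 0, which is the form SLSQP expects.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical stand-in for a trained model's prediction (peaks at x = 2)
def model_predict(x):
    return 4.0 - (x - 2.0) ** 2

# minimize() minimizes, so negate the sum we want to maximize
def negative_total(x):
    return -np.sum(model_predict(x))

# assumed limit for the condition sum(x) <= n1
n1 = 3.0
# assumed range for each input
bounds = [(0.0, 5.0)] * 3
# 'ineq' constraints must be non-negative at a feasible point
constraints = [{'type': 'ineq', 'fun': lambda x: n1 - np.sum(x)}]
# assumed start point
start = np.asarray([0.5, 1.0, 0.2])

result = minimize(negative_total, start, method='SLSQP',
                  bounds=bounds, constraints=constraints)
best_x = result.x
best_total = -result.fun
print('Best inputs: %s, total prediction: %.5f' % (best_x, best_total))
```

SLSQP relies on gradients, so this only suits models whose predictions vary smoothly with the inputs. For tree-based or otherwise non-smooth models, a derivative-free method or the Bayesian optimization suggested in the reply is likely the better fit.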

Hi, for Lesson 5 (Hill Climbing Algorithm), you have to import rand as well in addition to randn, as rand is being used, and if not imported, gives an error. This seems to be a “typo” in the import statement. Lesson 6 (Simulated Annealing) imports both rand and randn, as it uses both, similar to the hill climbing algorithm. These are great lessons btw.

Thank you for the feedback Jude!

Thank you for the informative sessions. Can you please show an imputation example for disease outbreak prediction modeling using Bayesian Inferences.

Hi Isatou…You may find the following resource of interest:

http://www.ijtmgh.com/article_95527.html

In the task of lesson 4 I tried:

# objective function
def objective(x):
    return x[0]**4.0 + x[1]**3.0 + 3*x[2]**5

# derivative of the objective function
def derivative(x):
    return [4.0 * x[0]**3, 3.0*x[1]**2, 15*x[2]**4]

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the bfgs algorithm search
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)

and I get the following error:

IndexError: index 2 is out of bounds for axis 0 with size 2

Why is that? The only defined bounds is the range [-5,5], which doesn’t really specify if it’s in the x0, x1 (or x2) direction. How do you solve this?

Hi Joseph…Did you type the code in or copy and paste it?

I copied the code and pasted it and it worked. Then I modified it and got that error. I copied&pasted the modified version to this message.

Hi Joseph…You may want to try it in Google Colab as well. StackOverflow is also a great resource to discuss errors with other machine learning practitioners.

In case someone encounters the same problem, the error was that it should be rand(3)
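A sketch of the corrected version, for anyone hitting the same IndexError: the starting point needs one entry per input dimension, so three here. The sketch also passes the range to minimize() through the bounds argument, since r_min and r_max on their own only shape the starting point and do not constrain the search. That matters for this objective because x[1]**3 and 3*x[2]**5 are unbounded below. The seed and the name n_dims are illustrative assumptions.

```python
from numpy import asarray
from numpy.random import default_rng
from scipy.optimize import minimize

# objective function with three inputs
def objective(x):
    return x[0]**4.0 + x[1]**3.0 + 3*x[2]**5

# derivative of the objective function
def derivative(x):
    return asarray([4.0 * x[0]**3, 3.0 * x[1]**2, 15 * x[2]**4])

# define range for input
r_min, r_max = -5.0, 5.0
# number of input dimensions must match what objective() indexes
n_dims = 3
# define the starting point as a random sample from the domain
rng = default_rng(1)
pt = r_min + rng.random(n_dims) * (r_max - r_min)
# apply the same range to every dimension as an explicit constraint
bounds = [(r_min, r_max)] * n_dims
# perform the L-BFGS-B search
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative, bounds=bounds)
print('Status : %s' % result['message'])
print('Solution: f(%s) = %.5f' % (result['x'], result['fun']))
```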

When developing models for sports betting one may be optimizing for profit rather than accuracy. The profit may be calculated to a mythical £1 stake on every selection made by the model, but this can result in distorted, unrealistic results should a rank outsider pop up and win. It may therefore be that one is optimizing to variable stakes, in other words to always win £1, e.g. 50p on at 2/1, etc.

Question: in lesson 05 (hill-climbing algorithm), in code line 23, why use randn (normal distribution) instead of rand (uniform distribution)?

Hi Youcef…Either could be used to illustrate the concept. Try both and let us know what you find.

I have tried both distributions (uniform and normal), with the following configuration:

objective function: f = x**2 + y**2

step size = 0.2

iterations = 100

number of runs = 100

for the uniform distribution:

average found solution is: 0.011702

average number of hops to get to the solution is: 31

for the normal distribution:

average found solution is: 0.028597

average number of hops to get to the solution is: 31

So yes, basically we get the same results with both distributions.

Also, for line 23, I have normalized the randomly generated vector and let the step_size decide its magnitude. I am not sure if this makes any difference.
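An experiment of this kind could be set up roughly as below. This is a sketch under assumptions and not a reconstruction of the code behind the numbers above: the uniform step is assumed to be drawn from [-1, 1] per coordinate, the direction is normalized so that step_size alone sets the step length, and the seed, the helper names and the acceptance rule are illustrative.

```python
import numpy as np

# objective function from the configuration above
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# assumed uniform step: each coordinate drawn from [-1, 1]
def uniform_step(rng, n):
    return rng.uniform(-1.0, 1.0, n)

# normal step: each coordinate drawn from a standard Gaussian
def normal_step(rng, n):
    return rng.standard_normal(n)

# hypothetical helper: hill climbing with a pluggable step distribution
def hill_climb(step_fn, rng, bounds, n_iterations, step_size, normalize=False):
    # random starting point inside the bounds
    solution = bounds[:, 0] + rng.random(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution)
    for _ in range(n_iterations):
        direction = step_fn(rng, len(bounds))
        if normalize:
            # unit-length direction, so step_size alone decides the magnitude
            direction = direction / np.linalg.norm(direction)
        candidate = solution + direction * step_size
        candidate_eval = objective(candidate)
        # keep the candidate only if it is at least as good
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return solution_eval

bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
rng = np.random.default_rng(1)
# average the final objective value over 100 runs for each distribution
averages = {}
for name, step_fn in (('uniform', uniform_step), ('normal', normal_step)):
    evals = [hill_climb(step_fn, rng, bounds, 100, 0.2, normalize=True) for _ in range(100)]
    averages[name] = float(np.mean(evals))
    print('%s: average final value over 100 runs = %.6f' % (name, averages[name]))
```

With normalization switched on, the two distributions differ only in the direction they pick, which may be part of why the results end up so similar; with normalize=False the typical step length differs between them as well.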

Derivative-free optimization is one family of methods used in machine learning. Particle swarm optimization, genetic algorithms, and Bayesian optimization are some examples.

Thank you Rehan for your feedback!

Thanks for the great course James.

Q1) I went for nonlinear regression analysis for structural reliability, and Box-Behnken RSM for Risk Based Design Optimisation (RBDO).

Hi Lexi…You are very welcome! Keep up the great work and let us know if we can answer any questions regarding the content.

Lesson 5: Hill Climbing

from numpy.random import randn

from numpy import asarray

bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])

n_iterations = 100

step_size = 0.1

def objective(x):
    return x[0]**2.0 + x[1]**2.0

Results:

>0 f([-3.68579761 1.46120481]) = 15.72022

>1 f([-3.5723113 1.51378226]) = 15.05294

>3 f([-3.54142472 1.54298156]) = 14.92248

.

.

.

>92 f([-0.78315513 0.46426033]) = 0.82887

>93 f([-0.58243179 0.54524862]) = 0.63652

>95 f([-0.4682943 0.58038478]) = 0.55615

>99 f([-0.37237035 0.63859304]) = 0.54646

Lesson 6 : Simulated annealing

bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])

n_iterations = 100

step_size = 0.1

temp = -0.01

# Ackley function
def objective(v):
    x, y = v
    return ( -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))
        - exp(0.5 * (cos(2*pi*x)+cos(2*pi*y))) + e + 20 )

Output results:

>0 f([0.1717444 0.27422098]) = 2.43902

>2 f([0.0853289 0.20363316]) = 1.55872

>5 f([0.07214107 0.1278015 ]) = 0.91070

>10 f([0.00746073 0.04047236]) = 0.16089

>14 f([-0.02684032 0.00021496]) = 0.09499

>44 f([ 0.00773109 -0.01292497]) = 0.04863

task_1:

———-

Let’s consider decision trees as a machine learning model. Decision trees are a popular type of model that can be used for both classification and regression problems. Here are three examples associated with this model:

Hyperparameter Tuning: Decision tree models have several important hyperparameters, such as tree depth, minimum sample count, and split criteria. Optimization methods can be employed to find the best values for these hyperparameters. For instance, optimization can be performed using techniques like grid search or Bayesian optimization to find the combination of hyperparameters that yields the best performance.

Feature Selection: Decision trees are often fed with numerous features, but sometimes it can be more effective to select the most important features instead of using all of them. Optimization methods can be used to determine the most crucial features and reduce the complexity of the model. This can enhance the overall performance of the model and accelerate the training process.

Training Time Optimization: Training decision trees on large datasets can be time-consuming. Optimization can be utilized to reduce training time or optimize memory usage. For example, it is possible to optimize training time by employing parallel computing or using specialized training algorithms.

These examples represent different scenarios where optimization can be applied to decision trees. However, regardless of the model chosen, it is important to remember that there are many situations in the training or utilization of most machine learning models where optimization can be beneficial.

Hi Peri…Thank you for your feedback! Let us know if we can help answer any question regarding the material!

For three examples that need optimization: many machine learning algorithms have parameters that must be optimized to work efficiently, such as:

1- Random forest classifier, which is constructed from multiple decision trees, needs many parameters optimized, such as the number of decision trees used and the minimum number of samples needed to split an internal node.

Also, the number of features used should be reduced by removing irrelevant features to get more accurate predictions.

2- ANN, which denotes artificial neural network, has many parameters to be optimized, such as the number of layers in the network, the number of neurons in each layer, and the learning rate.

3- Long Short-Term Memory (LSTM) has many parameters to optimize in order to obtain accurate results, such as the number of LSTM layers, the number of hidden units in each layer, and the learning rate.

Hi moh…great examples! Let us know if you have any questions regarding our content.