Profiling Python Code

By Adrian Tam on June 21, 2022 in Python for Machine Learning

Profiling is a technique to figure out how time is spent in a program. With these statistics, we can find the “hot spot” of a program and think about ways of improvement. Sometimes, a hot spot in an unexpected location may hint at a bug in the program as well.

In this tutorial, we will see how we can use the profiling facility in Python. Specifically, you will see:

How we can compare small code fragments using the timeit module
How we can profile the entire program using the cProfile module
How we can invoke a profiler inside an existing program
What the profiler cannot do

Let’s get started.

Profiling Python Code. Photo by Prashant Saini. Some rights reserved.

Tutorial Overview

This tutorial is in four parts; they are:

Profiling small fragments
The profile module
Using profiler inside code
Caveats

Profiling Small Fragments

When you are asked about the different ways of doing the same thing in Python, one perspective is to check which one is more efficient. In Python’s standard library, we have the timeit module that allows us to do some simple profiling.

For example, to concatenate many short strings, we can use the join() function from strings or the + operator. So, how do we know which is faster? Consider the following Python code:

longstr = ""
for x in range(1000):
  longstr += str(x)

This will produce a long string 012345.... in the variable longstr. An alternative way to write this is:

longstr = "".join([str(x) for x in range(1000)])

To compare the two, we can do the following at the command line:

python -m timeit 'longstr=""' 'for x in range(1000): longstr += str(x)'
python -m timeit '"".join([str(x) for x in range(1000)])'

These two commands will produce the following output:

1000 loops, best of 5: 265 usec per loop
2000 loops, best of 5: 160 usec per loop

Each command above loads the timeit module and passes it the code to measure. In the first command, the code consists of two statements, which are passed to timeit as two separate arguments. By the same logic, the first command can also be written with three arguments (by splitting the for-loop into two lines), but the indentation of each line must be preserved inside the quotes:

python -m timeit 'longstr=""' 'for x in range(1000):' ' longstr += str(x)'

timeit repeats the measurement several times (5 by default) and reports the best run. In each run, the provided statements are executed a number of times, which is determined dynamically. The reported time is the average time to execute the statements once in the best run.

While it is true that join() is faster than the + operator for string concatenation, the timing above is not a fair comparison, because we use str(x) to create the short strings on the fly inside the loop. A better way to do this is the following:

python -m timeit -s 'strings = [str(x) for x in range(1000)]' 'longstr=""' 'for x in strings:' ' longstr += str(x)'
python -m timeit -s 'strings = [str(x) for x in range(1000)]' '"".join(strings)'

which produces:

2000 loops, best of 5: 173 usec per loop
50000 loops, best of 5: 6.91 usec per loop

The -s option allows us to provide the “setup” code, which is executed before the profiling and not timed. In the above, we create the list of short strings before starting the loop. Hence the time to create those strings is not measured in the “per loop” timing. The above shows that the join() function is about 25 times faster than the + operator in this test (6.91 usec versus 173 usec). Note that the loop version still calls str(x) on every element, which is redundant since the elements are already strings, so a small part of the gap comes from that call. The more common use of the -s option is to import libraries. For example, we can compare the square root function from Python’s math module, the one from NumPy, and the exponentiation operator ** as follows:

python -m timeit '[x**0.5 for x in range(1000)]'
python -m timeit -s 'from math import sqrt' '[sqrt(x) for x in range(1000)]'
python -m timeit -s 'from numpy import sqrt' '[sqrt(x) for x in range(1000)]'

The above produces the following measurements, from which we see that math.sqrt() is the fastest while numpy.sqrt() is the slowest in this particular example:

5000 loops, best of 5: 93.2 usec per loop
5000 loops, best of 5: 72.3 usec per loop
200 loops, best of 5: 974 usec per loop

If you wonder why NumPy is the slowest, it is because NumPy is optimized for arrays. You will see its exceptional speed in the following alternative:

python -m timeit -s 'import numpy as np; x=np.array(range(1000))' 'np.sqrt(x)'

where the result is:

100000 loops, best of 5: 2.08 usec per loop

If you prefer, you can also run timeit in Python code. For example, the following is similar to the above but gives you the raw total time of each run:

import timeit
measurements = timeit.repeat('[x**0.5 for x in range(1000)]', number=10000)
print(measurements)

In the above, each run executes the statement 10,000 times, and the result is as follows. The best run took about 0.98 seconds in total, which is roughly 98 usec per loop:

[1.0888952040000106, 0.9799715450000122, 1.0921516899999801, 1.0946189250000202, 1.2792069260000005]
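To recover the “usec per loop” figure that the command line prints, we can divide the best total time by the number of executions. The following is a minimal sketch of one way to do it; the variable names are our own, the setup argument plays the role of the -s option, and Timer.autorange() is used to pick a loop count in the same spirit as the command-line tool:

```python
import timeit

# the setup code runs before each run and is not timed (like the -s option)
timer = timeit.Timer(stmt='"".join(strings)',
                     setup='strings = [str(x) for x in range(1000)]')

# let timeit pick a loop count large enough for a stable measurement
number, _ = timer.autorange()

# five runs, each executing the statement `number` times
measurements = timer.repeat(repeat=5, number=number)

# average time of one execution in the best run, in microseconds
per_loop_usec = min(measurements) / number * 1e6
print("%d loops, best of %d: %.3g usec per loop"
      % (number, len(measurements), per_loop_usec))
```

The exact numbers depend on the machine, so expect them to differ from the figures quoted in this tutorial.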

The Profile Module

Focusing on a statement or two is a microscopic view of performance. More often, we have a long program and want to see what is causing it to run slowly. That has to happen before we can consider alternative statements or algorithms.

A program generally runs slowly for one of two reasons: a part of it is slow, or a part of it runs too many times and the time adds up. We call such “performance hogs” hot spots. Let’s look at an example. Consider the following program that uses a hill-climbing algorithm to find hyperparameters for a perceptron model:

# manually search perceptron hyperparameters for binary classification
from numpy import mean
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron

# objective function
def objective(X, y, cfg):
	# unpack config
	eta, alpha = cfg
	# define model
	model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# calculate mean accuracy
	result = mean(scores)
	return result

# take a step in the search space
def step(cfg, step_size):
	# unpack the configuration
	eta, alpha = cfg
	# step eta
	new_eta = eta + randn() * step_size
	# check the bounds of eta
	if new_eta <= 0.0:
		new_eta = 1e-8
	if new_eta > 1.0:
		new_eta = 1.0
	# step alpha
	new_alpha = alpha + randn() * step_size
	# check the bounds of alpha
	if new_alpha < 0.0:
		new_alpha = 0.0
	# return the new configuration
	return [new_eta, new_alpha]

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, step_size):
	# starting point for the search
	solution = [rand(), rand()]
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = step(solution, step_size)
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))

Assuming we saved this program in the file hillclimb.py, we can run the profiler from the command line as follows:

python -m cProfile hillclimb.py

and the output will be the following:

>10, cfg=[0.3792455490265847, 0.21589566352848377] 0.78400
>17, cfg=[0.49105438202347707, 0.1342150084854657] 0.79833
>26, cfg=[0.5737524712834843, 0.016749795596210315] 0.80033
>47, cfg=[0.5067828976025809, 0.05280380038497864] 0.80133
>48, cfg=[0.5427345321546029, 0.0049895870979695875] 0.81167
Done!
cfg=[0.5427345321546029, 0.0049895870979695875]: Mean Accuracy: 0.811667
         2686451 function calls (2638255 primitive calls) in 5.500 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      101    0.001    0.000    4.892    0.048 hillclimb.py:11(objective)
        1    0.000    0.000    5.501    5.501 hillclimb.py:2(<module>)
      100    0.000    0.000    0.001    0.000 hillclimb.py:25(step)
        1    0.001    0.001    4.894    4.894 hillclimb.py:44(hillclimbing)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(<module>)
      303    0.000    0.000    0.008    0.000 <__array_function__ internals>:2(all)
      303    0.000    0.000    0.005    0.000 <__array_function__ internals>:2(amin)
        2    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(any)
        4    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(atleast_1d)
     3333    0.003    0.000    0.018    0.000 <__array_function__ internals>:2(bincount)
      103    0.000    0.000    0.001    0.000 <__array_function__ internals>:2(concatenate)
        3    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(copyto)
      606    0.001    0.000    0.010    0.000 <__array_function__ internals>:2(cumsum)
        6    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(dot)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(empty_like)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(inv)
        2    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(linspace)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(lstsq)
      101    0.000    0.000    0.005    0.000 <__array_function__ internals>:2(mean)
        2    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(ndim)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(outer)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(polyfit)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(polyval)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(prod)
      303    0.000    0.000    0.002    0.000 <__array_function__ internals>:2(ravel)
        2    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(result_type)
      303    0.001    0.000    0.001    0.000 <__array_function__ internals>:2(shape)
      303    0.000    0.000    0.035    0.000 <__array_function__ internals>:2(sort)
        4    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(trim_zeros)
     1617    0.002    0.000    0.112    0.000 <__array_function__ internals>:2(unique)
...

The normal output of the program is printed first, followed by the profiler’s statistics. From the first row, we see that the function objective() in our program ran 101 times and took 4.89 seconds in total (the cumtime column). But these 4.89 seconds were mostly spent in the functions it called; the time spent in objective() itself (the tottime column) is merely 0.001 seconds. The functions from dependent modules are also profiled, hence you see a lot of NumPy functions above too.
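The difference between the two time columns is easier to see on a toy program. The following is a small sketch, not part of hillclimb.py: the functions inner() and outer() are made up for illustration, with outer() doing almost nothing itself and inner() doing the actual work. We would expect outer() to show a tiny tottime but a large cumtime:

```python
import cProfile
import io
import pstats

def inner():
    # does the actual work
    total = 0
    for i in range(200000):
        total += i * i
    return total

def outer():
    # does almost nothing itself; its cost is in the calls to inner()
    results = []
    for _ in range(5):
        results.append(inner())
    return results

prof = cProfile.Profile()
prof.enable()
outer()
prof.disable()

# write the report into a string buffer instead of directly to the screen
buffer = io.StringIO()
stats = pstats.Stats(prof, stream=buffer).strip_dirs().sort_stats("cumtime")
stats.print_stats("inner|outer")   # keep only the rows matching this regex
report = buffer.getvalue()
print(report)
```

In the printed table, the cumtime of outer() includes everything spent in inner(), while its tottime counts only the loop and bookkeeping in outer() itself.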

The above output is long and may not be useful to you as it can be difficult to tell which function is the hot spot. Fortunately, we can sort the output. For example, to see which function is called the most times, we can sort by ncalls:

python -m cProfile -s ncalls hillclimb.py

Its output is as follows. It says the get() method of Python dict is the most-called function (but it consumed only about 0.03 seconds in total out of the 5.6 seconds the program took to finish):

         2685349 function calls (2637153 primitive calls) in 5.609 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   247588    0.029    0.000    0.029    0.000 {method 'get' of 'dict' objects}
   246196    0.028    0.000    0.028    0.000 inspect.py:2548(name)
   168057    0.018    0.000    0.018    0.000 {method 'append' of 'list' objects}
   161738    0.018    0.000    0.018    0.000 inspect.py:2560(kind)
   144431    0.021    0.000    0.029    0.000 {built-in method builtins.isinstance}
   142213    0.030    0.000    0.031    0.000 {built-in method builtins.getattr}
...

The other sort options are as follows:

Sort string	Meaning
calls	Call count
cumulative	Cumulative time
cumtime	Cumulative time
file	File name
filename	File name
module	File name
ncalls	Call count
pcalls	Primitive call count
line	Line number
name	Function name
nfl	Name/file/line
stdname	Standard name
time	Internal time
tottime	Internal time

If the program takes some time to finish, it is not reasonable to run it many times just to see the profiling result in a different sort order. Instead, we can save the profiler’s statistics for further processing as follows:

python -m cProfile -o hillclimb.stats hillclimb.py

As before, this runs the program, but instead of printing the statistics to the screen, it saves them to a file. Afterward, we can use the pstats module as follows to open the statistics file, which gives us a prompt for manipulating the data:

python -m pstats hillclimb.stats

For example, we can use the sort command to change the sort order and use stats to print what we saw above:

Welcome to the profile statistics browser.
hillclimb.stat% help

Documented commands (type help <topic>):
========================================
EOF  add  callees  callers  help  quit  read  reverse  sort  stats  strip

hillclimb.stat% sort ncall
hillclimb.stat% stats hillclimb
Thu Jan 13 16:44:10 2022    hillclimb.stat

         2686227 function calls (2638031 primitive calls) in 5.582 seconds

   Ordered by: call count
   List reduced from 3456 to 4 due to restriction <'hillclimb'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      101    0.001    0.000    4.951    0.049 hillclimb.py:11(objective)
      100    0.000    0.000    0.001    0.000 hillclimb.py:25(step)
        1    0.000    0.000    5.583    5.583 hillclimb.py:2(<module>)
        1    0.000    0.000    4.952    4.952 hillclimb.py:44(hillclimbing)

hillclimb.stat%

You will notice that the stats command above allows us to provide an extra argument. The argument can be a regular expression to search for the functions such that only those matched will be printed. Hence it is a way to provide a search string to filter.
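The same sorting and filtering can also be done from Python code with the pstats.Stats class rather than the interactive browser. The following is a self-contained sketch: the function busy() and the file name demo.stats are made up for illustration, and the file is written with dump_stats() to stand in for the output of the -o option:

```python
import cProfile
import os
import pstats
import tempfile

def busy(n):
    # stand-in workload; hypothetical, not part of hillclimb.py
    return sorted(str(i) for i in range(n))

# save the statistics to a file, similar to what `python -m cProfile -o` does
stats_path = os.path.join(tempfile.mkdtemp(), "demo.stats")
prof = cProfile.Profile()
prof.runcall(busy, 50000)
prof.dump_stats(stats_path)

# load the file later and view it in any sort order without rerunning the program
stats = pstats.Stats(stats_path)
stats.strip_dirs().sort_stats("ncalls").print_stats(5)   # top 5 rows by call count
stats.sort_stats("cumtime").print_stats("busy")          # only rows matching a regex
```

As with the stats command in the browser, the argument to print_stats() can be an integer to limit the number of rows or a regular expression to filter them.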

This pstats browser allows us to see more than just the table above. The callers and callees commands show us which function calls which function, how many times it is called, and how much time is spent. Hence we can consider that as a breakdown of the function-level statistics. It is useful if you have a lot of functions that call each other and want to know how the time is spent in different scenarios. For example, this shows that the objective() function is called only by the hillclimbing() function, but the hillclimbing() function calls several other functions:

hillclimb.stat% callers objective
   Ordered by: call count
   List reduced from 3456 to 1 due to restriction <'objective'>

Function                    was called by...
                                ncalls  tottime  cumtime
hillclimb.py:11(objective)  <-     101    0.001    4.951  hillclimb.py:44(hillclimbing)


hillclimb.stat% callees hillclimbing
   Ordered by: call count
   List reduced from 3456 to 1 due to restriction <'hillclimbing'>

Function                       called...
                                   ncalls  tottime  cumtime
hillclimb.py:44(hillclimbing)  ->     101    0.001    4.951  hillclimb.py:11(objective)
                                      100    0.000    0.001  hillclimb.py:25(step)
                                        4    0.000    0.000  {built-in method builtins.print}
                                        2    0.000    0.000  {method 'rand' of 'numpy.random.mtrand.RandomState' objects}


hillclimb.stat%
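The callers and callees views are also available from Python code as the print_callers() and print_callees() methods of pstats.Stats. The following is a toy sketch; the functions leaf(), branch(), and root() are hypothetical and only serve to build a small call graph in which leaf() has two different callers:

```python
import cProfile
import io
import pstats

def leaf():
    return sum(range(1000))

def branch():
    # calls leaf() 100 times
    total = 0
    for _ in range(100):
        total += leaf()
    return total

def root():
    # calls branch() once and leaf() once directly
    return branch() + leaf()

prof = cProfile.Profile()
prof.runcall(root)

# collect the report in a string buffer, then print it
buffer = io.StringIO()
stats = pstats.Stats(prof, stream=buffer).strip_dirs()
stats.print_callers("leaf")    # who called leaf(), and how often?
stats.print_callees("root")    # what did root() call?
report = buffer.getvalue()
print(report)
```

The callers report should list both branch() and root() under leaf(), with the call counts split between them.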

Using Profiler Inside Code

The above example assumes you have the complete program saved in a file and want to profile all of it. Sometimes, we want to focus on only a part of the program. For example, if we load a large module that takes time to bootstrap, we may want to exclude that from the profile. In this case, we can invoke the profiler only for certain lines. An example is as follows, which is modified from the program above:

# manually search perceptron hyperparameters for binary classification
import cProfile as profile
import pstats
from numpy import mean
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron

# objective function
def objective(X, y, cfg):
	# unpack config
	eta, alpha = cfg
	# define model
	model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# calculate mean accuracy
	result = mean(scores)
	return result

# take a step in the search space
def step(cfg, step_size):
	# unpack the configuration
	eta, alpha = cfg
	# step eta
	new_eta = eta + randn() * step_size
	# check the bounds of eta
	if new_eta <= 0.0:
		new_eta = 1e-8
	if new_eta > 1.0:
		new_eta = 1.0
	# step alpha
	new_alpha = alpha + randn() * step_size
	# check the bounds of alpha
	if new_alpha < 0.0:
		new_alpha = 0.0
	# return the new configuration
	return [new_eta, new_alpha]

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, step_size):
	# starting point for the search
	solution = [rand(), rand()]
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = step(solution, step_size)
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search with profiling
prof = profile.Profile()
prof.enable()
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
prof.disable()
# print program output
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))
# print profiling output
stats = pstats.Stats(prof).strip_dirs().sort_stats("cumtime")
stats.print_stats(10) # top 10 rows

It will output the following:

>0, cfg=[0.3776271076534661, 0.2308364063203663] 0.75700
>3, cfg=[0.35803234662466354, 0.03204434939660264] 0.77567
>8, cfg=[0.3001050823005957, 0.0] 0.78633
>10, cfg=[0.39518618870158934, 0.0] 0.78633
>12, cfg=[0.4291267905390187, 0.0] 0.78633
>13, cfg=[0.4403131521968569, 0.0] 0.78633
>16, cfg=[0.38865272555918756, 0.0] 0.78633
>17, cfg=[0.38871654921891885, 0.0] 0.78633
>18, cfg=[0.4542440671724224, 0.0] 0.78633
>19, cfg=[0.44899743344802734, 0.0] 0.78633
>20, cfg=[0.5855375509507891, 0.0] 0.78633
>21, cfg=[0.5935318064858227, 0.0] 0.78633
>23, cfg=[0.7606367310048543, 0.0] 0.78633
>24, cfg=[0.855444293727846, 0.0] 0.78633
>25, cfg=[0.9505501566826242, 0.0] 0.78633
>26, cfg=[1.0, 0.0244821888204496] 0.79800
Done!
cfg=[1.0, 0.0244821888204496]: Mean Accuracy: 0.798000
         2179559 function calls (2140124 primitive calls) in 4.941 seconds

   Ordered by: cumulative time
   List reduced from 581 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    4.941    4.941 hillclimb.py:46(hillclimbing)
      101    0.001    0.000    4.939    0.049 hillclimb.py:13(objective)
      101    0.001    0.000    4.931    0.049 _validation.py:375(cross_val_score)
      101    0.002    0.000    4.930    0.049 _validation.py:48(cross_validate)
      101    0.005    0.000    4.903    0.049 parallel.py:960(__call__)
      101    0.235    0.002    3.089    0.031 parallel.py:920(retrieve)
     3030    0.004    0.000    2.849    0.001 _parallel_backends.py:537(wrap_future_result)
     3030    0.020    0.000    2.845    0.001 _base.py:417(result)
     2602    0.016    0.000    2.819    0.001 threading.py:280(wait)
    12447    2.796    0.000    2.796    0.000 {method 'acquire' of '_thread.lock' objects}
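
The same enable/disable pattern can be tried on a small, self-contained workload before applying it to a larger script. The following is only a minimal sketch, not the hill-climbing program above: slow_sum() and workload() are hypothetical functions made up for illustration, cProfile is imported directly, and the report is written to a string buffer rather than straight to the console.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot: sum of squares computed in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical workload that calls the hot spot several times."""
    return [slow_sum(20000) for _ in range(20)]


# Only the code between enable() and disable() is profiled
prof = cProfile.Profile()
prof.enable()
result = workload()
prof.disable()

# Write the report into a string buffer, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(prof, stream=buffer).strip_dirs().sort_stats("cumtime")
stats.print_stats(5)  # top 5 rows
report = buffer.getvalue()
print(report)
```

Capturing the report in a buffer is a design choice for this sketch: it keeps the profiler's output separate from the program's own output, so it can be logged or saved as needed.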

Caveats

Using the profiler with TensorFlow models may not produce what you expect, especially if you have written your own custom layer or custom function for the model. If you did it correctly, TensorFlow is supposed to build the computation graph before your model is executed, so what actually runs is the compiled graph rather than your Python logic. The profiler output will therefore not show your custom classes.

The same applies to some advanced modules that involve binary code. The profiler can see that you called some functions and marks them as “built-in” methods, but it cannot go any further into the compiled code.
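
This limitation can be reproduced without TensorFlow. The following is a minimal sketch under simple assumptions: sort_numbers() is a hypothetical wrapper written for this illustration, and it calls Python's built-in sorted(), which is implemented in compiled code. The profiler should report sorted() as a single "built-in method" row, with no detail about the work done inside it.

```python
import cProfile
import io
import pstats
import random


def sort_numbers(values):
    """Hypothetical wrapper around the built-in sorted(), which runs as compiled code."""
    return sorted(values)


random.seed(0)
data = [random.random() for _ in range(200000)]

# Profile only the call to the wrapper
prof = cProfile.Profile()
prof.enable()
ordered = sort_numbers(data)
prof.disable()

# Capture the full report, sorted by time spent inside each function itself
buffer = io.StringIO()
pstats.Stats(prof, stream=buffer).strip_dirs().sort_stats("tottime").print_stats()
report = buffer.getvalue()
print(report)
```

In the printed report, the row for sorted() stands in for everything that happens inside the compiled code: the profiler records the time of the call but not how that time breaks down.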

Below is a short implementation of the LeNet5 model for the MNIST classification problem. If you profile it and print the top 15 rows, you will see that a wrapper occupies the majority of the time, and nothing can be shown beyond that:

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Load and reshape data to shape of (n_sample, height, width, n_channel)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = np.expand_dims(X_train, axis=3).astype('float32')
X_test = np.expand_dims(X_test, axis=3).astype('float32')

# One-hot encode the output
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# LeNet5 model
model = Sequential([
    Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(16, (5,5), activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(120, (5,5), activation="tanh"),
    Flatten(),
    Dense(84, activation="tanh"),
    Dense(10, activation="softmax")
])
model.summary(line_length=100)

# Training
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=32, callbacks=[earlystopping])

# Evaluate
print(model.evaluate(X_test, y_test, verbose=0))

In the result below, TFE_Py_Execute is marked as a “built-in” method, and it consumes 30.1 seconds out of the total run time of 39.7 seconds. Note that its tottime is the same as its cumtime, meaning that, from the profiler’s perspective, all of that time is spent in this function and it doesn’t call any other functions. This illustrates the limitation of Python’s profiler.

         5962698 function calls (5728324 primitive calls) in 39.674 seconds

   Ordered by: cumulative time
   List reduced from 12295 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   3212/1    0.013    0.000   39.699   39.699 {built-in method builtins.exec}
        1    0.003    0.003   39.699   39.699 mnist.py:4(<module>)
     52/4    0.005    0.000   35.470    8.868 /usr/local/lib/python3.9/site-packages/keras/utils/traceback_utils.py:58(error_handler)
        1    0.089    0.089   34.334   34.334 /usr/local/lib/python3.9/site-packages/keras/engine/training.py:901(fit)
11075/9531    0.032    0.000   33.406    0.004 /usr/local/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:138(error_handler)
     4689    0.089    0.000   33.017    0.007 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py:882(__call__)
     4689    0.023    0.000   32.771    0.007 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py:929(_call)
     4688    0.042    0.000   32.134    0.007 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/function.py:3125(__call__)
     4689    0.075    0.000   30.941    0.007 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/function.py:1888(_call_flat)
     4689    0.158    0.000   30.472    0.006 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/function.py:553(call)
     4689    0.034    0.000   30.152    0.006 /usr/local/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:33(quick_execute)
     4689   30.105    0.006   30.105    0.006 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
  3185/24    0.021    0.000    3.902    0.163 <frozen importlib._bootstrap>:1002(_find_and_load)
  3169/10    0.014    0.000    3.901    0.390 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
  2885/12    0.009    0.000    3.901    0.325 <frozen importlib._bootstrap_external>:844(exec_module)

Finally, Python’s profiler gives you statistics only on time, not on memory usage. You may need to look for another library or tool for that purpose.
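
One option for memory is the tracemalloc module in Python's standard library, which traces memory blocks allocated by Python. It is not covered in this tutorial, so the following is only a minimal sketch. build_table() is a hypothetical function made up for illustration, and memory allocated inside compiled extensions may not be fully visible to this tool.

```python
import tracemalloc


def build_table(n):
    """Hypothetical memory-hungry function: builds a list of n short strings."""
    return [str(i) * 10 for i in range(n)]


# Trace allocations made while the function runs
tracemalloc.start()
table = build_table(100000)
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print("current: %.1f KiB, peak: %.1f KiB" % (current / 1024, peak / 1024))

# Show the three source lines that allocated the most memory
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Grouping the snapshot by "lineno" plays a role similar to sorting profiler rows by time: it points to the source lines responsible for most of the allocations.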

Summary

In this tutorial, we learned what a profiler is and what it can do. Specifically:

We know how to compare small code fragments with the timeit module
We saw that Python’s cProfile module can provide detailed statistics on how time is spent
We learned to use the pstats module on the output of cProfile to sort or filter the statistics

3 Responses to Profiling Python Code

simo June 7, 2022 at 2:54 am #

Hey Adrian,
nice article.
In my opinion would be better to go deeper in Cython as a way to speed up code.

Jürgen A. Erhard June 18, 2022 at 4:02 am #

No mention of line-profiler. Bad article.

- Luis January 20, 2023 at 7:38 pm #
  
  What a nice way to value someone else’s work.
