When we work on a machine learning project, we quite often need to experiment with multiple alternatives. Some features in Python allow us to try out different options without much effort. In this tutorial, we are going to see some tips to make our experiments faster.
After finishing this tutorial, you will learn:
- How to leverage Python's duck typing to swap functions and objects easily
- How making components into drop-in replacements for each other can help experiments run faster
Kick-start your project with my new book Python for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.

Overview
This tutorial is in three parts; they are:
- Workflow of a machine learning project
- Functions as objects
- Caveats
Workflow of a Machine Learning Project
Consider a very simple machine learning project as follows:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = SVC()
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
This is a typical machine learning project workflow. We have a stage of preprocessing the data, then training a model, and afterward, evaluating our result. But in each step, we may want to try something different. For example, we may wonder if normalizing the data would make it better. So we may rewrite the code above into the following:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
So far, so good. But what if we keep experimenting with different datasets, different models, or different score functions? Flipping back and forth between using a scaler and not using one would mean a lot of code changes each time, and it would be quite easy to make mistakes.
Because Python supports duck typing, we can see that the following two classifier models implement the same interface:
```python
clf = SVC()
clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
```
Therefore, we can simply select between these two versions and keep everything else intact. We can say these two models are drop-in replacements for each other.
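Duck typing means only the interface matters: any object that offers the same `fit()` and `score()` methods can sit in the same slot, even a class of our own, with no inheritance required. As a minimal sketch (the `MajorityClassifier` below is hypothetical, written just to illustrate the point):

```python
import numpy as np

class MajorityClassifier:
    """A toy model that always predicts the most frequent training label.
    Because it offers fit() and score(), it is a drop-in replacement
    for SVC() or the Pipeline above."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def score(self, X, y):
        # accuracy of always guessing the majority class
        return float(np.mean(np.asarray(y) == self.majority_))

clf = MajorityClassifier()  # usable wherever clf.fit()/clf.score() is expected
```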
Making use of this property, we can create a toggle variable to control the design choice we make:
```python
USE_SCALER = True

if USE_SCALER:
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
else:
    clf = SVC()
```
By toggling the variable `USE_SCALER` between `True` and `False`, we can select whether a scaler should be applied. A more complex example would be to select among different scalers and classifier models, such as:
```python
SCALER = "standard"
CLASSIFIER = "svc"

if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler', MinMaxScaler()), ('classifier', model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError
```
A complete example is as follows:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toggle between options
SCALER = "maxmin"    # "standard", "maxmin", or None
CLASSIFIER = "cart"  # "svc" or "cart"

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Create model
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler', MinMaxScaler()), ('classifier', model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

# Train
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
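Since every scaler-classifier combination is a drop-in replacement for the others, we could even loop over all of them in a single run. Here is a rough sketch that reuses `X_train`, `X_val`, `y_train`, and `y_val` from the complete example above:

```python
from itertools import product

# try every scaler-classifier combination in one run
for scaler_name, clf_name in product(["standard", "maxmin", None], ["svc", "cart"]):
    model = SVC() if clf_name == "svc" else DecisionTreeClassifier()
    if scaler_name == "standard":
        clf = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
    elif scaler_name == "maxmin":
        clf = Pipeline([('scaler', MinMaxScaler()), ('classifier', model)])
    else:
        clf = model
    clf.fit(X_train, y_train)
    print(scaler_name, clf_name, clf.score(X_val, y_val))
```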
If you go one step further, you may even skip the toggle variable and use a string literal directly for a quick experiment. Since a non-empty string is truthy in Python, `not "USE SCIPY"` below is always `False`, so the SciPy branch runs; changing the string to the empty string `""` flips execution to the NumPy branch:
```python
import numpy as np
import scipy.stats as stats

# Covariance matrix and Cholesky decomposition
cov = np.array([[1, 0.8], [0.8, 1]])
L = np.linalg.cholesky(cov)

# Generate 100 pairs of bi-variate Gaussian random numbers
if not "USE SCIPY":
    z = np.random.randn(100, 2)
    x = z @ L.T
else:
    x = stats.multivariate_normal(mean=[0, 0], cov=cov).rvs(100)

...
```
Functions as Objects
In Python, functions are first-class citizens: you can assign a function to a variable. Indeed, functions are objects in Python, as are classes (the classes themselves, not only their instances). Therefore, we can use the same technique as above to experiment with similar functions.
```python
import numpy as np

DIST = "normal"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```
The above is similar to calling `np.random.normal(size=(10,5))`, but we hold the function in a variable for the convenience of swapping one function for another. Note that since we call the functions with the same argument, we have to make sure all variations accept it. If one does not, we may need a few additional lines of code to wrap it. For example, to generate numbers from a Student's t distribution, we need an additional parameter for the degrees of freedom:
```python
import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    def t_wrapper(size):
        # Student's t distribution with 3 degrees of freedom
        return np.random.standard_t(df=3, size=size)
    rangen = t_wrapper
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```
This works because `np.random.normal`, `np.random.uniform`, and the `t_wrapper` we defined are all drop-in replacements for each other.
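Since functions are objects, we do not even need the hand-written wrapper or the `if`/`elif` chain. A sketch of an alternative (not part of the example above) uses `functools.partial` to pin the degrees of freedom and a dictionary to dispatch:

```python
import numpy as np
from functools import partial

# partial() fixes df=3, yielding a callable that takes size= just like
# np.random.normal -- an alternative to defining t_wrapper by hand
generators = {
    "normal": np.random.normal,
    "uniform": np.random.uniform,
    "t": partial(np.random.standard_t, df=3),
}

rangen = generators["t"]          # dictionary lookup replaces the if/elif chain
random_data = rangen(size=(10, 5))
print(random_data)
```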
Caveats
Machine learning projects differ from other programming projects because there is more uncertainty in the workflow. When you build a web page or a game, you have a picture in your mind of what you want to achieve. A machine learning project, by contrast, involves exploratory work.
In other projects, you would probably use a source code control system such as git or Mercurial to manage the development history. In a machine learning project, however, we are trying out different combinations of many steps; using git to manage each variation may not fit and can even be overkill. Therefore, using a toggle variable to control the flow lets us try out different things faster. This is especially handy when we are working in Jupyter notebooks.
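As a small variation on the same idea (a sketch, with variable names invented for illustration), the toggle can also be read from an environment variable, so rerunning a script with a different option needs no source edit at all:

```python
import os

# e.g. run as:  SCALER=maxmin CLASSIFIER=cart python experiment.py
SCALER = os.environ.get("SCALER", "standard")
CLASSIFIER = os.environ.get("CLASSIFIER", "svc")
```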
However, as we put multiple versions of code together, the program becomes clumsy and less readable. It is better to do some clean-up once we have confirmed the decision; this will help with maintenance in the future.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Fluent Python, second edition, by Luciano Ramalho, https://www.amazon.com/dp/1492056359/
Summary
In this tutorial, you’ve seen how the duck typing property in Python helps us create drop-in replacements. Specifically, you learned:
- Duck typing can help us switch between alternatives easily in a machine learning workflow
- We can make use of a toggle variable to experiment among alternatives