A Gentle Introduction to Model Selection for Machine Learning

Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward to fit many different machine learning models on a given predictive modeling dataset.

The challenge of applied machine learning, therefore, becomes how to choose among a range of different models that you can use for your problem.

Naively, you might believe that model performance is sufficient, but should you also consider other concerns, such as how long the model takes to train or how easy it is to explain to project stakeholders? These concerns become more pressing if the chosen model must be used operationally for months or years.

Also, what are you choosing exactly: just the algorithm used to fit the model or the entire data preparation and model fitting pipeline?

In this post, you will discover the challenge of model selection for machine learning.

After reading this post, you will know:

  • Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
  • There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
  • The two main classes of model selection techniques are probabilistic measures and resampling methods.

Let’s get started.

A Gentle Introduction to Model Selection for Machine Learning
Photo by Bernard Spragg. NZ, some rights reserved.

Overview

This tutorial is divided into three parts; they are:

  1. What Is Model Selection
  2. Considerations for Model Selection
  3. Model Selection Techniques

What Is Model Selection

Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset.

Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM).

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one?

— Page 22, Machine Learning: A Probabilistic Perspective, 2012.

For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We cannot know beforehand which model will perform best on this problem. Therefore, we fit and evaluate a suite of different models on the problem.

Model selection is the process of choosing one of the models as the final model that addresses the problem.
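As a rough illustration of fitting and evaluating a suite of models, the sketch below scores a few candidates with scikit-learn and picks the one with the highest estimated accuracy. The synthetic dataset, the particular candidates, and the choice of 5-fold cross-validation with accuracy are all assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data stands in for a real problem (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# Candidates span different model types and different hyperparameters.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm_linear": SVC(kernel="linear"),
    "svm_rbf": SVC(kernel="rbf"),
    "knn_5": KNeighborsClassifier(n_neighbors=5),
}

# Score every candidate in the same way: mean accuracy over 5 folds.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Select the candidate with the highest estimated accuracy.
selected = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("selected:", selected)
```

In practice the differences between candidates may be small relative to the noise in the estimates, so the highest mean score is only one input into the decision.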

Model selection is different from model assessment.

For example, we evaluate or assess candidate models in order to choose the best one, and this is model selection. Whereas once a model is chosen, it can be evaluated in order to communicate how well it is expected to perform in general; this is model assessment.

The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.
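The distinction can be sketched in code. In this minimal example, which assumes a synthetic dataset and a KNN classifier with a handful of hypothetical K values, model selection happens on the training portion only, and model assessment uses a test set that played no part in the choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=1)

# Hold back a test set that plays no part in choosing the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Model selection: compare levels of flexibility (values of K) on the training data only.
k_values = [1, 5, 15, 45]
cv_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in k_values
}
best_k = max(cv_scores, key=cv_scores.get)

# Model assessment: estimate how the chosen configuration performs on unseen data.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final.score(X_test, y_test)
print(f"selected K={best_k}, test accuracy={test_accuracy:.3f}")
```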

Considerations for Model Selection

Fitting models is relatively straightforward; selecting among them is the true challenge of applied machine learning.

Firstly, we need to get over the idea of a “best” model.

All models have some predictive error, given the statistical noise in the data, the incompleteness of the data sample, and the limitations of each different model type. Therefore, the notion of a perfect or best model is not useful. Instead, we must seek a model that is “good enough.”

What do we care about when choosing a final model?

The project stakeholders may have specific requirements, such as maintainability and limited model complexity. As such, a model that has lower skill but is simpler and easier to understand may be preferred.

Alternatively, if model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved.

Therefore, a “good enough” model may refer to many things and is specific to your project, such as:

  • A model that meets the requirements and constraints of project stakeholders.
  • A model that is sufficiently skillful given the time and resources available.
  • A model that is skillful as compared to naive models.
  • A model that is skillful relative to other tested models.
  • A model that is skillful relative to the state-of-the-art.
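One of these checks, skill relative to a naive model, is easy to sketch. The example below assumes a synthetic, well-separated classification dataset and uses scikit-learn's DummyClassifier (which always predicts the most frequent class) as the naive baseline; the candidate model and the 5-fold setup are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, fairly easy problem (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=1,
)

# Naive baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
# Candidate model to compare against the baseline.
model = LogisticRegression(max_iter=1000)

baseline_acc = cross_val_score(baseline, X, y, cv=5).mean()
model_acc = cross_val_score(model, X, y, cv=5).mean()

# A model is only "skillful" in this sense if it beats the naive baseline.
is_skillful = model_acc > baseline_acc
print(f"baseline={baseline_acc:.3f} model={model_acc:.3f} skillful={is_skillful}")
```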

Next, we must consider what is being selected.

For example, we are not selecting a fitted model, as all of the models fit during evaluation will be discarded. This is because once we choose a model, we will fit a new final model on all available data and start using it to make predictions.
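A minimal sketch of this idea, assuming scikit-learn and a synthetic dataset: cross-validation fits and discards temporary copies of the model, and the final model is then fit from scratch on all available data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# The chosen configuration; cross_val_score fits clones of it and throws them away.
chosen = LogisticRegression(max_iter=1000)
estimate = cross_val_score(chosen, X, y, cv=5).mean()

# The object passed to cross_val_score is left unfitted.
print("chosen is fitted:", hasattr(chosen, "coef_"))

# Fit a new final model on all available data and use it to make predictions.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = final_model.predict(X[:3])
print(f"estimated accuracy={estimate:.3f}, example predictions={predictions}")
```

The cross-validated estimate describes the model development procedure, not the final fitted model itself, which is never evaluated on held-out data in this sketch.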

Therefore, are we choosing among algorithms used to fit the models on the training dataset?

Some algorithms require specialized data preparation in order to best expose the structure of the problem to the learning algorithm. Therefore, we must go one step further and consider model selection as the process of selecting among model development pipelines.

Each pipeline may take in the same raw training dataset and output a model that can be evaluated in the same manner, but may require different or overlapping computational steps, such as:

  • Data filtering.
  • Data transformation.
  • Feature selection.
  • Feature engineering.
  • And more…
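Selecting among pipelines rather than bare algorithms might look like the sketch below. It assumes scikit-learn's Pipeline class and a synthetic dataset; the two pipelines and their step names are hypothetical examples. Because each whole pipeline is passed to cross-validation, the data preparation steps are re-fit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# Each candidate is a full pipeline: data preparation plus a learning algorithm.
pipelines = {
    "scale+logistic": Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "scale+select+knn": Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=5)),
        ("model", KNeighborsClassifier(n_neighbors=5)),
    ]),
}

# All pipelines take the same raw data and are evaluated in the same manner.
pipeline_scores = {
    name: cross_val_score(pipe, X, y, cv=5).mean() for name, pipe in pipelines.items()
}
best_pipeline = max(pipeline_scores, key=pipeline_scores.get)
print(pipeline_scores, "->", best_pipeline)
```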

The closer you look at the challenge of model selection, the more nuance you will discover.

Now that we are familiar with some considerations involved in model selection, let’s review some common methods for selecting a model.

Model Selection Techniques

The best approach to model selection requires “sufficient” data, which may be nearly infinite depending on the complexity of the problem.

In this ideal situation, we would split the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and select them on the validation set, and report the performance of the final model on the test set.

If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
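The data-rich procedure can be sketched as follows. The 60/20/20 split, the synthetic dataset, and the two candidate models are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# Randomly divide into train (60%), validation (20%), and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn_5": KNeighborsClassifier(n_neighbors=5),
}

# Fit on the training set; estimate prediction error on the validation set.
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in candidates.items()
}
best_name = max(val_scores, key=val_scores.get)

# Report the performance of the chosen model on the untouched test set.
test_score = candidates[best_name].score(X_test, y_test)
print(f"selected={best_name}, validation={val_scores[best_name]:.3f}, test={test_score:.3f}")
```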

This is impractical on most predictive modeling problems given that we rarely have sufficient data, or are even able to judge what would be sufficient.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance.

– Page 32, Pattern Recognition and Machine Learning, 2006.

Instead, there are two main classes of techniques to approximate the ideal case of model selection; they are:

  • Probabilistic Measures: Choose a model via in-sample error and complexity.
  • Resampling Methods: Choose a model via estimated out-of-sample error.

Let’s take a closer look at each in turn.

Probabilistic Measures

Probabilistic measures involve analytically scoring a candidate model using both its performance on the training dataset and the complexity of the model.

It is known that training error is optimistically biased, and therefore is not a good basis for choosing a model. The performance can be penalized based on how optimistic the training error is believed to be. This is typically achieved using algorithm-specific methods, often linear, that penalize the score based on the complexity of the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

– Page 33, Pattern Recognition and Machine Learning, 2006.

A model with fewer parameters is less complex and, all else being equal, is preferred because it is likely to generalize better on average.

Four commonly used probabilistic model selection measures include:

  • Akaike Information Criterion (AIC).
  • Bayesian Information Criterion (BIC).
  • Minimum Description Length (MDL).
  • Structural Risk Minimization (SRM).

Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. in-sample bias) is known and tractable.
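As a hedged sketch of how such a score might be computed, the example below calculates AIC and BIC for polynomial regression models of increasing degree using NumPy. It assumes Gaussian errors with the variance estimated by maximum likelihood, uses the standard forms AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), and counts the noise variance as one of the k parameters. The synthetic quadratic data and the helper function names are made up for the example.

```python
import numpy as np

# Synthetic data from a quadratic with Gaussian noise (an assumption for illustration).
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=n)


def gaussian_log_likelihood(y_true, y_pred):
    """Maximized log-likelihood under Gaussian errors with MLE variance RSS / n."""
    n_obs = len(y_true)
    rss = float(np.sum((y_true - y_pred) ** 2))
    return -0.5 * n_obs * (np.log(2 * np.pi) + np.log(rss / n_obs) + 1)


def aic_bic(degree):
    """Fit a polynomial of the given degree and return (AIC, BIC); lower is better."""
    coefs = np.polyfit(x, y, deg=degree)
    log_lik = gaussian_log_likelihood(y, np.polyval(coefs, x))
    k = degree + 2  # degree + 1 coefficients, plus the noise variance
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic


scores = {d: aic_bic(d) for d in range(1, 7)}
best_by_aic = min(scores, key=lambda d: scores[d][0])
best_by_bic = min(scores, key=lambda d: scores[d][1])
for d, (aic, bic) in scores.items():
    print(f"degree={d} AIC={aic:.1f} BIC={bic:.1f}")
print("best by AIC:", best_by_aic, "best by BIC:", best_by_bic)
```

Both scores are computed from the training data alone. With this sample size the BIC penalty per parameter is larger than the AIC penalty, so BIC leans toward simpler models.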

Resampling Methods

Resampling methods seek to estimate the performance of a model (or more precisely, the model development process) on out-of-sample data.

This is achieved by splitting the training dataset into sub train and test sets, fitting a model on the sub train set, and evaluating it on the test set. This process may then be repeated multiple times and the mean performance across each trial is reported.

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training datasets or test datasets.
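The repeated split, fit, and evaluate loop can be sketched with scikit-learn's ShuffleSplit, which draws a new random train/test partition for each trial. The number of trials, the 70/30 split, the model, and the synthetic dataset are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# Ten trials, each with a fresh random 70/30 split into sub train and test sets.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)

# One score per trial: fit on the sub train set, evaluate on the test set.
trial_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splitter)

# Report the mean performance across trials, with the spread as a rough guide to noise.
mean_score = trial_scores.mean()
std_score = trial_scores.std()
print(f"accuracy: {mean_score:.3f} (+/- {std_score:.3f}) over {len(trial_scores)} trials")
```

Because the splits are drawn independently, the same example can land in several test sets, which is one way the trials end up correlated.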

Three common resampling model selection methods include:

  • Random train/test splits.
  • Cross-Validation (k-fold, LOOCV, etc.).
  • Bootstrap.

Most of the time, probabilistic measures (described in the previous section) are not available, so resampling methods are used.

By far the most popular is the cross-validation family of methods that includes many subtypes.

Probably the simplest and most widely used method for estimating prediction error is cross-validation.

— Page 241, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.

An example is the widely used k-fold cross-validation, which splits the training dataset into k folds where each example appears in a test set only once.

Another is leave-one-out cross-validation (LOOCV), where the test set is composed of a single sample and each sample is given an opportunity to be the test set, requiring N models to be constructed and evaluated, where N is the number of samples in the training set.
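Both variants are available in scikit-learn, and the sketch below runs each on a small synthetic dataset (the data, the model, and k=5 are assumptions for illustration). It also counts how often each example lands in a k-fold test set, and shows that leave-one-out produces one score per sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic dataset so that leave-one-out stays cheap (an assumption).
X, y = make_classification(n_samples=60, n_features=5, n_informative=3, random_state=1)
model = LogisticRegression(max_iter=1000)

# k-fold: split the data into k=5 folds; each fold serves as the test set once.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
test_counts = np.zeros(len(X), dtype=int)
for _, test_idx in kfold.split(X):
    test_counts[test_idx] += 1  # count how often each example is tested
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Leave-one-out: N models, each evaluated on a single held-out sample.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("times each example was tested:", set(test_counts.tolist()))
print(f"5-fold mean accuracy: {kfold_scores.mean():.3f}")
print(f"LOOCV mean accuracy:  {loo_scores.mean():.3f} from {len(loo_scores)} models")
```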

Summary

In this post, you discovered the challenge of model selection for machine learning.

Specifically, you learned:

  • Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
  • There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
  • The two main classes of model selection techniques are probabilistic measures and resampling methods.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

23 Responses to A Gentle Introduction to Model Selection for Machine Learning

  1. marco December 3, 2019 at 12:57 am

    Hello Jason,
    I’ve installed TensorFlow 2.0.0-alpha0. Do I have to upgrade to TensorFlow 2.0?
    How do I upgrade from alpha0 to the new release? (Or is it better to leave it as is?)
    I did not find any article about how to upgrade it 🙁
    Thanks,
    Marco

  2. Atieh December 10, 2019 at 3:26 am

    it was an amazing tutorial.
    Thanks a bunch.

  3. Achilles December 12, 2019 at 4:12 am

    Hey Jason. Your page has helped me a lot so far, enough that I chose to write a scholarly thesis about machine learning, which is not my major. The problem is that the basic Python books, deep learning books, and YouTube videos did not explain which model I should use or how my input/output Excel sheet should be laid out. I found your webpage a few weeks ago and have been learning much more systematically since then, which is good. However, I have not yet found how to pass my input to a model, or which functional API I need to use. Here is a brief explanation of the model I need to build and the Excel file.

    There are 20 values each of c and p, which are paired and applied to see how much displacement (z) happens across the x and y axes.

    input tab = 1st row has 2 variables(c, p) and 20 value for each.
    x coordinate tab = 1st row is just an index A~T where (c, p) paired 20 inputs located (so 20 column). and 60 rows of x-coordinate
    y coordinate tab = 1st row is just an index A~T where (c, p) paired 20 inputs located (so 20 column). and 60 rows of y-coordinate
    displacement tab = 1st row is ‘displacement’ which has only 1 column.

    Thank you for reading my problem. any keyword for further reading or link of page will be so much appreciated !

    • Achilles December 12, 2019 at 4:29 am

      btw, this has already been solved with a Matlab model by a friend in a master’s program, whom my advisor assigned to help with my thesis. But I don’t know Matlab, and I want to build my own model in Python as easily as possible.
      His thesis seems to be about: surrogate models, Response Surface Methodology, reduced-order models, etc. (OMG!)

      • Jason Brownlee December 12, 2019 at 6:31 am

        Perhaps talk it through with your advisor – it is their job after all?

        • Achilles December 12, 2019 at 5:48 pm

          Jason, thank you for the link. I will go through them. It seems lots of information I can get in there ! 🙂

          btw, I tried to understand my data by comparing it with the MNIST data, for which I found relatively many tutorials, so I am a bit more familiar with it.
          Let me imagine that I put the 28×28 pixel data of an image of a handwritten number in Excel. Then there would be 784 columns, and each column would hold a color value ranging from 0 to 255. If I have 100 image files, then there will be 100 rows. This input Excel tab will go into a classification model, with 100 nodes in the 1st hidden layer. The output will be 10 (the numbers 0~9), so the number of nodes from the last hidden layer to the output layer will be 10. So far, have I understood MNIST correctly?
          Then I tried to understand my data in the same way. Here is my comparison between them.

          Input layer : 784 columns/100 rows-> 2 columns/20 rows
          number of node to 1st hidden layer : 100 -> 20
          output layer : 1 column/10 rows -> 1 column/120 rows
          (and not sure what to do with my x, y tabs)

          Let me know what the problem with my approach is. Maybe I should just forget everything and start with the link you provided, step by step, because I am horribly confused!

          p.s. Well, I already did ask. He said I can ask the master’s degree guy about ANNs, and he can’t help either because he only knows about ANNs in Matlab.

          • Achilles December 12, 2019 at 6:31 pm

            I see I am seriously confused.

            I realized that a node is a thing in a layer, and the number of data samples (100 images or 20 (c, p) pairs) has nothing to do with nodes or weights.
            Therefore the blueprint of the ANN should be like:

            input node : 784 -> 2 (or 1 if c and p should be paired ?)
            output node : 10 -> 60 for each x and y axis in x tab and y tab
            probably easiest model to use : ? (I think keras can do it for my case except ‘Sequential’)

            hidden layer node : hyper parameter. need to give some random number of them by human
            number of hidden layer : hyperparameter. need to give some random number of them
            weight between layers : model parameter. AI will decide from data. (https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/)

          • Jason Brownlee December 13, 2019 at 5:55 am

            Sorry, not sure I follow your question.

            As for choosing the number of nodes/layers, this might help:
            https://machinelearningmastery.com/faq/single-faq/how-many-layers-and-nodes-do-i-need-in-my-neural-network

        • achilles December 18, 2019 at 8:34 pm

          thank you Jason for the link and your kindness 🙂

    • Jason Brownlee December 12, 2019 at 6:31 am

      Not sure how I can help exactly?

      Perhaps this process will help you to work through your predictive modeling problem:
      https://machinelearningmastery.com/start-here/#process

  4. Gizo January 13, 2020 at 5:43 am

    Hello Jason.
    Thank you for the wonderful post.

    i have a question.

    Which model selection method should be used for time series data in a regression problem, so that the time order is preserved?

    i am trying to compare models SARIMA, MLP, 1D-CNN and LSTM.

    thanks.

    • Jason Brownlee January 13, 2020 at 8:32 am

      Perhaps pick the model with the lowest error via walk-forward validation.

      • Gizo January 13, 2020 at 9:50 am

        Thanks a lot, Sir.
        I will try.

  5. ashima March 5, 2020 at 12:39 am

    Hey Jason,

    How can I make a more complex model in deep learning, for example using 2 LSTM layers and 1 GRU with an attention layer, to train on 10000 data files for prediction?

    Thank you

  6. Bahar October 15, 2021 at 11:43 pm

    Dear,

    I am a fan of your posts and I already bought all of your books :). Now I am focusing on Federated (Machine) Learning, and I know that in FML or FL we train the model on the client side. My question is how to choose our first initial model for our problem on FD? Do we need to do all the evaluation steps of selecting the best Machine Learning model (train and test with different models) using, for example, Cross-Validation? And when we are sure which model is best for our current problem, do we then choose it as an initial model, send it to our client, and let it be trained on the client devices?

    Thanks in advance,
    Kind Regards,
    Bahar

    • Adrian Tam October 20, 2021 at 8:28 am

      I think your description is correct. Cross-validation always gives us confidence in our model selection. But for federated learning, I suspect it is more tolerant of wrong or sub-optimal models, as there are other models to correct one client’s mistake (just like ensemble learning).

  7. Mohammad April 24, 2024 at 3:53 am

    Hi Jason,

    Thanks for writing the article. It is very informative.

    There is a “one standard error” rule that is used with cross validation to consider complexity alongside performance for model selection. It helps to select the most parsimonious model with a prediction error that is not much worse than its minimum validation error. Would you recommend using it for regression tasks?

    Best regards

    • James Carmichael April 24, 2024 at 9:20 am

      Hi Mohammad…Yes, using the “one standard error” rule in cross-validation for regression tasks is often recommended. This rule allows for the selection of a simpler model if its performance is within one standard error of the best model found through cross-validation. This approach is particularly beneficial because it balances model complexity and generalization error, helping prevent overfitting while still achieving robust predictive performance.

      Implementing this rule involves choosing the simplest model whose validation error is not significantly worse than that of the best-performing model. This method promotes model simplicity and robustness, which are valuable attributes in many practical applications, especially in regression tasks where the risk of overfitting is a concern.

      • Mohammad April 25, 2024 at 10:07 pm

        Hi James,

        Thank you for the response. Indeed, it is an interesting approach. However, I wonder how good it is at finding the best model across different types of models, e.g. linear regression, decision trees, and neural networks. Most of the literature that I came across only uses it to find the best model among models of the same type. If you have come across any sources that use it to find the best model across different model types, please share.

        Best regards
