How to Get the Most From Your Machine Learning Data

The data that you use, and how you use it, will likely define the success of your predictive modeling project.

Data and the framing of your problem may be the point of biggest leverage on your project.

Choosing the wrong data or the wrong framing for your problem may lead to a model with poor performance or, at worst, a model that cannot converge.

It is not possible to analytically calculate what data to use or how to use it, but it is possible to use a trial-and-error process to discover how to best use the data that you have.

In this post, you will discover how to get the most from your data on your machine learning project.

After reading this post, you will know:

  • The importance of exploring alternate framings of your predictive modeling problem.
  • The need to develop a suite of “views” on your input data and to systematically test each.
  • The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Jean-Marc Bolfing, some rights reserved.

Overview

This post is divided into 8 parts; they are:

  1. Problem Framing
  2. Collect More Data
  3. Study Your Data
  4. Training Data Sample Size
  5. Feature Selection
  6. Feature Engineering
  7. Data Preparation
  8. Go Further

1. Problem Framing

Brainstorm multiple ways to frame your predictive modeling problem.

The framing of the problem means the combination of:

  • Inputs
  • Outputs
  • Problem Type

For example:

  • Can you use more or less data as inputs to the model?
  • Can you predict something else instead?
  • Can you change the problem to be regression/classification/sequence/etc.?

The more creative you get, the better.

Use ideas from other projects, papers, and the domain itself.

Brainstorm. Write down all of the ideas, even if they are crazy.
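As a concrete illustration of one reframing from the list above, the sketch below turns a regression target into a classification target by binning it. The data and the thresholds are made up for illustration; your own cut points would come from the domain.

```python
# Sketch: reframing a continuous target as a classification problem
# by binning it. Data and thresholds are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(1)
y_regression = rng.normal(loc=50.0, scale=10.0, size=100)  # continuous target

# Framing A: predict the raw value (regression).
# Framing B: predict a category (classification), e.g. low/medium/high.
cut_points = [40.0, 60.0]
y_classification = np.digitize(y_regression, cut_points)  # 0=low, 1=medium, 2=high

print(sorted(set(y_classification)))
```

The same inputs now support two different framings, each of which can be tested as its own modeling experiment.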

I have some frameworks that will help with brainstorming the framing of a problem.

I also talk a little about changing the problem type in a separate post.


2. Collect More Data

Get more data than you need, even data that is tangentially related to the outcome being predicted.

We cannot know how much data will be needed.

Data is the currency spent during model development. It is the oxygen needed by the project to breathe. Each time you use some data, there is less available for other tasks.

You need to spend data on tasks like:

  • Model training.
  • Model evaluation.
  • Model tuning.
  • Model validation.
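The idea of spending data on separate tasks can be sketched as a set of disjoint splits. The proportions below are illustrative, not a recommendation.

```python
# Sketch: "spending" a fixed dataset on separate tasks via disjoint
# index splits. Proportions are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
indices = rng.permutation(n)

train_idx = indices[:600]    # model training
tune_idx = indices[600:800]  # model tuning and validation
test_idx = indices[800:]     # final model evaluation

print(len(train_idx), len(tune_idx), len(test_idx))
```

Each observation is spent on exactly one task; the more tasks you need to fund, the more data it helps to have collected up front.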

Further, the project is new. No one has done your specific project before, modeled your specific data. You don’t really know what features will be useful yet. You might have ideas, but you don’t know. Collect them all; make them all available at this stage.

3. Study Your Data

Use every data visualization you can think of to look at your data from every angle.

  • Looking at raw data helps. You will notice things.
  • Looking at summary statistics helps. Again, you will notice things.
  • Data visualization is like a beautiful combination of these two ways of learning. You will notice a lot more things.

Spend a long time with your raw data and summary statistics first; move on to visualizations last, as they can take more time to prepare.
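A first pass at summary statistics can be as simple as the sketch below, which uses a synthetic feature as a stand-in for one column of your raw data.

```python
# Sketch: quick summary statistics for a single feature before any
# plotting. The synthetic data stands in for a real column.
import numpy as np

rng = np.random.default_rng(0)
feature = rng.exponential(scale=2.0, size=500)  # a skewed feature

stats = {
    "n": feature.size,
    "mean": float(np.mean(feature)),
    "std": float(np.std(feature)),
    "min": float(np.min(feature)),
    "median": float(np.median(feature)),
    "max": float(np.max(feature)),
}
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Even this tiny table will surface things worth noticing, such as a mean well above the median hinting at a skewed distribution.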

Use every data visualization you can think of, as well as those you can glean from books and papers on your data.

  • Review plots.
  • Save plots.
  • Annotate plots.
  • Show plots to domain experts.

You are seeking a little more insight into the data. Ideas that you can use to help better select, engineer, and prepare data for modeling. It will pay off.

4. Training Data Sample Size

Perform a sensitivity analysis with your data sample to see how much (or little) data you actually need.

You do not have all observations. If you did, you would not need to make predictions for new data.

Instead, you are working with a sample of the data. Therefore, there is an open question as to how much data will be needed to fit the model.

Don’t assume that more is better. Test.

  • Design experiments to see how model skill changes with sample size.
  • Use statistics to see how important trends and tendencies change with sample size.

Without this knowledge, you won’t know enough about your test harness to comment on model skill sensibly.
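The shape of such a sensitivity analysis is sketched below. Here the "skill" proxy is simply the error of a sample mean against a known value; in a real project you would substitute cross-validated model skill at each sample size.

```python
# Sketch: sensitivity analysis of an estimate vs. sample size.
# The error of the sample mean stands in for model skill; the data
# is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(42)
true_mean = 5.0
population = rng.normal(loc=true_mean, scale=2.0, size=100_000)

errors = {}
for n in [10, 100, 1000, 10_000]:
    sample = rng.choice(population, size=n, replace=False)
    errors[n] = abs(sample.mean() - true_mean)
    print(f"n={n:>6}: |mean error| = {errors[n]:.4f}")
```

Plotting skill (or estimate error) against sample size shows where the curve flattens out, which tells you how much data your test harness actually needs.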

Learn more about sample size in my dedicated post on the topic.

5. Feature Selection

Create many different views of your input features and test each one.

You don’t know what variables will be helpful or most helpful in your predictive modeling problem.

  • You can guess.
  • You can use advice from domain experts.
  • You can even use suggestions from feature selection methods.

But they are all just guesses.

Each set of suggested input features is a “view” on your problem. An idea on what features might be useful for modeling and predicting the output variable.

Brainstorm, compute, and collect as many different views of your input data as you can.

Design experiments and carefully test and compare each view. Use data to inform you which features and which view are the most predictive.
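Comparing views can be sketched as below. The scoring rule here is a trivial stand-in (absolute correlation with the target) rather than a real cross-validated model evaluation, and the features are synthetic.

```python
# Sketch: comparing "views" (feature subsets) on the same problem.
# Absolute correlation with the target stands in for a real
# cross-validated evaluation; the data is synthetic.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x_useful = rng.normal(size=n)                       # predictive feature
x_noise = rng.normal(size=n)                        # unrelated feature
y = 2.0 * x_useful + rng.normal(scale=0.5, size=n)  # target

views = {
    "view_useful": x_useful,
    "view_noise": x_noise,
}
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in views.items()}
best_view = max(scores, key=scores.get)
print(best_view, scores)
```

The point is the experimental shape: enumerate views, score each the same way, and let the data pick the winner.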

For more on feature selection, see my dedicated post on the topic.

6. Feature Engineering

Use feature engineering to create additional features and views on your predictive modeling problem.

Sometimes you have all of the data you can get, but a given feature or set of features locks up knowledge that is too dense for the machine learning methods to learn and map to the outcome variable.

Examples include:

  • Date/Times.
  • Transactions.
  • Descriptions.

Break down these data into simpler additional component features, such as counts, flags, and other elements.
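For the date/time case, the decomposition might look like the sketch below. The component names are made up for illustration; which ones matter depends on your problem.

```python
# Sketch: breaking a single date/time into simpler component
# features. Component names are illustrative.
from datetime import datetime

timestamp = datetime(2018, 4, 16, 6, 8)  # an arbitrary example timestamp

features = {
    "year": timestamp.year,
    "month": timestamp.month,
    "day": timestamp.day,
    "hour": timestamp.hour,
    "weekday": timestamp.weekday(),               # 0 = Monday
    "is_weekend": int(timestamp.weekday() >= 5),  # simple flag
    "is_morning": int(5 <= timestamp.hour < 12),  # simple flag
}
print(features)
```

One opaque timestamp becomes several simple columns, each of which a learning algorithm can use directly.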

Make things as simple as you can for the modeling process.

For more on feature engineering, see my dedicated post on the topic.

7. Data Preparation

Pre-process data every way you can think of to meet the expectations of algorithms and more.

Pre-processing data, like feature selection and feature engineering, creates additional views on your input features.

Some algorithms have preferences regarding pre-processing, such as:

  • Normalized input features.
  • Standardized input features.
  • Stationary input features.

Prepare the data in anticipation of these expectations, but then go further.

Apply every data pre-processing method you can think of on your data. Keep creating new views on your problem and test them with one or a suite of models to see what works best.
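Two of the pre-processed views mentioned above, normalization and standardization, can be sketched on a single synthetic feature:

```python
# Sketch: creating normalized and standardized views of one feature.
# The feature is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(9)
feature = rng.uniform(low=10.0, high=90.0, size=50)

# Normalization: rescale to the range [0, 1].
normalized = (feature - feature.min()) / (feature.max() - feature.min())

# Standardization: rescale to zero mean and unit variance.
standardized = (feature - feature.mean()) / feature.std()

print(normalized.min(), normalized.max())
print(round(standardized.mean(), 6), round(standardized.std(), 6))
```

Each transform is another view of the same column; evaluate them against your models rather than assuming one is best.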

Your objective here is to discover the view on the data that best exposes the unknown underlying structure of the mapping problem to the learning algorithm.

8. Go Further

You can always go further.

There is usually more data you can collect, more views you can create on your data.

Brainstorm.

One easy win, once you feel like you are at the end of the road, is to begin investigating ensembles of models created from different views of your modeling problem.

It’s simple and highly effective, especially if the views expose different structures of the underlying mapping problem (e.g. the models have uncorrelated errors).
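The ensemble-of-views idea can be sketched as below. The "models" are simple one-feature least-squares fits on different feature subsets, standing in for any models whose errors are uncorrelated; the data is synthetic.

```python
# Sketch: averaging predictions from models trained on different
# views. One-feature least-squares fits stand in for real models;
# the data is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(scale=0.5, size=n)

def fit_predict(x, y):
    # One-feature least squares: y ~ a*x + b
    a, b = np.polyfit(x, y, deg=1)
    return a * x + b

pred1 = fit_predict(x1, y)        # view 1: only x1
pred2 = fit_predict(x2, y)        # view 2: only x2
ensemble = (pred1 + pred2) / 2.0  # average the two views

def mse(p):
    return float(np.mean((y - p) ** 2))

print(mse(pred1), mse(pred2), mse(ensemble))
```

Because each view exposes a different part of the underlying mapping, the averaged prediction beats either single-view model.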


Summary

In this post, you discovered techniques that you can use to get the most out of your data on your predictive modeling problem.

Specifically, you learned:

  • The importance of exploring alternate framings of your predictive modeling problem.
  • The need to develop a suite of “views” on your input data and to systematically test each.
  • The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.

Are there more ideas that you have for getting the most out of your data?
What do you normally do on a project?
Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


3 Responses to How to Get the Most From Your Machine Learning Data

  1. Elie Kawerk April 16, 2018 at 6:08 am #

    Thank you, Jason for this great post!

    I’d also add the following points:

    – Data Cleaning: handling missing values, corrupt records, etc.

    – Outlier Detection and Handling:

    – Pick suitable methods to detect outliers,
    – Remove outliers if you can make sure that they are erroneous,
    – Flag outliers with a dummy variable if there’s no reason to drop them.

    Best,
    Elie

  2. Alain April 20, 2018 at 2:09 pm #

    As always, an excellent, concise article that gets to the heart of the topic.

    To me, the “Can you change the problem to be regression/classification/sequence/etc.?” is key – sometimes a regression model is terrible while logistic regression solves the issue nicely.

    I agree with Elie as well – those can make or break a model. I have seen models fail to converge because of bad input data with massive (10^30 etc) values in currency fields.

    • Jason Brownlee April 20, 2018 at 2:21 pm #

      Thanks Alain!

      Yes, I agree. Problem framing is often the point of biggest leverage in a project. And often overlooked.
