Get Your Data Ready For Machine Learning in R with Pre-Processing

Preparing data is required to get the best results from machine learning algorithms.

In this post you will discover how to transform your data in order to best expose its structure to machine learning algorithms in R using the caret package.

You will work through 8 popular and powerful data transforms with recipes that you can study or copy and paste into your current or next machine learning project.

Let’s get started.

Pre-Process Your Machine Learning Dataset in R
Photo by Fraser Cairns, some rights reserved.

Need For Data Pre-Processing

You want to get the best accuracy from machine learning algorithms on your datasets.

Some machine learning algorithms require the data to be in a specific form, while other algorithms can perform better if the data is prepared in a specific way, but not always. Finally, your raw data may not be in the best format to expose the underlying structure and relationships to the predicted variables.

It is important to prepare your data in such a way that it gives various different machine learning algorithms the best chance on your problem.

You need to pre-process your raw data as part of your machine learning project.

Data Pre-Processing Methods

It is hard to know which data-preprocessing methods to use.

You can use rules of thumb such as:

  • Instance-based methods are more effective if the input attributes have the same scale.
  • Regression methods can work better if the input attributes are standardized.

These are heuristics, but not hard and fast laws of machine learning, because sometimes you can get better results if you ignore them.

You should try a range of data transforms with a range of different machine learning algorithms. This will help you discover both good representations for your data and algorithms that are better at exploiting the structure that those representations expose.

It is a good idea to spot check a number of transforms both in isolation as well as combinations of transforms.

In the next section you will discover how you can apply data transforms in order to prepare your data in R using the caret package.


Data Pre-Processing With Caret in R

The caret package in R provides a number of useful data transforms.

These transforms can be used in two ways.

  • Standalone: Transforms can be modeled from training data and applied to multiple datasets. The model of the transform is prepared using the preProcess() function and applied to a dataset using the predict() function.
  • Training: Transforms can be prepared and applied automatically during model evaluation. Transforms applied during training are prepared using the preProcess() function and passed to the train() function via the preProcess argument.

A number of data preprocessing examples are presented in this section. They are presented using the standalone method, but you can just as easily use the prepared preprocessed model during model training.
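As a sketch of the training approach, the following example standardizes the inputs automatically while fitting a model. The kNN model, the iris dataset and the resampling settings are illustrative choices here, not prescriptions:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# define the resampling scheme (5-fold cross validation)
control <- trainControl(method="cv", number=5)
# train a k-nearest neighbors model, centering and scaling
# the input attributes automatically during training
model <- train(Species~., data=iris, method="knn",
               preProcess=c("center", "scale"), trControl=control)
# summarize the fit
print(model)
```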

All of the preprocessing examples in this section are for numerical data. Note that the preprocessing function will skip over non-numeric data without error.

You can learn more about the data transforms provided by the caret package by reading the help for the preProcess function by typing ?preProcess and by reading the Caret Pre-Processing page.

The data transforms presented are more likely to be useful for algorithms such as regression algorithms, instance-based methods (like kNN and LVQ), support vector machines and neural networks. They are less likely to be useful for tree and rule based methods.

Summary of Transform Methods

Below is a quick summary of all of the transform methods supported in the method argument of the preProcess() function in caret.

  • "BoxCox": apply a Box-Cox transform; values must be non-zero and positive.
  • "YeoJohnson": apply a Yeo-Johnson transform; like Box-Cox, but values can be zero or negative.
  • "expoTrans": apply an exponential power transform, like Box-Cox and Yeo-Johnson.
  • "zv": remove attributes with zero variance (all the same value).
  • "nzv": remove attributes with near-zero variance (close to the same value).
  • "center": subtract the mean from values.
  • "scale": divide values by the standard deviation.
  • "range": normalize values into the range [0, 1].
  • "pca": transform data to the principal components.
  • "ica": transform data to the independent components.
  • "spatialSign": project data onto a unit sphere.

The following sections will demonstrate some of the more popular methods.

1. Scale

The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
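A minimal recipe for the scale transform, sketched here using the built-in iris dataset as an illustrative example:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# summarize the raw data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```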

Running the recipe will print a summary of the attributes before and after the scale transform.

2. Center

The center transform calculates the mean for an attribute and subtracts it from each value.
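A minimal recipe for the center transform, again sketched on the built-in iris dataset:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# summarize the raw data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```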

Running the recipe will print a summary of the attributes before and after the center transform.

3. Standardize

Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.
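A sketch combining both methods on the built-in iris dataset:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# summarize the raw data
summary(iris[,1:4])
# calculate the pre-process parameters: center, then scale
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```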

Notice how we can combine multiple methods in the call to preProcess() in caret. Running the recipe will print a summary of the attributes before and after standardization.

4. Normalize

Data values can be scaled into the range of [0, 1] which is called normalization.
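A sketch of the range transform on the built-in iris dataset:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# summarize the raw data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```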

Running the recipe will print a summary of the attributes before and after normalization.

5. Box-Cox Transform

When an attribute has a Gaussian-like distribution but is shifted or squashed to one side, it is skewed. The distribution of an attribute can be adjusted to reduce the skew and make it more Gaussian. The Box-Cox transform can perform this operation (it assumes all values are positive).
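A sketch of the Box-Cox transform, assuming the mlbench package and its PimaIndiansDiabetes dataset are available; the pedigree and age attributes are chosen here as examples of skewed, strictly positive attributes:

```r
# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# summarize the two skewed attributes
summary(PimaIndiansDiabetes[,c("pedigree", "age")])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,c("pedigree", "age")], method=c("BoxCox"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,c("pedigree", "age")])
# summarize the transformed dataset
summary(transformed)
```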

Notice that we applied the transform to only two attributes that appear to have a skew. Running the recipe will print a summary of the attributes before and after the transform.

For more on this transform, see the Box-Cox transform article on Wikipedia.

6. Yeo-Johnson Transform

The Yeo-Johnson transform is another power transform like the Box-Cox transform, but it supports raw values that are equal to zero or negative.
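A sketch of the Yeo-Johnson transform, again assuming the mlbench package and its PimaIndiansDiabetes dataset and using the same two attributes as illustrative examples:

```r
# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# summarize the two attributes
summary(PimaIndiansDiabetes[,c("pedigree", "age")])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,c("pedigree", "age")], method=c("YeoJohnson"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,c("pedigree", "age")])
# summarize the transformed dataset
summary(transformed)
```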

Running the recipe will print a summary of the attributes before and after the transform.

7. Principal Component Analysis

Transform the data to the principal components. The transform keeps the components that explain a given cumulative fraction of the variance (thresh, default 0.95); alternatively, the number of components can be specified directly (pcaComp). The result is attributes that are uncorrelated, which is useful for algorithms like linear and generalized linear regression.
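A sketch of the PCA transform on the built-in iris dataset; the data is standardized first, and non-numeric attributes (the Species class) are passed through untouched:

```r
# load the library
library(caret)
# load the dataset
data(iris)
# calculate the pre-process parameters: standardize, then PCA
preprocessParams <- preProcess(iris, method=c("center", "scale", "pca"))
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris)
# summarize the transformed dataset
summary(transformed)
```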

Notice that when we run the recipe, only two principal components are selected.

8. Independent Component Analysis

Transform the data to the independent components. Unlike PCA, ICA retains components that are statistically independent. You must specify the number of desired independent components with the n.comp argument. This can be useful for algorithms such as Naive Bayes.
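A sketch of the ICA transform, assuming the mlbench package (for the PimaIndiansDiabetes dataset) and the fastICA package (which caret uses under the hood) are available; the choice of 5 components is illustrative:

```r
# load libraries (the fastICA package is also required)
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# summarize the raw numeric attributes
summary(PimaIndiansDiabetes[,1:8])
# calculate the pre-process parameters: standardize, then ICA
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale", "ica"), n.comp=5)
# summarize the transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)
```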

Running the recipe will print a summary of the attributes before and after the transform.

Tips For Data Transforms

Below are some tips for getting the most out of data transforms.

  • Actually Use Them. You are a step ahead if you are thinking about and using data transforms to prepare your data. It is an easy step to forget or skip over and often has a huge impact on the accuracy of your final models.
  • Use a Variety. Try a number of different data transforms on your data with a suite of different machine learning algorithms.
  • Review a Summary. It is a good idea to summarize your data before and after a transform to understand the effect it had. The summary() function can be very useful.
  • Visualize Data. It is also a good idea to visualize the distribution of your data before and after to get a spatial intuition for the effect of the transform.

Summary

In this section you discovered 8 data preprocessing methods that you can use on your data in R via the caret package:

  • Data scaling
  • Data centering
  • Data standardization
  • Data normalization
  • The Box-Cox Transform
  • The Yeo-Johnson Transform
  • PCA Transform
  • ICA Transform

You can practice with the recipes presented in this section or apply them on your current or next machine learning project.

Next Step

Did you try out these recipes?

  1. Start your R interactive environment.
  2. Type or copy-paste the recipes above and try them out.
  3. Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.




19 Responses to Get Your Data Ready For Machine Learning in R with Pre-Processing

  1. yash July 2, 2016 at 9:01 pm #

    Hi,

    Under the section Summary of Transform Methods, it is mentioned that,
    “center“: divide values by standard deviation.
    “scale“: subtract mean from values.
    But when it comes to demonstration of methods, the functionality of center and scale is interchanged as,

    1. Scale

    The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.

    2. Center

    The center transform calculates the mean for an attribute and subtracts it from each value.

    So which is the correct functionality of these methods.

    • Jason Brownlee July 3, 2016 at 7:36 am #

      Quite right, I have fixed the typo. Sorry about that.

      Centre: subtract mean from values.
      Scale: divide values by standard deviation.

  2. Michael November 2, 2016 at 11:13 am #

    Hi Jason

    What do you do if some of your variables are category or factors, will preprocessing ignore these?

    • Jason Brownlee November 3, 2016 at 7:49 am #

      Hi Michael, good question.

      It may just ignore them, I believe that is the default behavior.

  3. Michael November 2, 2016 at 3:55 pm #

    Hi Jason

    I love your books, they are well written and

    I am doing a Kaggle competition ie

    https://www.kaggle.com/c/allstate-claims-severity

    Alot of the predictors are categorical, which I have turned to factors etc ie a,b,c etc with a few continuous.

    Could you please point me in the right direction re preprocessing this data with caret.

    I want to use random forest on the model.

    Cheers

    Michael

  4. Zoraze November 4, 2016 at 12:48 am #

    Hi,

    After the preprocessing, how i can transformed back my data to original values?

    Kind regards,
    Zoraze

    • Jason Brownlee November 4, 2016 at 9:10 am #

      Great question Zoraze.

      You will need to do the transform manually, like normalize. Then use the same coefficients to inverse the transform later.

      If you use caret, it can perform these operations automatically during training if needed.

  5. Goran January 20, 2017 at 10:02 pm #

    Hello,

    When should we be applying standardization ?
    I am currently applying normalization ( variables expressed in different units). Should I apply standardization next ?

    • Jason Brownlee January 21, 2017 at 10:33 am #

      Great question Goran.

      Strictly, standardization helps when your univariate distribution is Gaussian, the units vary across features and your algorithm assumes units do not vary and univariate distributions are Gaussian.

      Results can be good or even better when you break this strict heuristic.

      Generally, I recommend trying a suite of different data scaling and transforms for a problem in order to flush out what representations work well for your problem.

      • Goran January 23, 2017 at 9:30 pm #

        Moreover, is this the correct flow –

        Outlier removal -> Impute missing values -> Data treatment (Normalize etc) -> Check for correlations ?

        • Jason Brownlee January 24, 2017 at 11:04 am #

          Looks good to me Goran!

          • Goran January 24, 2017 at 9:43 pm #

            Thank you, Jason.

  6. Natasa June 27, 2017 at 9:49 pm #

    How is the preprocessing different when we have data from accelerometer signals which measure gait. For example the data consists of x,y,z which are the measures of the accelerometer, another column with miliseconds and a dependend variable which states the event, for example walking or sitting. In this case, do we have to create windows first and then start extracting features?

    • Jason Brownlee June 28, 2017 at 6:24 am #

      Consider taking a look in the literature to see how this type of data has been prepared in other experiments.

  7. Grace August 4, 2017 at 5:28 am #

    Hi Jason, I trained my NN model on pre-processed (normalized) data and then used the model to predict (the data I fed is also normalized). How do I convert the prediction results to be un-normalized so that it makes sense? Thanks.

    • Jason Brownlee August 4, 2017 at 7:04 am #

      Yes, you can inverse the scaling transform. Sorry I do not have an example on hand.

  8. Vaibhav Nellore September 6, 2017 at 5:28 pm #

    Hi Jason,

    I used preProcess to standardize my train data set(data frame). Then, I used it and developed a model.

    Then,to test my model on my test dataset(data frame), I need to standardize variables in test dataset with the same mean and std of variables in train dataset. (just, correct me if i am wrong!?) If so, Is there any package or method to do this?

    • Jason Brownlee September 7, 2017 at 12:50 pm #

      I have examples of this in Python, but not R, sorry.
