Get Your Data Ready For Machine Learning in R with Pre-Processing

By Jason Brownlee on August 22, 2019 in R Machine Learning

Preparing data is required to get the best results from machine learning algorithms.

In this post you will discover how to transform your data in order to best expose its structure to machine learning algorithms in R using the caret package.

You will work through 8 popular and powerful data transforms with recipes that you can study or copy and paste into your current or next machine learning project.

Let’s get started.

Need For Data Pre-Processing

You want to get the best accuracy from machine learning algorithms on your datasets.

Some machine learning algorithms require the data to be in a specific form, whereas others can perform better if the data is prepared in a specific way, but not always. Finally, your raw data may not be in the best format to expose the underlying structure and the relationships to the predicted variable.

It is important to prepare your data in such a way that it gives a range of different machine learning algorithms the best chance on your problem.

You need to pre-process your raw data as part of your machine learning project.

Data Pre-Processing Methods

It is hard to know which data-preprocessing methods to use.

You can use rules of thumb such as:

Instance-based methods are more effective if the input attributes have the same scale.
Regression methods can work better if the input attributes are standardized.

These are heuristics, not hard and fast laws of machine learning: sometimes you can get better results if you ignore them.

You should try a range of data transforms with a range of different machine learning algorithms. This will help you discover both good representations for your data and algorithms that are better at exploiting the structure that those representations expose.

It is a good idea to spot check a number of transforms both in isolation and in combination.

In the next section you will discover how you can apply data transforms in order to prepare your data in R using the caret package.

Data Pre-Processing With Caret in R

The caret package in R provides a number of useful data transforms.

These transforms can be used in two ways.

Standalone: Transforms can be modeled from training data and applied to multiple datasets. The model of the transform is prepared using the preProcess() function and applied to a dataset using the predict() function.
Training: Transforms can be prepared and applied automatically during model evaluation. Transforms applied during training are named in the preProcess argument of the train() function, which prepares them on the training data of each resample.

A number of data preprocessing examples are presented in this section. They are presented using the standalone method, but you can just as easily request the same transforms during model training.
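
As a rough sketch of the training mode described above, the transforms can be named in the preProcess argument of train(). The choice of kNN, 5-fold cross-validation and the seed value are assumptions made for illustration; they are not part of the recipes in this post.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# fix the random seed so the resampling splits are repeatable (arbitrary value)
set.seed(7)
# 5-fold cross-validation (an illustrative choice)
control <- trainControl(method="cv", number=5)
# center and scale are estimated on the training folds and applied to the held-out fold
model <- train(Species~., data=iris, method="knn", preProcess=c("center", "scale"), trControl=control)
# summarize the fitted model, including the pre-processing that was used
print(model)
```

Because the transform parameters are re-estimated inside each fold, no information from the held-out rows is used when preparing the transform.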

All of the preprocessing examples in this section are for numerical data. Note that the preprocessing functions will skip over non-numeric data without raising an error.

You can learn more about the data transforms provided by the caret package by typing ?preProcess to read the help for the preProcess function, and by reading the Caret Pre-Processing page.

The data transforms presented are more likely to be useful for algorithms such as regression algorithms, instance-based methods (like kNN and LVQ), support vector machines and neural networks. They are less likely to be useful for tree and rule based methods.

Summary of Transform Methods

Below is a quick summary of the main transform methods supported in the method argument of the preProcess() function in caret.

"BoxCox": apply a Box-Cox transform; values must be strictly positive.
"YeoJohnson": apply a Yeo-Johnson transform, like Box-Cox, but values can be zero or negative.
"expoTrans": apply an exponential transform, which like YeoJohnson also accepts zero and negative values.
"zv": remove attributes with zero variance (all the same value).
"nzv": remove attributes with near-zero variance (close to the same value).
"center": subtract the mean from values.
"scale": divide values by the standard deviation.
"range": normalize values to the range [0, 1].
"pca": transform data to the principal components.
"ica": transform data to the independent components.
"spatialSign": project data onto a unit sphere (a unit circle in two dimensions).

The following sections will demonstrate some of the more popular methods.

1. Scale

The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - scaled (4)

  Sepal.Length    Sepal.Width      Petal.Length     Petal.Width    
 Min.   :5.193   Min.   : 4.589   Min.   :0.5665   Min.   :0.1312  
 1st Qu.:6.159   1st Qu.: 6.424   1st Qu.:0.9064   1st Qu.:0.3936  
 Median :7.004   Median : 6.883   Median :2.4642   Median :1.7055  
 Mean   :7.057   Mean   : 7.014   Mean   :2.1288   Mean   :1.5734  
 3rd Qu.:7.729   3rd Qu.: 7.571   3rd Qu.:2.8890   3rd Qu.:2.3615  
 Max.   :9.540   Max.   :10.095   Max.   :3.9087   Max.   :3.2798

2. Center

The center transform calculates the mean for an attribute and subtracts it from each value.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)

 Sepal.Length       Sepal.Width        Petal.Length     Petal.Width     
 Min.   :-1.54333   Min.   :-1.05733   Min.   :-2.758   Min.   :-1.0993  
 1st Qu.:-0.74333   1st Qu.:-0.25733   1st Qu.:-2.158   1st Qu.:-0.8993  
 Median :-0.04333   Median :-0.05733   Median : 0.592   Median : 0.1007  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.: 0.55667   3rd Qu.: 0.24267   3rd Qu.: 1.342   3rd Qu.: 0.6007  
 Max.   : 2.05667   Max.   : 1.34267   Max.   : 3.142   Max.   : 1.3007

3. Standardize

Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Notice how multiple methods can be passed as a vector when defining the preProcess procedure in caret. Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - scaled (4)

 Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
 Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
 1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
 Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
 Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064
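
To see what the combined transform computes, the same result can be reproduced by hand in base R. This is only a sketch of the arithmetic, not caret's internal code; the names x and standardized are illustrative.

```r
# load the dataset
data(iris)
# take one attribute as an example
x <- iris$Sepal.Length
# center (subtract the mean), then scale (divide by the standard deviation)
standardized <- (x - mean(x)) / sd(x)
# the mean is now 0 (up to rounding error) and the standard deviation is 1
print(mean(standardized))
print(sd(standardized))
# the minimum and maximum match the Sepal.Length column in the output above
print(range(standardized))
```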

4. Normalize

Data values can be scaled into the range [0, 1], which is called normalization.

# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (4)


  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
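
The rescaling itself is simple arithmetic: subtract the minimum and divide by the difference between the maximum and the minimum. The sketch below repeats it by hand in base R for one attribute; it illustrates the formula and is not caret's internal code, and the names x and normalized are illustrative.

```r
# load the dataset
data(iris)
# take one attribute as an example
x <- iris$Petal.Length
# subtract the minimum, then divide by the spread between minimum and maximum
normalized <- (x - min(x)) / (max(x) - min(x))
# the values now run from 0 to 1, matching the Petal.Length column in the output above
summary(normalized)
```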

5. Box-Cox Transform

When an attribute has a Gaussian-like distribution but is asymmetric, with a longer tail on one side, this is called a skew. The distribution of an attribute can be transformed to reduce the skew and make it more Gaussian. The Box-Cox transform can perform this operation (it assumes all values are strictly positive).

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)

Notice, we applied the transform to only two attributes that appear to have a skew. Running the recipe, you will see:

    pedigree           age       
 Min.   :0.0780   Min.   :21.00  
 1st Qu.:0.2437   1st Qu.:24.00  
 Median :0.3725   Median :29.00  
 Mean   :0.4719   Mean   :33.24  
 3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :2.4200   Max.   :81.00  

Created from 768 samples and 2 variables

Pre-processing:
  - Box-Cox transformation (2)
  - ignored (0)

Lambda estimates for Box-Cox transformation:
-0.1, -1.1

    pedigree            age        
 Min.   :-2.5510   Min.   :0.8772  
 1st Qu.:-1.4116   1st Qu.:0.8815  
 Median :-0.9875   Median :0.8867  
 Mean   :-0.9599   Mean   :0.8874  
 3rd Qu.:-0.4680   3rd Qu.:0.8938  
 Max.   : 0.8838   Max.   :0.9019

For more on this transform see the Box-Cox transform Wikipedia page.
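
For reference, the Box-Cox transform is (x^lambda - 1) / lambda, or log(x) when lambda is 0. The sketch below applies that formula by hand with the rounded lambda values printed in the output; the helper box_cox is a hypothetical name, not a caret function. The transformed pedigree values in the output equal log(x), which is consistent with caret treating a lambda this close to 0 as 0.

```r
# Box-Cox transform for strictly positive values (hypothetical helper)
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
# age with the reported lambda of -1.1: the minimum (21) and maximum (81)
print(box_cox(c(21, 81), -1.1))
# pedigree: a plain log reproduces the reported minimum and maximum
print(log(c(0.078, 2.42)))
```

Because the printed lambda estimates are rounded, values computed this way may differ from caret's output in the last digit.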

6. Yeo-Johnson Transform

The Yeo-Johnson transform is another power transform like the Box-Cox transform, but it also supports raw values that are zero or negative.

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("YeoJohnson"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)

Running the recipe, you will see:

    pedigree           age       
 Min.   :0.0780   Min.   :21.00  
 1st Qu.:0.2437   1st Qu.:24.00  
 Median :0.3725   Median :29.00  
 Mean   :0.4719   Mean   :33.24  
 3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :2.4200   Max.   :81.00  

Created from 768 samples and 2 variables

Pre-processing:
  - ignored (0)
  - Yeo-Johnson transformation (2)

Lambda estimates for Yeo-Johnson transformation:
-2.25, -1.15

    pedigree           age        
 Min.   :0.0691   Min.   :0.8450  
 1st Qu.:0.1724   1st Qu.:0.8484  
 Median :0.2265   Median :0.8524  
 Mean   :0.2317   Mean   :0.8530  
 3rd Qu.:0.2956   3rd Qu.:0.8580  
 Max.   :0.4164   Max.   :0.8644
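
The sketch below writes out the Yeo-Johnson formula by hand to show how zero and negative values are handled. The helper yeo_johnson is a hypothetical name, not a caret function, and the input values in the first call are made up for illustration.

```r
# Yeo-Johnson transform (hypothetical helper); unlike Box-Cox it accepts zero and negative values
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # non-negative values: a Box-Cox transform of x + 1
  if (lambda != 0) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(x[pos] + 1)
  }
  # negative values: a Box-Cox transform of -x + 1 with power 2 - lambda, sign flipped
  if (lambda != 2) {
    out[!pos] <- -((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(-x[!pos] + 1)
  }
  out
}
# zero and negative inputs are handled without error
print(yeo_johnson(c(-2, 0, 2), 0.5))
# the smallest pedigree value with the reported lambda of -2.25
print(yeo_johnson(0.078, -2.25))
```

Because the printed lambda estimates are rounded, values computed this way may differ from caret's output in the last digit.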

7. Principal Component Analysis

Transform the data to the principal components. The transform keeps the components needed to reach a cumulative variance threshold (thresh, default=0.95), or the number of components can be specified (pcaComp). The result is attributes that are uncorrelated, which is useful for algorithms like linear and generalized linear regression.

# load the libraries (preProcess comes from caret)
library(caret)
# load the dataset
data(iris)
# summarize dataset
summary(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris, method=c("center", "scale", "pca"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris)
# summarize the transformed dataset
summary(transformed)

Notice that when we run the recipe, only two principal components are selected.

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
Created from 150 samples and 5 variables

Pre-processing:
  - centered (4)
  - ignored (1)
  - principal component signal extraction (4)
  - scaled (4)

PCA needed 2 components to capture 95 percent of the variance

       Species        PC1               PC2          
 setosa    :50   Min.   :-2.7651   Min.   :-2.67732  
 versicolor:50   1st Qu.:-2.0957   1st Qu.:-0.59205  
 virginica :50   Median : 0.4169   Median :-0.01744  
                 Mean   : 0.0000   Mean   : 0.00000  
                 3rd Qu.: 1.3385   3rd Qu.: 0.59649  
                 Max.   : 3.2996   Max.   : 2.64521
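
The number of components kept can be checked by hand. This sketch uses prcomp() from base R rather than caret, so it is an independent check of the 95 percent threshold and not caret's internal code; the variable names are illustrative.

```r
# load the dataset
data(iris)
# PCA on the four centered and scaled numeric attributes
pca <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE)
# cumulative proportion of variance explained by the components
cumulativeVariance <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(cumulativeVariance)
# the smallest number of components that reaches the 0.95 threshold
numComponents <- which(cumulativeVariance >= 0.95)[1]
print(numComponents)
```

With caret itself, the threshold can be changed with the thresh argument of preProcess(), or a fixed number of components can be requested with pcaComp.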

8. Independent Component Analysis

Transform the data to the independent components. Unlike PCA, ICA retains those components that are independent. You must specify the number of desired independent components with the n.comp argument. This method requires the fastICA package to be installed. It is useful for algorithms such as Naive Bayes.

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize dataset
summary(PimaIndiansDiabetes[,1:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale", "ica"), n.comp=5)
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

    pregnant         glucose         pressure         triceps         insulin           mass          pedigree     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.00   Min.   :0.0780  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5   Median :32.00   Median :0.3725  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8   Mean   :31.99   Mean   :0.4719  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0   Max.   :67.10   Max.   :2.4200  
      age       
 Min.   :21.00  
 1st Qu.:24.00  
 Median :29.00  
 Mean   :33.24  
 3rd Qu.:41.00  
 Max.   :81.00  

Created from 768 samples and 8 variables

Pre-processing:
  - centered (8)
  - independent component signal extraction (8)
  - ignored (0)
  - scaled (8)

ICA used 5 components

      ICA1              ICA2               ICA3              ICA4                ICA5        
 Min.   :-5.7213   Min.   :-4.89818   Min.   :-6.0289   Min.   :-2.573436   Min.   :-1.8815  
 1st Qu.:-0.4873   1st Qu.:-0.48188   1st Qu.:-0.4693   1st Qu.:-0.640601   1st Qu.:-0.8279  
 Median : 0.1813   Median : 0.05071   Median : 0.2987   Median : 0.007582   Median :-0.2416  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
 3rd Qu.: 0.6839   3rd Qu.: 0.56462   3rd Qu.: 0.6941   3rd Qu.: 0.638238   3rd Qu.: 0.7048  
 Max.   : 2.1819   Max.   : 4.25611   Max.   : 1.3726   Max.   : 3.761017   Max.   : 2.9622

Tips For Data Transforms

Below are some tips for getting the most out of data transforms.

Actually Use Them. You are a step ahead if you are thinking about and using data transforms to prepare your data. It is an easy step to forget or skip over and often has a huge impact on the accuracy of your final models.
Use a Variety. Try a number of different data transforms on your data with a suite of different machine learning algorithms.
Review a Summary. It is a good idea to summarize your data before and after a transform to understand the effect it had. The summary() function can be very useful.
Visualize Data. It is also a good idea to visualize the distribution of your data before and after to get a spatial intuition for the effect of the transform.
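
As one possible way to follow the last two tips, the sketch below plots a histogram of an attribute before and after a transform. The choice of the age attribute, the Box-Cox transform and base R hist() are assumptions made for illustration; any plotting approach would do.

```r
# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare and apply a Box-Cox transform to pedigree and age
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# draw the two histograms side by side
par(mfrow=c(1,2))
hist(PimaIndiansDiabetes$age, main="age (raw)", xlab="age")
hist(transformed$age, main="age (Box-Cox)", xlab="transformed age")
```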

Summary

In this section you discovered 8 data preprocessing methods that you can use on your data in R via the caret package:

Data scaling
Data centering
Data standardization
Data normalization
The Box-Cox Transform
The Yeo-Johnson Transform
PCA Transform
ICA Transform

You can practice with the recipes presented in this section or apply them on your current or next machine learning project.

Next Step

Did you try out these recipes?

Start your R interactive environment.
Type or copy-paste the recipes above and try them out.
Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.

61 Responses to Get Your Data Ready For Machine Learning in R with Pre-Processing

yash July 2, 2016 at 9:01 pm #

Hi,

Under the section Summary of Transform Methods, it is mentioned that,
“center“: divide values by standard deviation.
“scale“: subtract mean from values.
But when it comes to demonstration of methods, the functionality of center and scale is interchanged as,

1. Scale

The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.

2. Center

The center transform calculates the mean for an attribute and subtracts it from each value.

So which is the correct functionality of these methods?

Reply
- Jason Brownlee July 3, 2016 at 7:36 am #
  
  Quite right, I have fixed the typo. Sorry about that.
  
  Centre: subtract mean from values.
  Scale: divide values by standard deviation.
  
  Reply
Michael November 2, 2016 at 11:13 am #

Hi Jason

What do you do if some of your variables are categories or factors? Will preprocessing ignore these?

Reply
- Jason Brownlee November 3, 2016 at 7:49 am #
  
  Hi Michael, good question.
  
  It may just ignore them, I believe that is the default behavior.
  
  Reply
Michael November 2, 2016 at 3:55 pm #

Hi Jason

I love your books, they are well written and

I am doing a Kaggle competition ie

https://www.kaggle.com/c/allstate-claims-severity

A lot of the predictors are categorical, which I have turned into factors (a, b, c, etc.), with a few continuous.

Could you please point me in the right direction re preprocessing this data with caret.

I want to use random forest on the model.

Cheers

Michael

Reply
- Jason Brownlee November 3, 2016 at 7:52 am #
  
  Thanks Michael.
  
  Yes, I would recommend experimenting with turning them into binary variables, also called one hot encoding or dummy variables. You can learn more here:
  https://topepo.github.io/caret/pre-processing.html
  
  Reply
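
For the one hot encoding suggestion above, a minimal sketch with caret's dummyVars() might look like the following. The data frame df and its columns are hypothetical toy data, not the Kaggle dataset from the question.

```r
library(caret)

# hypothetical toy data: one factor and one numeric predictor
df <- data.frame(
  color = factor(c("red", "green", "blue", "green")),
  size  = c(1.5, 2.0, 3.5, 4.0)
)

# build the encoder from the data, then apply it
encoder <- dummyVars(~ ., data = df)
encoded <- predict(encoder, newdata = df)

# one binary column per factor level, numeric columns passed through
print(encoded)
```

Whether encoding helps a random forest is something to test on your own problem; some implementations accept factors directly.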
Zoraze November 4, 2016 at 12:48 am #

Hi,

After the preprocessing, how can I transform my data back to the original values?

Kind regards,
Zoraze

Reply
- Jason Brownlee November 4, 2016 at 9:10 am #
  
  Great question Zoraze.
  
  You will need to do the transform manually, like normalize. Then use the same coefficients to inverse the transform later.
  
  If you use caret, it can perform these operations automatically during training if needed.
  
  Reply
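
The manual route described in the reply above can be sketched in base R. This is only an illustration on a hypothetical vector y; the key idea is to keep the coefficients (mean and standard deviation, or minimum and maximum) so the same numbers can be used to undo the transform later, for example on model predictions.

```r
# hypothetical target values on their original scale
y <- c(10, 20, 35, 50, 80)

# standardization: keep the coefficients so the transform can be undone
y_mean <- mean(y)
y_sd <- sd(y)
y_std <- (y - y_mean) / y_sd
y_back <- y_std * y_sd + y_mean

# normalization to [0, 1]: keep the minimum and maximum
y_min <- min(y)
y_max <- max(y)
y_norm <- (y - y_min) / (y_max - y_min)
y_back2 <- y_norm * (y_max - y_min) + y_min

# both round trips recover the original values
all.equal(y, y_back)
all.equal(y, y_back2)
```

Predictions made on the transformed scale can be mapped back the same way, by replacing y_std or y_norm with the predicted values.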
Goran January 20, 2017 at 10:02 pm #

Hello,

When should we be applying standardization?
I am currently applying normalization (variables expressed in different units). Should I apply standardization next?

Reply
- Jason Brownlee January 21, 2017 at 10:33 am #
  
  Great question Goran.
  
  Strictly, standardization helps when your univariate distribution is Gaussian, the units vary across features and your algorithm assumes units do not vary and univariate distributions are Gaussian.
  
  Results can be good or even better when you break this strict heuristic.
  
  Generally, I recommend trying a suite of different data scaling and transforms for a problem in order to flush out what representations work well for your problem.
  
  Reply
  - Goran January 23, 2017 at 9:30 pm #
    
    Moreover, is this the correct flow –
    
    Outlier removal -> Impute missing values -> Data treatment (Normalize etc) -> Check for correlations ?
    
    Reply
    - Jason Brownlee January 24, 2017 at 11:04 am #
      
      Looks good to me Goran!
      
      Reply
      - Goran January 24, 2017 at 9:43 pm #
        
        Thank you, Jason.
      - Skylar June 5, 2020 at 3:12 pm #
        
        Hi Jason,
        
        I have some doubt for the flow: should we first check the correlations between predictors to see collinearity, then do data treatment (normalization, etc)? I read your book and it follows this flow. Thank you!
      - Jason Brownlee June 6, 2020 at 7:45 am #
        
        Yes.
        
        Or perhaps perform treatment regardless and compare to a model fit on the raw data and see if it results in a lift in performance.
Natasa June 27, 2017 at 9:49 pm #

How is the preprocessing different when we have data from accelerometer signals which measure gait? For example, the data consists of x, y, z, which are the measures of the accelerometer, another column with milliseconds and a dependent variable which states the event, for example walking or sitting. In this case, do we have to create windows first and then start extracting features?

Reply
- Jason Brownlee June 28, 2017 at 6:24 am #
  
  Consider taking a look in the literature to see how this type of data has been prepared in other experiments.
  
  Reply
Grace August 4, 2017 at 5:28 am #

Hi Jason, I trained my NN model on pre-processed (normalized) data and then used the model to predict (the data I fed is also normalized). How do I convert the prediction results to be un-normalized so that it makes sense? Thanks.

Reply
- Jason Brownlee August 4, 2017 at 7:04 am #
  
  Yes, you can invert the scaling transform. Sorry, I do not have an example on hand.
  
  Reply
Vaibhav Nellore September 6, 2017 at 5:28 pm #

Hi Jason,

I used preProcess to standardize my train data set (data frame). Then I used it and developed a model.

Then, to test my model on my test dataset (data frame), I need to standardize the variables in the test dataset with the same mean and std as the variables in the train dataset (correct me if I am wrong). If so, is there any package or method to do this?

Reply
- Jason Brownlee September 7, 2017 at 12:50 pm #
  
  I have examples of this in Python, but not R, sorry.
  
  Reply
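
One way to approach the question above in R is to estimate the parameters on the training rows with preProcess() and then call predict() on both sets. The sketch below assumes the caret package and the iris data; the 80/20 split and the variable names are hypothetical choices for illustration.

```r
library(caret)

data(iris)
set.seed(7)

# hypothetical 80/20 split of the four numeric attributes
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_x <- iris[train_idx, 1:4]
test_x <- iris[-train_idx, 1:4]

# estimate mean and standard deviation on the training rows only
params <- preProcess(train_x, method = c("center", "scale"))

# apply the same training parameters to both sets
train_std <- predict(params, train_x)
test_std <- predict(params, test_x)

# training columns have mean 0; test columns will generally not be exactly 0
colMeans(train_std)
colMeans(test_std)
```

Because the test set is transformed with the training mean and standard deviation, no information from the test rows leaks into the parameters.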
moumtana October 22, 2017 at 3:33 pm #

Dear Brownlee,

Thank you for the recipes, they helped me so much. I am asking if there are R recipes to use ISOMAP (nonlinear) preprocessing.

Thank you in advance

Moumtana

Reply
- Jason Brownlee October 23, 2017 at 5:40 am #
  
  Perhaps check the documentation for the function:
  
  ?isomap
  
  Reply
abdal November 20, 2017 at 7:39 pm #

Hi Jason, thanks for the tutorial. However, I have one problem: after training and testing, I supply new inputs in standardized form to the neural network to get new outputs, and these outputs are in the standardized range as well. How do I get the outputs back to the original unstandardized range?

Reply
- Jason Brownlee November 22, 2017 at 10:39 am #
  
  You can reverse the standardization process.
  
  Reply
Noelia December 15, 2017 at 11:11 pm #

Hi Jason,

If I have a dataset of categorical variables in the same units but with vastly different values (like height at age 8 and age 25): is it recommendable to standardize them? I mean, I don’t know what to do because for all of them the unit represent the same distance.

Also, is there a kind of dataset where normalization is highly recommended to be applied?

I’m struggling with all these concepts and, although now I’m starting to understand them, I don’t know when I should apply each one.

Reply
- Jason Brownlee December 16, 2017 at 5:27 am #
  
  Those sound like numerical and not categorical values.
  
  Standardizing is a great idea if the variables are Gaussian, try it and see how it impacts model skill. If not, try normalizing.
  
  Reply
Abdou June 6, 2018 at 3:35 am #

Hi Jason,

I am struggling with reversing these transformations after predictions on the test sets. Is there a way in R to do the reversing the same way as Python’s fit_transform() and inverse_transform()? Thank you

Reply
- Jason Brownlee June 6, 2018 at 6:42 am #
  
  Often the caret package will handle this for you.
  
  Reply
  - Marcin November 6, 2018 at 8:57 pm #
    
    Would it be possible for you to edit the current post or write a separate blog post on how to use caret to invert the pre-processing transformations after prediction? It would greatly enhance this post.
    
    Reply
    - Jason Brownlee November 7, 2018 at 6:01 am #
      
      Thanks for the suggestion.
      
      Reply
Vijaya July 6, 2018 at 11:31 am #

Hi Jason,
I just want to know what are the dimension reduction technique for SVM using R and Python.

Reply
- Jason Brownlee July 7, 2018 at 6:10 am #
  
  Same as other methods: PCA, SVD, feature selection, Sammon mapping, SOM, t-SNE, etc.
  
  Reply
ani_weck November 3, 2018 at 10:06 am #

Hi Jason,
How can I see the results of the preProcess on my input data? I just want to check what the different parameters do to my dataset (i.e., range vs scale).

Reply
- Jason Brownlee November 4, 2018 at 6:24 am #
  
  You may have to perform the data preparation operation separately.
  
  Reply
Ton February 21, 2019 at 6:21 pm #

Dear Jason!
If I use a standalone method for pre-processing data, will the pre-processing step be performed again in the training step by default?

Reply
- Jason Brownlee February 22, 2019 at 6:15 am #
  
  Not sure I understand, sorry, can you please elaborate?
  
  Reply
laz June 29, 2019 at 10:33 am #

Dear Jason, thank you for your great articles.

I have question regarding caret’s pre-processing in Timeslice-Mode.

We split our data in equal train/test folds like this:

[1] “———————————- train fold[1] | length 861”
# from 1 2 3 4 5 6
### to 856 857 858 859 860 861
[1] “———————————- train fold[2] | length 861”
# from 431 432 433 434 435 436
### to 1286 1287 1288 1289 1290 1291
[1] “———————————- train fold[3] | length 861”
# from 861 862 863 864 865 866
### to 1716 1717 1718 1719 1720 1721

[1] “———————————- test fold[1] | length 430”
# from 862 863 864 865 866 867
### to 1286 1287 1288 1289 1290 1291
[1] “———————————- test fold[2] | length 430”
# from 1292 1293 1294 1295 1296 1297
### to 1716 1717 1718 1719 1720 1721
[1] “———————————- test fold[3] | length 430”
# from 1722 1723 1724 1725 1726 1727
### to 2146 2147 2148 2149 2150 2151

and so on…

Now caret starts train on “train fold[1]”. With “center+scale” it uses “train fold[1]” for getting the mean and the stdDev of ALL train data from “train fold[1]”.

Caret calculates mean & stdDev from the COMPLETE TRAIN data, so if caret trains a model – this model already knows the future mean and stdDev?

My models are over-fitting the train data, i know there are a lot of possible reasons. But can this also be caused by this kind of data-snooping? Is that data-snooping?

Thank you!

Reply
- Jason Brownlee June 30, 2019 at 9:34 am #
  
  Not quite, it uses k-fold cross validation. You can learn more here:
  https://machinelearningmastery.com/k-fold-cross-validation/
  
  Reply
Mano Brown June 29, 2019 at 7:21 pm #

Hi Jason,

Could you write a post in which you perform the standardization technique into the “Your First Machine Learning Project in R Step-By-Step (tutorial and template for future projects)”.

Kind regards,

Mano Brown

Reply
- Jason Brownlee June 30, 2019 at 9:38 am #
  
  Thanks for the suggestion Mano. Perhaps in the future.
  
  Reply
Tanuja Joshi July 1, 2019 at 3:13 pm #

Hi Jason, I am currently working on a loan dataset which has an interest rate attribute (entries in %). I want to normalize the data, so how should I handle this attribute?
Is it required to remove the percentage sign before normalization?

Reply
- Jason Brownlee July 2, 2019 at 7:28 am #
  
  Yes, machine learning algorithms must work with numbers.
  
  Reply
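
As a small illustration of the point above, text values such as "10.65%" can be converted to numbers before normalizing. The vector rate_txt is hypothetical example data; the sketch uses base R only.

```r
# hypothetical interest rate column stored as text with a percent sign
rate_txt <- c("10.65%", "15.27%", "7.90%", "13.49%")

# strip the percent sign and convert to numeric
rate <- as.numeric(sub("%", "", rate_txt, fixed = TRUE))

# normalize to the range [0, 1]
rate_norm <- (rate - min(rate)) / (max(rate) - min(rate))
print(rate_norm)
```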
Anupam Mukherjee July 9, 2019 at 7:16 pm #

Thanks a lot for all the effort on teaching me ML 🙂

How do I apply scaling information on Train data to Test set? Or we can independently scale Train and test?

Reply
- Jason Brownlee July 10, 2019 at 8:06 am #
  
  I believe caret will manage this for you.
  
  Reply
  - Anupam Mukherjee July 17, 2019 at 7:24 pm #
    
    Thanks 🙂
    
    And in python?
    
    Reply
    - Jason Brownlee July 18, 2019 at 8:24 am #
      
      In Python, you can use a Pipeline:
      https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
      
      Reply
Shivendra July 24, 2019 at 8:49 pm #

Great stuff Jason! But do you have a tutorial on one hot encoding in R similar to the one you have written for Python? That would be great and would allow (me) to compare certain differences in the two languages.

Best,
Shivendra

Reply
- Jason Brownlee July 25, 2019 at 7:50 am #
  
  I don’t believe so, sorry.
  
  Reply
John August 15, 2019 at 11:44 am #

Python has StandardScaler which seems to be a combination of Center and Scale:
output = (x - u)/s
Is that the equivalent of where you have:
method=c("center", "scale")
Thanks!

Reply
- Jason Brownlee August 15, 2019 at 2:19 pm #
  
  I believe so.
  
  Reply
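
The equivalence asked about above can be checked directly. The sketch below, assuming the caret package and the iris data, compares the combined center and scale methods with (x - mean) / sd computed by base R's scale().

```r
library(caret)

data(iris)
x <- iris[, 1:4]

# caret: center then scale
params <- preProcess(x, method = c("center", "scale"))
x_caret <- predict(params, x)

# the same thing by hand: (x - mean) / sd, where sd() divides by n - 1
x_manual <- scale(x, center = TRUE, scale = TRUE)

# TRUE when the two results agree
all.equal(as.matrix(x_caret), x_manual, check.attributes = FALSE)
```

One caveat when comparing numbers across languages: scikit-learn's StandardScaler divides by the population standard deviation (denominator n), whereas R's sd() uses n - 1, so the two outputs differ by a small constant factor on small datasets.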
Bharat October 11, 2019 at 2:57 pm #

Hi Jason,

Is there any example you came across of the KNN algorithm for mixed data types? How do you handle factors (categorical data) with just 0 and 1, such as gender “Male” and “Female”? It is a classification problem.

Thanks
Bharat

Reply
- Jason Brownlee October 12, 2019 at 6:46 am #
  
  Yes, typically a Hamming distance or boolean (same == 0, different == 1) is useful for categorical data, which can be added to the Euclidean distance for the normalized numerical values.
  
  Reply
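
The combined distance described in the reply above could be sketched as follows. The function mixed_distance() is a hypothetical helper written for illustration, not part of caret or any KNN package, and it assumes the numeric values have already been normalized to [0, 1].

```r
# distance between two rows with numeric and categorical parts
# num_a, num_b: numeric vectors already normalized to [0, 1]
# cat_a, cat_b: character vectors of category labels
mixed_distance <- function(num_a, num_b, cat_a, cat_b) {
  euclidean <- sqrt(sum((num_a - num_b)^2))
  hamming <- sum(cat_a != cat_b)  # 0 if same, 1 if different, per attribute
  euclidean + hamming
}

# hypothetical rows: two normalized numeric attributes and gender
mixed_distance(c(0.2, 0.5), c(0.2, 0.9), "Male", "Female")
mixed_distance(c(0.2, 0.5), c(0.2, 0.9), "Male", "Male")
```

Adding the two parts with equal weight is a design choice; weighting the categorical part differently may work better on some problems.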
Sunil Kappal October 19, 2019 at 12:05 am #

Thanks a lot for this amazing article, I was thinking if there is a possibility or a way of using median and median absolute deviation to center the data especially where there are extreme outliers. I haven’t seen any data normalization technique or routine explicitly catering to this phenomenon during the normalization process.

Reply
- Jason Brownlee October 19, 2019 at 6:45 am #
  
  Sure, perhaps try it on your problem and see if it helps?
  
  Reply
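
The median and median absolute deviation idea from the question above is easy to try by hand in base R. The sketch below uses a hypothetical vector with one extreme outlier and compares the robust version with ordinary standardization.

```r
# hypothetical attribute with one extreme outlier
x <- c(2, 3, 4, 5, 6, 100)

# robust alternative to standardization: center on the median, scale by the MAD
# note: mad() multiplies by 1.4826 by default so it is comparable to sd for Gaussian data
x_robust <- (x - median(x)) / mad(x)

# ordinary standardization for comparison
x_std <- (x - mean(x)) / sd(x)

print(x_robust)
print(x_std)
```

In the robust version the outlier does not influence the center or the spread, so the remaining values keep a sensible scale relative to each other.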
Huaichao April 29, 2020 at 6:45 pm #

Thank you for sharing. I have one issue: every time we apply the center transform, there are negative values, which is not desired. Is there something wrong with this method?

Reply
- Jason Brownlee April 30, 2020 at 6:40 am #
  
  If you don’t want negative values, you can use normalization instead.
  
  Reply
Balaji Sundararaman May 24, 2020 at 3:10 pm #

Hi Jason,
Thanks for this. How does one transform the preprocessed (e.g. scaled, centered, range) values back to the original after modelling and prediction? Is there a method for that in the caret package?
Thanks.

Reply
- Jason Brownlee May 25, 2020 at 5:43 am #
  
  Good question, I’m not sure off hand.
  
  Reply
Ajax June 14, 2021 at 10:27 am #

Hi Jason,
What are you doing in the predict step below?
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4]) # <-- What is being predicted here?
# summarize the transformed dataset
summary(transformed)

Reply
- Jason Brownlee June 15, 2021 at 6:02 am #
  
  We are transforming the input variables, specifically, we are scaling their values.
  
  Reply
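
To make the reply above concrete: preProcess() only estimates and stores the parameters, and predict() applies them to a dataset. A minimal sketch, assuming the caret package, compares the result for one column with the same rescaling written out by hand.

```r
library(caret)

data(iris)
x <- iris[, 1:4]

# "fit": store each column's minimum and maximum
params <- preProcess(x, method = c("range"))

# "predict": apply the stored parameters to rescale the data
transformed <- predict(params, x)

# the same rescaling written out by hand for one column
manual <- (x$Sepal.Length - min(x$Sepal.Length)) /
  (max(x$Sepal.Length) - min(x$Sepal.Length))

# TRUE when the two results agree
all.equal(transformed$Sepal.Length, manual)
```

So nothing is predicted in the modelling sense; predict() is simply the generic function caret uses to apply a stored transform.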
