**How to Selectively Scale Numerical Input Variables for Machine Learning**

It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully **selecting which data transform to apply to each input variable** prior to modeling.

In this tutorial, you will discover how to apply selective scaling of numerical input variables.

After completing this tutorial, you will know:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into three parts; they are:

- Diabetes Numerical Dataset
- Non-Selective Scaling of Numerical Inputs
  - Normalize All Input Variables
  - Standardize All Input Variables
- Selective Scaling of Numerical Inputs
  - Normalize Only Non-Gaussian Input Variables
  - Standardize Only Gaussian-Like Input Variables
  - Selectively Normalize and Standardize Input Variables

As the basis of this tutorial, we will use the so-called “diabetes” dataset that has been widely studied as a machine learning dataset since the 1990s.

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

You can learn more about the dataset here:

- Diabetes Dataset (pima-indians-diabetes.csv)
- Diabetes Dataset Description (pima-indians-diabetes.names)

No need to download the dataset; we will download it automatically as part of the worked examples that follow.

Looking at the data, we can see that all eight input variables are numerical.

```
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...
```

We can load this dataset into memory using the Pandas library.

The example below downloads and summarizes the diabetes dataset.

```python
# load and summarize the diabetes dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
dataset = read_csv(url, header=None)
# summarize the shape of the dataset
print(dataset.shape)
# histograms of the variables
dataset.hist()
pyplot.show()
```

Running the example first downloads the dataset and loads it as a DataFrame.

The shape of the dataset is printed, confirming 768 rows and nine variables: eight input and one target.

(768, 9)

Finally, a plot is created showing a histogram for each variable in the dataset.

This is useful as we can see that some variables have a Gaussian or Gaussian-like distribution (1, 2, 5) and others have an exponential-like distribution (0, 3, 4, 6, 7). This may suggest the need for different numerical data transforms for the different types of input variables.

Now that we are a little familiar with the dataset, let’s try fitting and evaluating a model on the raw dataset.

We will use a logistic regression model, as it is a robust and effective linear model for binary classification tasks. We will evaluate the model using repeated stratified k-fold cross-validation, a best practice, with 10 folds and three repeats.

The complete example is listed below.

```python
# evaluate a logistic regression model on the raw diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the model and reports the mean and standard deviation accuracy for fitting a logistic regression model on the raw dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 76.8 percent.

Accuracy: 0.768 (0.040)

Now that we have established a baseline in performance on the dataset, let’s see if we can improve the performance using data scaling.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Many algorithms prefer or require that input variables are scaled to a consistent range prior to fitting a model.

This includes the logistic regression model, which assumes input variables have a Gaussian probability distribution. Standardized input variables may also result in a more numerically stable model. Nevertheless, even when these expectations are violated, logistic regression can still perform well, or even best, for a given dataset, as may be the case for the diabetes dataset.

Two common techniques for scaling numerical input variables are normalization and standardization.

Normalization scales each input variable to the range 0-1 and can be implemented using the MinMaxScaler class in scikit-learn. Standardization scales each input variable to have a mean of 0.0 and a standard deviation of 1.0 and can be implemented using the StandardScaler class in scikit-learn.
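
As a quick illustration, the small sketch below (using a made-up column of values, not the diabetes data) shows how each scaler rescales the same column:

```python
# demonstrate MinMaxScaler and StandardScaler on a tiny made-up column
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# a single made-up column of values
data = asarray([[10.0], [20.0], [30.0], [40.0], [50.0]])
# normalization: (x - min) / (max - min), maps values into [0, 1]
print(MinMaxScaler().fit_transform(data).flatten())
# standardization: (x - mean) / stdev, gives mean 0.0 and stdev 1.0
print(StandardScaler().fit_transform(data).flatten())
```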

To learn more about normalization, standardization, and how to use these methods in scikit-learn, see the tutorial:

- How to Use StandardScaler and MinMaxScaler Transforms in Python

A naive approach to data scaling applies a single transform to all input variables, regardless of their scale or probability distribution. And this is often effective.

Let’s try normalizing and standardizing all input variables directly and compare the performance to the baseline logistic regression model fit on the raw data.

We can update the baseline code example to use a modeling pipeline where the first step is to apply a scaler and the final step is to fit the model.

This ensures that the scaling operation is fit or prepared on the training set only and then applied to the train and test sets during the cross-validation process, avoiding data leakage. Data leakage can result in an optimistically biased estimate of model performance.

This can be achieved using the Pipeline class where each step in the pipeline is defined as a tuple with a name and the instance of the transform or model to use.

```python
...
# define the modeling pipeline
scaler = MinMaxScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])
```

Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables normalized is listed below.

```python
# evaluate a logistic regression model on the normalized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
scaler = MinMaxScaler()
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the normalized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the normalization of the input variables has resulted in a drop in the mean classification accuracy from 76.8 percent with a model fit on the raw data to about 76.4 percent for the pipeline with normalization.

Accuracy: 0.764 (0.045)

Next, let’s try standardizing all input variables.

We can update the modeling pipeline to use standardization instead of normalization for all input variables prior to fitting and evaluating the logistic regression model.

This might be an appropriate transform for those input variables with a Gaussian-like distribution, but perhaps not the other variables.

```python
...
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])
```

Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables standardized is listed below.

```python
# evaluate a logistic regression model on the standardized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the standardized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that standardizing all numerical input variables has resulted in a lift in mean classification accuracy from 76.8 percent with a model evaluated on the raw dataset to about 77.2 percent for a model evaluated on the dataset with standardized input variables.

Accuracy: 0.772 (0.043)

So far, we have learned that normalizing all input variables does not help performance on this dataset, but standardizing them all does.

Next, let’s explore if selectively applying scaling to the input variables can offer further improvement.

Data transforms can be applied selectively to input variables using the ColumnTransformer class in scikit-learn.

It allows you to specify the transform (or pipeline of transforms) to apply and the column indexes to apply them to. This can then be used as part of a modeling pipeline and evaluated using cross-validation.

You can learn more about how to use the ColumnTransformer in the tutorial:

- How to Use the ColumnTransformer for Data Preparation

We can explore using the ColumnTransformer to selectively apply normalization and standardization to the numerical input variables of the diabetes dataset in order to see if we can achieve further performance improvements.

First, let’s try normalizing just those input variables that do not have a Gaussian-like probability distribution and leave the rest of the input variables alone in the raw state.

We can define two groups of input variables using the column indexes, one for the variables with a Gaussian-like distribution, and one for the input variables with the exponential-like distribution.

```python
...
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
```

We can then selectively normalize the “*exp_ix*” group and let the other input variables pass through without any data preparation.

```python
...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
```

The selective transform can then be used as part of our modeling pipeline.

```python
...
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
```

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization of some input variables is listed below.

```python
# evaluate a logistic regression model on the diabetes dataset with selective normalization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, we can see slightly better performance, with mean accuracy increasing from 76.8 percent for the baseline model fit on the raw dataset to about 76.9 percent with selective normalization of some input variables.

The results are not as good as standardizing all input variables though.

Accuracy: 0.769 (0.043)

We can repeat the experiment from the previous section, although in this case, selectively standardize those input variables that have a Gaussian-like distribution and leave the remaining input variables untouched.

```python
...
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
```

Tying this together, the complete example of evaluating a logistic regression model on data with selective standardization of some input variables is listed below.

```python
# evaluate a logistic regression model on the diabetes dataset with selective standardization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, we can see that we achieved a lift in performance over both the baseline model fit on the raw dataset (76.8 percent) and the standardization of all input variables (77.2 percent). With selective standardization, we have achieved a mean accuracy of about 77.3 percent, a modest but measurable bump.

Accuracy: 0.773 (0.041)

The results so far raise the question as to whether we can get a further lift by combining the use of selective normalization and standardization on the dataset at the same time.

This can be achieved by defining both transforms and their respective column indexes for the ColumnTransformer class, with no remaining variables being passed through.

```python
...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)
```

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization and standardization of the input variables is listed below.

```python
# evaluate a logistic regression model on the diabetes dataset with selective scaling
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))
```

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, interestingly, we can see that we have achieved the same performance as standardizing all input variables with 77.2 percent.

Further, the results suggest that the chosen model performs better when the non-Gaussian-like variables are left as-is than when they are standardized or normalized.

I would not have guessed at this finding, which highlights the importance of careful experimentation.

Accuracy: 0.772 (0.040)

**Can you do better?**

Try other transforms or combinations of transforms and see if you can achieve better results.

Share your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- How to Use the ColumnTransformer for Data Preparation
- How to Use StandardScaler and MinMaxScaler Transforms in Python

In this tutorial, you discovered how to apply selective scaling of numerical input variables.

Specifically, you learned:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


**Add Binary Flags for Missing Values for Machine Learning**

A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as normal but gives no indication to the model that the row originally contained missing values.

One approach to address this issue is to include additional binary flag input features that indicate whether a row or a column contained a missing value that was imputed. This additional information may or may not be helpful to the model in predicting the target value.

In this tutorial, you will discover how to **add binary flags for missing values** for modeling.

After completing this tutorial, you will know:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

**Updated Jul/2020**: Fixed bug in the creation of the flag variable.

This tutorial is divided into three parts; they are:

- Imputing the Horse Colic Dataset
- Model With a Binary Flag for Missing Values
- Model With Indicators of All Missing Values

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows, 27 input variables, and one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below is an example of rows from the dataset with marked missing values.

```
2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2
1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2
2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1
1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1
...
```


No need to download the dataset as we will download it automatically in the worked examples.

It is a best practice to mark missing values with a NaN (not a number) value when loading a dataset in Python.

We can load the dataset using the read_csv() Pandas function and specify the “*na_values*” to load values of ‘?’ as missing, marked with a NaN value.

The example below downloads the dataset, marks “?” values as NaN (missing) and summarizes the shape of the dataset.

```python
# summarize the horse colic dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape, y.shape)
```

Running the example downloads the dataset and reports the number of rows and columns, matching our expectations.

(300, 27) (300,)

Next, we can evaluate a model on this dataset.

We can use the SimpleImputer class to perform statistical imputation and replace the missing values with the mean of each column. We can then fit a random forest model on the dataset.

For more on how to use the SimpleImputer class, see the tutorial:

- Statistical Imputation for Missing Values in Machine Learning

To achieve this, we will define a pipeline that first performs imputation and then fits the model, and evaluate this modeling pipeline using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

```python
# evaluate mean imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the random forest with mean statistical imputation on the horse colic dataset.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, the pipeline achieved an estimated classification accuracy of about 86.2 percent.

Mean Accuracy: 0.862 (0.056)

Next, let’s see if we can improve the performance of the model by providing more information about missing values.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In the previous section, we replaced missing values with a calculated statistic.

The model is unaware that missing values were replaced.

It is possible that knowledge of whether a row contains a missing value or not will be useful to the model when making a prediction.

One approach to exposing the model to this knowledge is by providing an additional column that is a binary flag indicating whether the row had a missing value or not.

- 0: Row does not contain a missing value.
- 1: Row contains a missing value (which was/will be imputed).

This can be achieved directly on the loaded dataset. First, we can sum the values for each row to create a new column where if the row contains at least one NaN, then the sum will be a NaN.

We can then mark all values in the new column as 1 if they contain a NaN, or 0 otherwise.

Finally, we can add this column to the loaded dataset.

Tying this together, the complete example of adding a binary flag to indicate one or more missing values in each row is listed below.

```python
# add a binary flag that indicates if a row contains a missing value
from numpy import isnan
from numpy import hstack
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all non-nan as 0
a[~isnan(a)] = 0
# mark all nan as 1
a[isnan(a)] = 1
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
print(X.shape)
```

Running the example first downloads the dataset and reports the number of rows and columns, as expected.

Then the new binary variable indicating whether a row contains a missing value is created and added to the end of the input variables. The shape of the input data is then reported, confirming the addition of the feature, from 27 to 28 columns.

```
(300, 27)
(300, 28)
```

We can then evaluate the model as we did in the previous section with the additional binary flag and see if it impacts model performance.

The complete example is listed below.

```python
# evaluate model performance with a binary flag for missing values and imputed missing
from numpy import isnan
from numpy import hstack
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all non-nan as 0
a[~isnan(a)] = 0
# mark all nan as 1
a[isnan(a)] = 1
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional feature and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a modest lift in performance from 86.2 percent to 86.3 percent. The difference is small and may not be statistically significant.

Mean Accuracy: 0.863 (0.055)

Most rows in this dataset have a missing value, and this approach might be more beneficial on datasets with fewer missing values.

Next, let’s see if we can provide even more information about the missing values to the model.

In the previous section, we added one additional column to indicate whether a row contains a missing value or not.

One step further is to indicate whether each input value was missing and imputed or not. This effectively adds one additional column for each input variable that contains missing values and may offer benefit to the model.

This can be achieved by setting the “*add_indicator*” argument to *True* when defining the SimpleImputer instance.

```python
...
# impute and mark missing values
X = SimpleImputer(add_indicator=True).fit_transform(X)
```

We can demonstrate this with a worked example.

The example below loads the horse colic dataset as before, then imputes the missing values on the entire dataset and adds indicator variables for each input variable that has missing values.

```python
# impute and add indicators for columns with missing values
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# impute and mark missing values
X = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X)
print(X.shape)
```

Running the example first downloads and summarizes the shape of the dataset as expected, then applies the imputation and adds the binary (1 and 0 values) columns indicating whether each row contains a missing value for a given input variable.

We can see that the number of input variables has increased from 27 to 48, indicating the addition of 21 binary input variables, and in turn, that 21 of the 27 input variables must contain at least one missing value.

```
(300, 27)
(300, 48)
```

Next, we can evaluate the model with this additional information.

The complete example below demonstrates this.

```python
# evaluate imputation with added indicators features on the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer(add_indicator=True)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional indicators features and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a nice lift in performance from 86.3 percent in the previous section to 86.7 percent.

This suggests that adding one flag per column that was imputed is a better strategy on this dataset and chosen model.

Mean Accuracy: 0.867 (0.055)

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- Statistical Imputation for Missing Values in Machine Learning
- How to Handle Missing Data with Python

In this tutorial, you discovered how to add binary flags for missing values for modeling.

Specifically, you learned:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


**How to Create Custom Data Transforms for Scikit-Learn**

There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually on the data, requiring custom code.

The scikit-learn library provides a way to wrap these **custom data transforms** in a standard way so they can be used just like any other transform, either on data directly or as a part of a modeling pipeline.

In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.

After completing this tutorial, you will know:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into four parts; they are:

- Custom Data Transforms in Scikit-Learn
- Oil Spill Dataset
- Custom Transform to Remove Columns
- Custom Transform to Replace Outliers

Data preparation refers to changing the raw data in some way that makes it more appropriate for predictive modeling with machine learning algorithms.

The scikit-learn Python machine learning library offers many different data preparation techniques directly, such as techniques for scaling numerical input variables and changing the probability distribution of variables.

These transforms can be fit and then applied on a dataset or used as part of a predictive modeling pipeline, allowing a sequence of transforms to be applied correctly without data leakage when evaluating model performance with data sampling techniques, such as k-fold cross-validation.

Although the data preparation techniques available in scikit-learn are extensive, there may be additional data preparation steps that are required.

Typically, these additional steps are performed manually prior to modeling and require writing custom code. The risk is that these data preparation steps may be performed inconsistently.

The solution is to create a custom data transform in scikit-learn using the FunctionTransformer class.

This class allows you to specify a function that is called to transform the data. You can define the function and perform any valid change, such as changing values or removing columns of data (not removing rows).

The class can then be used just like any other data transform in scikit-learn, e.g. to transform data directly, or used in a modeling pipeline.

The catch is that **the transform is stateless**, meaning that no state can be kept.

This means that the transform cannot be used to calculate statistics on the training dataset that are then used to transform the train and test datasets.
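
As a minimal sketch of the idea (the function and data here are made up for illustration), a stateless function can be wrapped and used like any other transform:

```python
# minimal sketch: wrap a stateless custom function with FunctionTransformer
from numpy import asarray
from sklearn.preprocessing import FunctionTransformer

# a hypothetical custom transform that clips all values into the range [0, 100]
def clip_values(X):
    return X.clip(0, 100)

# wrap the function so it can be used like any other scikit-learn transform
trans = FunctionTransformer(clip_values)
data = asarray([[-5.0, 50.0], [120.0, 99.0]])
# fit() learns nothing here; the function is simply applied to the data
print(trans.fit_transform(data))
```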

In addition to custom scaling operations, this can be helpful for standard data cleaning operations, such as identifying and removing columns with few unique values, and identifying and replacing outliers.

We will explore both of these cases, but first, let’s define a dataset that we can use as the basis for exploration.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The so-called “oil spill” dataset is a standard machine learning dataset.

The task involves predicting whether a patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean, given a vector that describes the contents of a patch of a satellite image.

There are 937 cases. Each case is composed of 48 numerical computer vision derived features, a patch number, and a class label.

The normal case is no oil spill assigned the class label of 0, whereas an oil spill is indicated by a class label of 1. There are 896 cases for no oil spill and 41 cases of an oil spill.

You can access the entire dataset here:

- Oil Spill Dataset (oil-spill.csv)

Review the contents of the file.

The first few lines of the file should look as follows:

```
1,2558,1506.09,456.63,90,6395000,40.88,7.89,29780,0.19,214.7,0.21,0.26,0.49,0.1,0.4,99.59,32.19,1.84,0.16,0.2,87.65,0,0.47,132.78,-0.01,3.78,0.22,3.2,-3.71,-0.18,2.19,0,2.19,310,16110,0,138.68,89,69,2850,1000,763.16,135.46,3.73,0,33243.19,65.74,7.95,1
2,22325,79.11,841.03,180,55812500,51.11,1.21,61900,0.02,901.7,0.02,0.03,0.11,0.01,0.11,6058.23,4061.15,2.3,0.02,0.02,87.65,0,0.58,132.78,-0.01,3.78,0.84,7.09,-2.21,0,0,0,0,704,40140,0,68.65,89,69,5750,11500,9593.48,1648.8,0.6,0,51572.04,65.73,6.26,0
3,115,1449.85,608.43,88,287500,40.42,7.34,3340,0.18,86.1,0.21,0.32,0.5,0.17,0.34,71.2,16.73,1.82,0.19,0.29,87.65,0,0.46,132.78,-0.01,3.78,0.7,4.79,-3.36,-0.23,1.95,0,1.95,29,1530,0.01,38.8,89,69,1400,250,150,45.13,9.33,1,31692.84,65.81,7.84,1
4,1201,1562.53,295.65,66,3002500,42.4,7.97,18030,0.19,166.5,0.21,0.26,0.48,0.1,0.38,120.22,33.47,1.91,0.16,0.21,87.65,0,0.48,132.78,-0.01,3.78,0.84,6.78,-3.54,-0.33,2.2,0,2.2,183,10080,0,108.27,89,69,6041.52,761.58,453.21,144.97,13.33,1,37696.21,65.67,8.07,1
5,312,950.27,440.86,37,780000,41.43,7.03,3350,0.17,232.8,0.15,0.19,0.35,0.09,0.26,289.19,48.68,1.86,0.13,0.16,87.65,0,0.47,132.78,-0.01,3.78,0.02,2.28,-3.44,-0.44,2.19,0,2.19,45,2340,0,14.39,89,69,1320.04,710.63,512.54,109.16,2.58,0,29038.17,65.66,7.35,0
...
```

We can see that the first column contains integers for the patch number. We can also see that the computer vision derived features are real-valued with differing scales, such as thousands in the second column and fractions in other columns.

This dataset contains columns with very few unique values and columns with outliers that provide a good basis for data cleaning.

The example below downloads the dataset, loads it as a NumPy array, and summarizes the number of rows and columns.

```python
# load the oil dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
```

Running the example loads the dataset and confirms the expected number of rows and columns.

(937, 49) (937,)

Now that we have a dataset that we can use as the basis for data transforms, let’s look at how we can define some custom data cleaning transforms using the *FunctionTransformer* class.

Columns that have few unique values are probably not contributing anything useful to predicting the target value.

This is not absolutely true, but it is true enough that you should test the performance of your model fit on a dataset with columns of this type removed.

This is a type of data cleaning, and there is a data transform provided in scikit-learn called the VarianceThreshold that attempts to address this using the variance of each column.
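
For contrast, a minimal sketch of that built-in approach might look as follows (made-up data; a threshold of 0.0 removes only constant columns):

```python
# sketch: remove zero-variance (constant) columns with VarianceThreshold
from numpy import asarray
from sklearn.feature_selection import VarianceThreshold
# made-up data where the middle column is constant
data = asarray([[1.0, 7.0, 3.0], [2.0, 7.0, 4.0], [3.0, 7.0, 5.0]])
# threshold=0.0 removes columns whose variance is zero
print(VarianceThreshold(threshold=0.0).fit_transform(data))
```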

Another approach is to remove columns that have a specified number of unique values or fewer, such as 1.

We can develop a function that applies this transform and use the minimum number of unique values as a configurable default argument. We will also add some debugging to confirm it is working as we expect.

First, the number of unique values for each column can be calculated. Then, columns with equal to or fewer than the minimum number of unique values can be identified. Finally, those identified columns can be removed from the dataset.

The *cust_transform()* function below implements this.

```python
# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
    # get number of unique values for each column
    counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
    if verbose:
        print('Unique Values: %s' % counts)
    # select columns to delete
    to_del = [i for i,v in enumerate(counts) if v <= min_values]
    if verbose:
        print('Deleting: %s' % to_del)
    if len(to_del) == 0:
        return X
    # select all but the columns that are being removed
    ix = [i for i in range(X.shape[1]) if i not in to_del]
    result = X[:, ix]
    return result
```

We can then use this function in the FunctionTransformer.

A limitation of this transform is that it selects columns to delete based on the provided data. This means that if a train and test dataset differ greatly, then it is possible for different columns to be removed from each, making model evaluation challenging (unstable). As such, it is best to keep the minimum number of unique values small, such as 1.

We can use this transform on the oil spill dataset. The complete example is listed below.

```python
# custom data transform for removing columns with few unique values
from numpy import unique
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
    # load the dataset
    df = read_csv(path, header=None)
    data = df.values
    # split data into inputs and outputs
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
    # get number of unique values for each column
    counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
    if verbose:
        print('Unique Values: %s' % counts)
    # select columns to delete
    to_del = [i for i,v in enumerate(counts) if v <= min_values]
    if verbose:
        print('Deleting: %s' % to_del)
    if len(to_del) == 0:
        return X
    # select all but the columns that are being removed
    ix = [i for i in range(X.shape[1]) if i not in to_del]
    result = X[:, ix]
    return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)
```

Running the example first reports the number of rows and columns in the raw dataset.

Next, a list is printed that shows the number of unique values observed for each column in the dataset. We can see that many columns have very few unique values.

The columns with one (or fewer) unique values are then identified and reported. In this case, column index 22. This column is removed from the dataset.

Finally, the shape of the transformed dataset is reported, showing 48 instead of 49 columns, confirming that the column with a single unique value was deleted.

```
(937, 49) (937,)
Unique Values: [238, 297, 927, 933, 179, 375, 820, 618, 561, 57, 577, 59, 73, 107, 53, 91, 893, 810, 170, 53, 68, 9, 1, 92, 9, 8, 9, 308, 447, 392, 107, 42, 4, 45, 141, 110, 3, 758, 9, 9, 388, 220, 644, 649, 499, 2, 937, 169, 286]
Deleting: [22]
(937, 48)
```

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different minimum number of unique values.
- Use a percentage rather than an absolute number of unique values.

If you explore any of these extensions, let me know in the comments below.

Next, let’s look at a transform that replaces values in the dataset.

Outliers are observations that are different or unlike the other observations.

If we consider one variable at a time, an outlier would be a value that is far from the center of mass (the rest of the values), meaning it is rare or has a low probability of being observed.

There are standard ways for identifying outliers for common probability distributions. For Gaussian data, we can identify outliers as observations that are three or more standard deviations from the mean.

This may or may not be a desirable way to identify outliers for data that has many input variables, yet can be effective in some cases.

We can identify outliers in this way and replace their value with a correction, such as the mean.

Each column is considered one at a time and mean and standard deviation statistics are calculated. Using these statistics, upper and lower bounds of “*normal*” values are defined, then all values that fall outside these bounds can be identified. If one or more outliers are identified, their values are then replaced with the mean value that was already calculated.

The *cust_transform()* function below implements this as a function applied to the dataset, where we parameterize the number of standard deviations from the mean and whether or not debug information will be displayed.

```python
# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
    # copy the array
    result = X.copy()
    # enumerate each column
    for i in range(result.shape[1]):
        # retrieve values for column
        col = X[:, i]
        # calculate statistics
        mu, sigma = mean(col), std(col)
        # define bounds
        lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev)
        # select indexes that are out of bounds
        ix = where(logical_or(col < lower, col > upper))[0]
        if verbose and len(ix) > 0:
            print('>col=%d, outliers=%d' % (i, len(ix)))
        # replace values
        result[ix, i] = mu
    return result
```

We can then use this function in the FunctionTransformer.

The method of outlier detection assumes a Gaussian probability distribution and applies to each variable independently, both of which are strong assumptions.

An additional limitation of this implementation is that the mean and standard deviation statistics are calculated on the provided dataset, meaning that the definition of an outlier and its replacement value are both relative to the dataset. This means that different definitions of outliers and different replacement values could be used if the transform is used on the train and test sets.

We can use this transform on the oil spill dataset. The complete example is listed below.

```python
# custom data transform for replacing outliers
from numpy import mean
from numpy import std
from numpy import where
from numpy import logical_or
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
    # load the dataset
    df = read_csv(path, header=None)
    data = df.values
    # split data into inputs and outputs
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
    # copy the array
    result = X.copy()
    # enumerate each column
    for i in range(result.shape[1]):
        # retrieve values for column
        col = X[:, i]
        # calculate statistics
        mu, sigma = mean(col), std(col)
        # define bounds
        lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev)
        # select indexes that are out of bounds
        ix = where(logical_or(col < lower, col > upper))[0]
        if verbose and len(ix) > 0:
            print('>col=%d, outliers=%d' % (i, len(ix)))
        # replace values
        result[ix, i] = mu
    return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)
```

Running the example first reports the shape of the dataset prior to any change.

Next, the number of outliers for each column is calculated and only those columns with one or more outliers are reported in the output. We can see that a total of 32 columns in the dataset have one or more outliers.

The outliers are then replaced with the column mean, and the shape of the resulting dataset is reported, confirming no change in the number of rows or columns.

```
(937, 49) (937,)
>col=0, outliers=10
>col=1, outliers=8
>col=3, outliers=8
>col=5, outliers=7
>col=6, outliers=1
>col=7, outliers=12
>col=8, outliers=15
>col=9, outliers=14
>col=10, outliers=19
>col=11, outliers=17
>col=12, outliers=22
>col=13, outliers=2
>col=14, outliers=16
>col=15, outliers=8
>col=16, outliers=8
>col=17, outliers=6
>col=19, outliers=12
>col=20, outliers=20
>col=27, outliers=14
>col=28, outliers=18
>col=29, outliers=2
>col=30, outliers=13
>col=32, outliers=3
>col=34, outliers=14
>col=35, outliers=15
>col=37, outliers=13
>col=40, outliers=18
>col=41, outliers=13
>col=42, outliers=12
>col=43, outliers=12
>col=44, outliers=19
>col=46, outliers=21
(937, 49)
```

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different number of standard deviations from the mean, such as 2 or 4.
- Use a different definition of outlier, such as the IQR or a model (see the sketch after this list).
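
As a hedged sketch of the IQR idea mentioned above (not the tutorial's implementation, and using a made-up column), the bounds can be defined from the interquartile range rather than the standard deviation:

```python
# sketch: IQR-based outlier replacement for a single made-up column
from numpy import asarray
from numpy import percentile
from numpy import median
from numpy import where
from numpy import logical_or
col = asarray([3.0, 4.0, 5.0, 4.0, 3.0, 100.0])
# compute the 25th and 75th percentiles and the interquartile range
q25, q75 = percentile(col, 25), percentile(col, 75)
iqr = q75 - q25
# define bounds as 1.5 * IQR beyond the quartiles (the common rule of thumb)
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
# replace out-of-bounds values with the (robust) median
ix = where(logical_or(col < lower, col > upper))[0]
col[ix] = median(col)
print(col)
```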

If you explore any of these extensions, let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- How to Remove Outliers for Machine Learning
- How to Perform Data Cleaning for Machine Learning with Python

In this tutorial, you discovered how to define and use custom data transforms for scikit-learn.

Specifically, you learned:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


**How to Grid Search Data Preparation Techniques**

The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to grid search a suite of common and commonly useful data preparation techniques applied to the raw data. This is an alternative philosophy for data preparation that **treats data transforms as another hyperparameter** of the modeling pipeline to be searched and tuned.

This approach requires less expertise than the traditional manual approach to data preparation, although it is computationally costly. The benefit is that it can aid in the discovery of non-intuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use the grid search approach for data preparation with tabular data.

After completing this tutorial, you will know:

- Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
- How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to grid search sequences of data preparation methods to further improve model performance.

Let’s get started.

This tutorial is divided into three parts; they are:

- Grid Search Technique for Data Preparation
- Dataset and Performance Baseline
- Wine Classification Dataset
- Baseline Model Performance

- Grid Search Approach to Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective but is also slow and can require deep expertise with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configurations.

This might be a data transform that “*should not work*” or “*should not be appropriate for the algorithm*” yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed “*absolutely required*” yet results in good or great performance.

This can be achieved by designing a **grid search of data preparation techniques** and/or sequences of data preparation techniques in pipelines. This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.

The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth the non-obvious and unintuitive solutions to practitioners without the need for deep expertise.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.


In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the grid search method of data preparation in the next section.

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
...

The example below loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can define a function to load the dataset and perform some minimal data preparation to ensure the inputs are numeric and the target is label encoded.

# prepare the dataset
def load_dataset():
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
    df = read_csv(url, header=None)
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

# evaluate a model
def evaluate_model(X, y, model):
    # define the cross-validation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

We can then call the function to load the dataset, define our model, then evaluate it, reporting the mean and standard deviation accuracy.

...
# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# evaluate the model
scores = evaluate_model(X, y, model)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# prepare the dataset
def load_dataset():
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
    df = read_csv(url, header=None)
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define the cross-validation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# evaluate the model
scores = evaluate_model(X, y, model)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the logistic regression model fit on the raw input data achieved an average classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the grid-search-based approach to data preparation.

In this section, we can explore whether we can improve performance using the grid search approach to data preparation.

The first step is to define a series of modeling pipelines to evaluate, where each pipeline defines one (or more) data preparation techniques and ends with a model that takes the transformed data as input.

We will define a function to create these pipelines as a list of tuples, where each tuple defines the short name for the pipeline and the pipeline itself. We will evaluate a range of different data scaling methods (e.g. MinMaxScaler and StandardScaler), distribution transforms (QuantileTransformer and KBinsDiscretizer), as well as dimensionality reduction transforms (PCA and SVD).

# get modeling pipelines to evaluate
def get_pipelines(model):
    pipelines = list()
    # normalize
    p = Pipeline([('s',MinMaxScaler()), ('m',model)])
    pipelines.append(('norm', p))
    # standardize
    p = Pipeline([('s',StandardScaler()), ('m',model)])
    pipelines.append(('std', p))
    # quantile
    p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])
    pipelines.append(('quan', p))
    # discretize
    p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])
    pipelines.append(('kbins', p))
    # pca
    p = Pipeline([('s',PCA(n_components=7)), ('m',model)])
    pipelines.append(('pca', p))
    # svd
    p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])
    pipelines.append(('svd', p))
    return pipelines

We can then call this function to get the list of transforms, then enumerate each, evaluating it and reporting the performance along the way.

...
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
    # evaluate
    scores = evaluate_model(X, y, pipeline)
    # summarize
    print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # store
    results.append(scores)
    names.append(name)

At the end of the run, we can create a box and whisker plot for each set of scores and compare the distributions of results visually.

...
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this together, the complete example of grid searching data preparation techniques on the wine classification dataset is listed below.

# compare data preparation methods for the wine classification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from matplotlib import pyplot

# prepare the dataset
def load_dataset():
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
    df = read_csv(url, header=None)
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define the cross-validation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# get modeling pipelines to evaluate
def get_pipelines(model):
    pipelines = list()
    # normalize
    p = Pipeline([('s',MinMaxScaler()), ('m',model)])
    pipelines.append(('norm', p))
    # standardize
    p = Pipeline([('s',StandardScaler()), ('m',model)])
    pipelines.append(('std', p))
    # quantile
    p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])
    pipelines.append(('quan', p))
    # discretize
    p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])
    pipelines.append(('kbins', p))
    # pca
    p = Pipeline([('s',PCA(n_components=7)), ('m',model)])
    pipelines.append(('pca', p))
    # svd
    p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])
    pipelines.append(('svd', p))
    return pipelines

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
    # evaluate
    scores = evaluate_model(X, y, pipeline)
    # summarize
    print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # store
    results.append(scores)
    names.append(name)
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that standardizing the input variables and using a quantile transform both achieve the best result with a classification accuracy of about 98.7 percent, an improvement over the baseline with no data preparation that achieved a classification accuracy of 95.3 percent.

You can add your own modeling pipelines to the *get_pipelines()* function and compare their result.

**Can you get better results?**

Let me know in the comments below.

>norm: 0.976 (0.031)
>std: 0.987 (0.023)
>quan: 0.987 (0.023)
>kbins: 0.968 (0.045)
>pca: 0.963 (0.039)
>svd: 0.953 (0.048)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique. We can see that the distributions of scores for the standardization and quantile transforms are compact and very similar, each with an outlier. We can see that the spread of scores for the other transforms is larger and skews downward.

The results suggest that standardizing the dataset is probably an important data preparation step. Related transforms, such as the quantile transform, and perhaps even the power transform, may offer benefit when combined with standardization by making one or more input variables more Gaussian.

We can also explore sequences of transforms to see if they can offer a lift in performance.

For example, we might want to apply RFE feature selection after the standardization transform to see if the same or better results can be achieved with fewer input variables (e.g. less complexity).

We might also want to see if a power transform preceded by a data scaling transform can achieve good performance on the dataset, as we believe it could given the success of the quantile transform.

The updated *get_pipelines()* function with sequences of transforms is provided below.

# get modeling pipelines to evaluate
def get_pipelines(model):
    pipelines = list()
    # standardize
    p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])
    pipelines.append(('std', p))
    # scale and power
    p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])
    pipelines.append(('power', p))
    return pipelines

Tying this together, the complete example is listed below.

# compare sequences of data preparation methods for the wine classification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import RFE
from matplotlib import pyplot

# prepare the dataset
def load_dataset():
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
    df = read_csv(url, header=None)
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define the cross-validation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# get modeling pipelines to evaluate
def get_pipelines(model):
    pipelines = list()
    # standardize
    p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])
    pipelines.append(('std', p))
    # scale and power
    p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])
    pipelines.append(('power', p))
    return pipelines

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
    # evaluate
    scores = evaluate_model(X, y, pipeline)
    # summarize
    print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # store
    results.append(scores)
    names.append(name)
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the standardization with feature selection offers an additional lift in accuracy from 98.7 percent to 98.9 percent, although the data scaling and power transform do not offer any additional benefit over the quantile transform.

>std: 0.989 (0.022)
>power: 0.987 (0.023)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique.

We can see that the distribution of results for both pipelines of transforms is compact, with very little spread apart from outliers.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to use a grid search approach for data preparation with tabular data.

Specifically, you learned:

- Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
- How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to grid search sequences of data preparation methods to further improve model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Grid Search Data Preparation Techniques appeared first on Machine Learning Mastery.

The post Framework for Data Preparation Techniques in Machine Learning appeared first on Machine Learning Mastery.

In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high dimensionality of the data, the ever-increasing parade of new machine learning algorithms, and the very human limitations of the practitioner.

Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know what data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike.

The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.

In this tutorial, you will discover a framework that provides a structured approach to both thinking about and grouping data preparation techniques for predictive modeling with structured data.

After completing this tutorial, you will know:

- The challenge and overwhelm of framing data preparation as yet another hyperparameter to tune in the machine learning modeling pipeline.
- A framework that defines five groups of data preparation techniques to consider.
- Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Data Preparation
- Framework for Data Preparation
- Data Preparation Techniques

Data preparation refers to transforming raw data into a form that is better suited to predictive modeling.

This may be required because the data itself contains mistakes or errors. It may also be because the chosen algorithms have expectations regarding the type and distribution of the data.

To make the task of data preparation even more challenging, it is also common that the data preparation required to get the best performance from a predictive model may not be obvious and may bend or violate the expectations of the model that is being used.

As such, it is common to treat the choice and configuration of data preparation applied to the raw data as yet another hyperparameter of the modeling pipeline to be tuned.

This framing of data preparation is very effective in practice, as it allows you to use automatic search techniques like grid search and random search to discover unintuitive data preparation steps that result in skillful predictive models.

This framing of data preparation can also feel overwhelming to beginners given the large number and variety of data preparation techniques.

The solution to this overwhelm is to think about data preparation techniques in a systematic way.


Effective data preparation requires that the data preparation techniques available are organized and considered in a structured and systematic way.

This allows you to ensure that appropriate techniques are explored for your dataset and that potentially effective techniques are not skipped or ignored.

This can be achieved using a framework to organize data preparation techniques that consider their effect on the raw dataset.

For example, structured machine learning data, such as data we might store in a CSV file for classification and regression, consists of rows, columns, and values. We might consider data preparation techniques that operate at each of these levels.

- Data Preparation for Rows
- Data Preparation for Columns
- Data Preparation for Values

Data preparation for rows may be techniques that add or remove rows of data from the dataset. Similarly, data preparation for columns may be techniques that add or remove columns (features or variables) from the dataset. Data preparation for values, meanwhile, may be techniques that change the values in the dataset, often for a given column.

There is one more type of data preparation that does not neatly fit into this structure, and that is dimensionality reduction techniques. These techniques change the columns and the values at the same time, e.g. projecting the data into a lower-dimensional space.

- Data Preparation for Columns + Values

This raises the question of techniques that might apply to rows and values at the same time. This might include data preparation that consolidates rows of data in some way.

- Data Preparation for Rows + Values

We can summarize this framework and some high-level groups of data preparation methods in the following image.

Now that we have a framework for thinking about data preparation based on their effect on the data, let’s look at examples of techniques that fit into each group.

This section explores the five high-level groups of data preparation techniques defined in the previous section and suggests specific techniques that may fall within each group.

**Did I miss one of your preferred or favorite data preparation techniques?**

Let me know in the comments below.

This group is for data preparation techniques that add or remove rows of data.

In machine learning, rows are often referred to as samples, examples, or instances.

These techniques are often used to augment a limited training dataset or to remove errors or ambiguity from the dataset.

The main class of techniques that come to mind are data preparation techniques that are often used for imbalanced classification.

This includes techniques such as SMOTE that create synthetic rows of training data for under-represented classes and random undersampling that removes examples from over-represented classes.
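For example, a minimal sketch of both ideas is listed below, using the imbalanced-learn library (a separate library, assumed installed, e.g. pip install imbalanced-learn); the synthetic dataset and sampling ratios are illustrative only:

# sketch: add synthetic minority rows with SMOTE, then undersample the majority class
# assumes the imbalanced-learn library is installed
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# contrived imbalanced dataset stands in for your own data
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print(Counter(y))
# oversample the minority class to half the size of the majority class
X, y = SMOTE(sampling_strategy=0.5, random_state=1).fit_resample(X, y)
# undersample the majority class toward the minority class
X, y = RandomUnderSampler(sampling_strategy=0.8, random_state=1).fit_resample(X, y)
print(Counter(y))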

For more on SMOTE data sampling, see the tutorial:

It also includes more advanced combined over- and undersampling techniques that attempt to identify ambiguous examples along the decision boundary of a classification problem and remove them or change their class label.

For more on these types of data preparation, see the tutorial:

This class of data preparation techniques also includes algorithms for identifying and removing outliers from the data. These are rows of data that may be far from the center of probability mass in the dataset and, in turn, may be unrepresentative of the data from the domain.

For more on outlier detection and removal methods, see the tutorial:

This group is for data preparation techniques that add or remove columns of data.

In machine learning, columns are often referred to as variables or features.

These techniques are often required to either reduce the complexity (dimensionality) of a prediction problem or to unpack compound input variables or complex interactions between features.

The main class of techniques that come to mind are feature selection techniques.

This includes techniques that use statistics to score the relevance of input variables to the target variable based on the data type of each.
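For example, a minimal sketch of statistical feature selection using scikit-learn's SelectKBest class with the ANOVA F-test is listed below; the synthetic dataset and the choice of k are illustrative only:

# sketch: score input features against the target with the ANOVA F-test and keep the best
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# contrived dataset stands in for your own data
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
# select the five most relevant columns
fs = SelectKBest(score_func=f_classif, k=5)
X_selected = fs.fit_transform(X, y)
print(X.shape, X_selected.shape)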

For more on these types of data preparation techniques, see the tutorial:

This also includes feature selection techniques that systematically test the impact of different combinations of input variables on the predictive skill of a machine learning model.

For more on these types of methods, see the tutorial:

Related are techniques that use a model to score the importance of input features based on their use by a predictive model, referred to as feature importance methods. These methods are often used for data interpretation, although they can also be used for feature selection.
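For example, one hedged sketch of this idea uses scikit-learn's permutation_importance() function on a fit model; the dataset and model below are stand-ins for your own:

# sketch: score feature importance by permuting each column and measuring the drop in skill
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
# contrived dataset and model stand in for your own
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
# permute each feature 10 times and average the change in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
for i, v in enumerate(result.importances_mean):
    print('Feature %d: %.4f' % (i, v))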

For more on these types of methods, see the tutorial:

This group of methods also brings to mind techniques for creating or deriving new columns of data, new features. These are often referred to as feature engineering, although sometimes the whole field of data preparation is referred to as feature engineering.

For example, new features that represent values raised to exponents or multiplicative combinations of features can be created and added to the dataset as new columns.
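A minimal sketch using scikit-learn's PolynomialFeatures class on a tiny contrived array is listed below:

# sketch: derive squared and interaction terms as new columns
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# contrived two-column dataset
data = asarray([[2, 3], [3, 4], [4, 5]])
# add a bias term, squares, and pairwise products
trans = PolynomialFeatures(degree=2)
print(trans.fit_transform(data))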

For more on these types of data preparation techniques, see the tutorial:

This might also include data transforms that change a variable type, such as creating dummy variables for a categorical variable, often referred to as a one-hot encoding.
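For example, a minimal sketch using scikit-learn's OneHotEncoder class on a contrived color variable is listed below:

# sketch: one-hot encode a categorical variable into dummy columns
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# contrived single categorical column
data = asarray([['red'], ['green'], ['blue']])
# note: newer scikit-learn versions use sparse_output=False instead of sparse=False
encoder = OneHotEncoder(sparse=False)
print(encoder.fit_transform(data))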

For more on these types of data preparation techniques, see the tutorial:

This group is for data preparation techniques that change the raw values in the data.

These techniques are often required to meet the expectations or requirements of specific machine learning algorithms.

The main class of techniques that come to mind is data transforms that change the scale or distribution of input variables.

For example, data transforms such as standardization and normalization change the scale of numeric input variables. Data transforms like ordinal encoding change the type of categorical input variables.

There are also many data transforms for changing the distribution of input variables.

For example, discretization (binning) changes the distribution of a numerical input variable by converting it into a categorical variable with an ordinal ranking.

For more on this type of data transform, see the tutorial:

The power transform can be used to change the distribution of data to remove a skew and make the distribution more normal (Gaussian).

For more on this method, see the tutorial:

The quantile transform is a flexible type of data preparation technique that can map a numerical input variable to different types of distributions, such as a uniform or a Gaussian (normal) distribution.

You can learn more about this data preparation technique here:

Another type of data preparation technique that belongs to this group is the set of methods that systematically change values in the dataset.

This includes techniques that identify and replace missing values, often referred to as missing value imputation. This can be achieved using statistical methods or more advanced model-based methods.
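For example, a minimal sketch of statistical (mean) imputation using scikit-learn's SimpleImputer class on a contrived array with missing values is listed below:

# sketch: replace missing values with the column mean
from numpy import nan
from numpy import asarray
from sklearn.impute import SimpleImputer
# contrived data with missing (NaN) values
data = asarray([[1.0, 2.0], [nan, 4.0], [5.0, nan]])
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(data))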

For more on these methods, see the tutorial:

All of the methods discussed could also be considered feature engineering methods (e.g. fitting into the previously discussed group of data preparation methods) if the results of the transforms are appended to the raw data as new columns.
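For example, a minimal sketch of this idea is listed below, using a FeatureUnion to append quantile-transformed copies of the input variables to the raw columns; the synthetic dataset and configuration are illustrative only:

# sketch: append quantile-transformed copies of the inputs as new columns
from sklearn.datasets import make_classification
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import QuantileTransformer
# contrived dataset stands in for your own data
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
# FunctionTransformer with no function passes the raw columns through unchanged
union = FeatureUnion([
    ('raw', FunctionTransformer()),
    ('quan', QuantileTransformer(n_quantiles=100, output_distribution='normal'))])
X_new = union.fit_transform(X)
# columns are doubled: 5 raw + 5 transformed
print(X.shape, X_new.shape)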

This group is for data preparation techniques that change both the number of columns and the values in the data.

The main class of techniques that this brings to mind are dimensionality reduction techniques that specifically reduce the number of columns and change the scale and distribution of numerical input variables.

This includes matrix factorization methods used in linear algebra as well as manifold learning algorithms used in high-dimensional statistics.

For more information on these techniques, see the tutorial:

Although these techniques are designed to create projections of rows in a lower-dimensional space, perhaps this also leaves the door open to techniques that do the inverse. That is, use all or a subset of the input variables to create a projection into a higher-dimensional space, perhaps unpacking complex non-linear relationships.

Perhaps polynomial transforms where the results replace the raw dataset would fit into this class of data preparation methods.

**Do you know of other methods that fit into this group?**

Let me know in the comments below.

This group is for data preparation techniques that change both the number of rows and the values in the data.

I have not explicitly considered data transforms of this type before, but it falls out of the framework as defined.

A group of methods that come to mind are clustering algorithms where all or subsets of rows of data in the dataset are replaced with data samples at the cluster centers, referred to as cluster centroids.
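For example, a minimal sketch using the ClusterCentroids class from the imbalanced-learn library (a separate library, assumed installed) is listed below; the synthetic dataset is illustrative only:

# sketch: replace majority-class rows with cluster centroids
# assumes the imbalanced-learn library is installed
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids
# contrived imbalanced dataset stands in for your own data
X, y = make_classification(n_samples=1000, weights=[0.9], flip_y=0, random_state=1)
print(Counter(y))
# majority-class rows are summarized by k-means centroids
X_new, y_new = ClusterCentroids(random_state=1).fit_resample(X, y)
print(Counter(y_new))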

Related might be replacing rows with exemplars (aggregates of rows) taken from specific machine learning algorithms, such as support vectors from a support vector machine, or the codebook vectors taken from a learning vector quantization.

If these aggregate rows are simply added to the dataset rather than replacing rows, then they would naturally fit into the “*Data Preparation for Rows*” group described above.

**Do you know of other methods that fit into this group?**

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered a framework for systematically grouping data preparation techniques based on their effect on raw data.

Specifically, you learned:

- The challenge and overwhelm of framing data preparation as yet another hyperparameter to tune in the machine learning modeling pipeline.
- A framework that defines five groups of data preparation techniques to consider.
- Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Framework for Data Preparation Techniques in Machine Learning appeared first on Machine Learning Mastery.

The post 6 Dimensionality Reduction Algorithms With Python appeared first on Machine Learning Mastery.

Dimensionality reduction is an unsupervised learning technique. Nevertheless, it can be used as a data transform pre-processing step for machine learning algorithms on classification and regression predictive modeling datasets with supervised learning algorithms.

There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. Instead, it is a good idea to explore a range of dimensionality reduction algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and evaluate top dimensionality reduction algorithms in Python.

After completing this tutorial, you will know:

- Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
- There are many different dimensionality reduction algorithms and no single best method for all datasets.
- How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

Let’s get started.

This tutorial is divided into three parts; they are:

- Dimensionality Reduction
- Dimensionality Reduction Algorithms
- Examples of Dimensionality Reduction
- Scikit-Learn Library Installation
- Classification Dataset
- Principal Component Analysis
- Singular Value Decomposition
- Linear Discriminant Analysis
- Isomap Embedding
- Locally Linear Embedding
- Modified Locally Linear Embedding

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.

— Page 11, Machine Learning: A Probabilistic Perspective, 2012.

High-dimensionality might mean hundreds, thousands, or even millions of input variables.

Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and may not perform well on new data.

It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.

… dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user’s attention on the most relevant variables.

— Page 289, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

As such, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.
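For example, a minimal sketch of this pattern with PCA is listed below: the transform is fit on the training set only, then the same fit transform is reused to project the test set; the dataset and configuration are illustrative only:

# sketch: fit the reduction on the training data only, then reuse it on new data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
# contrived dataset stands in for your own data
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# learn the projection from the training data only
pca = PCA(n_components=10)
pca.fit(X_train)
# apply the same learned projection to both datasets
X_train_reduced = pca.transform(X_train)
X_test_reduced = pca.transform(X_test)
print(X_train_reduced.shape, X_test_reduced.shape)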


There are many algorithms that can be used for dimensionality reduction.

Two main classes of methods are those drawn from linear algebra and those drawn from manifold learning.

Matrix factorization methods drawn from the field of linear algebra can be used for dimensionality reduction.

For more on matrix factorization, see the tutorial:

Some of the more popular methods include:

- Principal Components Analysis
- Singular Value Decomposition
- Non-Negative Matrix Factorization

Manifold learning methods seek a lower-dimensional projection of high dimensional input that captures the salient properties of the input data.

Some of the more popular methods include:

- Isomap Embedding
- Locally Linear Embedding
- Multidimensional Scaling
- Spectral Embedding
- t-distributed Stochastic Neighbor Embedding

Each algorithm offers a different approach to the challenge of discovering natural relationships in data at lower dimensions.

There is no best dimensionality reduction algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use a subset of these popular dimensionality reduction algorithms from the scikit-learn library.

The examples will provide the basis for you to copy-paste the examples and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly. For a good starting point on this topic, see:

Let’s dive in.

In this section, we will review how to use popular dimensionality reduction algorithms in scikit-learn.

This includes an example of using the dimensionality reduction technique as a data transform in a modeling pipeline and evaluating a model fit on the data.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data. There are some algorithms available in the scikit-learn library that are not demonstrated because they cannot be used as a data transform directly given the nature of the algorithm.

As such, we will use a synthetic classification dataset in each example.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

# check scikit-learn version
import sklearn
print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.23.0

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 20 input features, 10 of which are informative and 10 of which are redundant. This provides an opportunity for each technique to identify and remove redundant input features.

The fixed random seed for the pseudorandom number generator ensures we generate the same synthetic dataset each time the code runs.

An example of creating and summarizing the synthetic classification dataset is listed below.

# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and reports the number of rows and columns matching our expectations.

(1000, 20) (1000,)

It is a binary classification task and we will evaluate a LogisticRegression model after each dimensionality reduction transform.

The model will be evaluated using the gold standard of repeated stratified 10-fold cross-validation. The mean and standard deviation classification accuracy across all folds and repeats will be reported.

The example below evaluates the model on the raw dataset as a point of comparison.

# evaluate logistic regression model on raw data
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the model
model = LogisticRegression()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the logistic regression on the raw dataset with all 20 columns, achieving a classification accuracy of about 82.4 percent.

A successful dimensionality reduction transform on this data should result in a model that has better accuracy than this baseline, although this may not be possible with all techniques.

Note: we are not trying to “*solve*” this dataset, just provide working examples that you can use as a starting point.

Accuracy: 0.824 (0.034)

Next, we can start looking at examples of dimensionality reduction algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset. Each dimensionality reduction method will be configured to reduce the 20 input columns to 10 where possible.

We will use a Pipeline to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure; for example:

...
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

Let’s get started.

**Can you get a better result for one of the algorithms?**

Let me know in the comments below.

Principal Component Analysis, or PCA, might be the most popular technique for dimensionality reduction with dense data (few zero values).

For more on how PCA works, see the tutorial:

The scikit-learn library provides the PCA class implementation of Principal Component Analysis that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with PCA dimensionality reduction is listed below.

# evaluate pca with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we don’t see any lift in model performance in using the PCA transform.

Accuracy: 0.824 (0.034)

Singular Value Decomposition, or SVD, is one of the most popular techniques for dimensionality reduction for sparse data (data with many zero values).

For more on how SVD works, see the tutorial:

The scikit-learn library provides the TruncatedSVD class implementation of Singular Value Decomposition that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with SVD dimensionality reduction is listed below.

# evaluate svd with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we don’t see any lift in model performance in using the SVD transform.

Accuracy: 0.824 (0.034)

Linear Discriminant Analysis, or LDA, is a multi-class classification algorithm that can be used for dimensionality reduction.

The number of dimensions for the projection is limited to between 1 and C-1, where C is the number of classes. In this case, our dataset is a binary classification problem (two classes), limiting the number of dimensions to 1.

For more on LDA for dimensionality reduction, see the tutorial:

The scikit-learn library provides the LinearDiscriminantAnalysis class implementation of Linear Discriminant Analysis that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with LDA dimensionality reduction is listed below.

# evaluate lda with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we can see a slight lift in performance as compared to the baseline fit on the raw data.

Accuracy: 0.825 (0.034)

Isomap Embedding, or Isomap, creates an embedding of the dataset and attempts to preserve the relationships in the dataset.

The scikit-learn library provides the Isomap class implementation of Isomap Embedding that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with Isomap dimensionality reduction is listed below.

# evaluate isomap with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case, we can see a lift in performance with the Isomap data transform as compared to the baseline fit on the raw data.

Accuracy: 0.888 (0.029)

Locally Linear Embedding, or LLE, creates an embedding of the dataset and attempts to preserve the relationships between neighborhoods in the dataset.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with LLE dimensionality reduction is listed below.

# evaluate lle and logistic regression for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case, we can see a lift in performance with the LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.886 (0.028)

Modified Locally Linear Embedding, or Modified LLE, is an extension of Locally Linear Embedding that creates multiple weighting vectors for each neighborhood.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Modified Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “*method*” argument must be set to ‘modified’ and the “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform which must be less than the “*n_neighbors*” argument.

The complete example of evaluating a model with Modified LLE dimensionality reduction is listed below.

# evaluate modified lle and logistic regression for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case, we can see a lift in performance with the modified LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.846 (0.036)

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Dimensionality Reduction for Machine Learning
- Principal Component Analysis for Dimensionality Reduction in Python
- Singular Value Decomposition for Dimensionality Reduction in Python
- Linear Discriminant Analysis for Dimensionality Reduction in Python

In this tutorial, you discovered how to fit and evaluate top dimensionality reduction algorithms in Python.

Specifically, you learned:

- Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
- There are many different dimensionality reduction algorithms and no single best method for all datasets.
- How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 6 Dimensionality Reduction Algorithms With Python appeared first on Machine Learning Mastery.

The post 4 Automatic Outlier Detection Algorithms in Python appeared first on Machine Learning Mastery.

Identifying and **removing outliers** is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline and compared, just like other data preparation transforms that may be applied to the dataset.

In this tutorial, you will discover how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

After completing this tutorial, you will know:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Outlier Detection and Removal
- Dataset and Performance Baseline
- House Price Regression Dataset
- Baseline Model Performance

- Automatic Outlier Detection
- Isolation Forest
- Minimum Covariance Determinant
- Local Outlier Factor
- One-Class SVM

Outliers are observations in a dataset that don’t fit in some way.

Perhaps the most common or familiar type of outlier is the observations that are far from the rest of the observations or the center of mass of observations.

This is easy to understand when we have one or two variables and we can visualize the data as a histogram or scatter plot, although it becomes very challenging when we have many input variables defining a high-dimensional input feature space.

In this case, simple statistical methods for identifying outliers can break down, such as methods that use standard deviations or the interquartile range.

It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modeling.

Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. Importantly, each method approaches the definition of an outlier in a slightly different way, providing alternate approaches to preparing a training dataset that can be evaluated and compared, just like any other data preparation step in a modeling pipeline.
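For example, the general pattern might look like the minimal sketch below, which uses one of the methods listed above (the isolation forest) to remove flagged rows from a synthetic training dataset; the dataset and configuration are illustrative only:

# sketch: remove rows flagged as outliers from the training data only
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
# contrived regression dataset stands in for your own data
X, y = make_regression(n_samples=500, n_features=13, noise=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit_predict returns -1 for outliers and 1 for inliers
yhat = IsolationForest(contamination=0.1, random_state=1).fit_predict(X_train)
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# the test set is left untouched to avoid data leakage
print(X_train.shape, y_train.shape)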

Before we dive into automatic outlier detection methods, let’s first select a standard machine learning dataset that we can use as the basis for our investigation.


In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset.

This will provide the context for exploring the outlier identification and removal method of data preparation in the next section.

We will use the house price regression dataset.

This dataset has 13 input variables that describe the properties of the house and suburb and requires the prediction of the median value of houses in the suburb in thousands of dollars.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a regression predictive modeling problem with numerical input variables, each of which has different scales.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...

The dataset has many numerical input variables with unknown and complex relationships. We do not know whether outliers exist in this dataset, although we may suspect that some are present.

The example below loads the dataset and splits it into the input and output columns, splits it into train and test datasets, then summarizes the shapes of the data arrays.

# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 506 rows of data with 13 input variables and a single target variable.

The dataset is split into train and test sets with 339 rows used for model training and 167 for model evaluation.

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm and evaluate model performance by training the model on the training dataset, making predictions on the test data, and evaluating the predictions using the mean absolute error (MAE).

The complete example of evaluating a linear regression model on the dataset is listed below.

# evaluate model on the raw dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.417. This provides a baseline in performance to which we can compare different outlier identification and removal procedures.

MAE: 3.417

Next, we can try removing outliers from the training dataset.

The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.

In this section, we will review four methods and compare their performance on the house price dataset.

Each method will be defined, then fit on the training dataset. The fit model will then predict which examples in the training dataset are outliers and which are not (so-called inliers). The outliers will then be removed from the training dataset, then the model will be fit on the remaining examples and evaluated on the entire test dataset.
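Expressed as a rough sketch (the helper function below is our own illustration, not part of the worked examples that follow), the procedure is the same for every detector:

# sketch: evaluate a linear model with a given outlier detector applied to the training set only
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def evaluate_with_detector(detector, X_train, y_train, X_test, y_test):
    # fit the detector on the training data and flag outliers (-1) vs. inliers (1)
    labels = detector.fit_predict(X_train)
    mask = labels != -1
    # fit the model on the inliers only, then evaluate on the untouched test set
    model = LinearRegression()
    model.fit(X_train[mask, :], y_train[mask])
    return mean_absolute_error(y_test, model.predict(X_test))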

It would be invalid to fit the outlier detection method on the entire dataset as this would result in data leakage. That is, the model would have access to data (or information about the data) in the test set not used to train the model. This may result in an optimistic estimate of model performance.

We could attempt to detect outliers on “*new data*” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “*None*” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for your project.
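For illustration, a minimal sketch of that idea, assuming a fit model and a fit detector that supports predict() (e.g. IsolationForest; the wrapper function is hypothetical):

# sketch: refuse to predict on rows the detector flags as outliers
def guarded_predict(model, detector, X_new):
    predictions = list()
    for row in X_new:
        # detector.predict() returns -1 for outliers and 1 for inliers
        if detector.predict([row])[0] == -1:
            predictions.append(None)
        else:
            predictions.append(model.predict([row])[0])
    return predictions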

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

Perhaps the most important hyperparameter in the model is the “*contamination*” argument, which is used to help estimate the number of outliers in the dataset. This is a value between 0.0 and 0.5 and by default is set to 0.1.

...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)

Once identified, we can remove the outliers from the training dataset.

...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

Tying this together, the complete example of evaluating the linear model on the housing dataset with outliers identified and removed with isolation forest is listed below.

# evaluate model performance with outliers removed using isolation forest
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model identified and removed 34 outliers and achieved a MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.189

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.
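The idea can be seen directly with scikit-learn's MinCovDet estimator, which exposes robust (squared) Mahalanobis distances; a small sketch, assuming the X_train array from the examples in this tutorial:

...
# sketch: robust squared Mahalanobis distances from the minimum covariance determinant
from sklearn.covariance import MinCovDet
mcd = MinCovDet().fit(X_train)
distances = mcd.mahalanobis(X_train)
# the largest distances correspond to the most outlying rows
print(distances.argsort()[-5:])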

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “*contamination*” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)

Once identified, the outliers can be removed from the training dataset as we did in the prior example.

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the elliptical envelope (minimum covariance determinant) method is listed below.

# evaluate model performance with outliers removed using elliptical envelope
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the elliptical envelope method identified and removed only 4 outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.

(339, 13) (339,)
(335, 13) (335,)
MAE: 3.388

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score reflecting how isolated it is, based on the size of its local neighborhood. Those examples with the largest scores are more likely to be outliers.

We introduce a local outlier factor (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

The model provides the “*contamination*” argument that specifies the expected percentage of outliers in the dataset, which defaults to 0.1.

...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
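If you want to inspect the underlying scores, the fit object exposes them via the negative_outlier_factor_ attribute; a small sketch (more negative means more outlying):

...
# sketch: inspect the per-example scores after calling fit_predict()
scores = lof.negative_outlier_factor_
# the most negative scores correspond to the strongest outliers
print(scores.argsort()[:5])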

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the local outlier factor method is listed below.

# evaluate model performance with outliers removed using local outlier factor
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that the local outlier factor method identified and removed 34 outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of outliers were identified and removed.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

— Estimating the Support of a High-Dimensional Distribution, 2001.

Although SVM and One-Class SVM were developed for classification, One-Class SVM can be used to discover outliers in the input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The class provides the “*nu*” argument that specifies the approximate ratio of outliers in the dataset, which defaults to 0.5. In this case, we will set it to 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
ocsvm = OneClassSVM(nu=0.01)
yhat = ocsvm.fit_predict(X_train)

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the one class SVM method is listed below.

# evaluate model performance with outliers removed using one class SVM
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import OneClassSVM
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ocsvm = OneClassSVM(nu=0.01)
yhat = ocsvm.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that only three outliers were identified and removed and the model achieved a MAE of about 3.431, which is not better than the baseline model that achieved 3.417. Perhaps better performance can be achieved with more tuning.

(339, 13) (339,)
(336, 13) (336,)
MAE: 3.431
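For example, a minimal sketch of tuning “*nu*” with a simple loop, continuing on from the example above (the candidate values are arbitrary illustrations, and X_train and y_train are the freshly split arrays, before any rows are removed):

...
# sketch: try a range of nu values and report the resulting MAE
for nu in [0.01, 0.05, 0.1, 0.2]:
    ocsvm = OneClassSVM(nu=nu)
    mask = ocsvm.fit_predict(X_train) != -1
    model = LinearRegression()
    model.fit(X_train[mask, :], y_train[mask])
    print('nu=%.2f, MAE: %.3f' % (nu, mean_absolute_error(y_test, model.predict(X_test))))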

This section provides more resources on the topic if you are looking to go deeper.

- One-Class Classification Algorithms for Imbalanced Datasets
- How to Remove Outliers for Machine Learning

- Isolation Forest, 2008.
- Minimum Covariance Determinant and Extensions, 2017.
- LOF: Identifying Density-based Local Outliers, 2000.
- Estimating the Support of a High-Dimensional Distribution, 2001.

- Novelty and Outlier Detection, scikit-learn user guide.
- sklearn.covariance.EllipticEnvelope API.
- sklearn.svm.OneClassSVM API.
- sklearn.neighbors.LocalOutlierFactor API.
- sklearn.ensemble.IsolationForest API.

In this tutorial, you discovered how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

Specifically, you learned:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 4 Automatic Outlier Detection Algorithms in Python appeared first on Machine Learning Mastery.

The post How to Use Feature Extraction on Tabular Data for Machine Learning appeared first on Machine Learning Mastery.

The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to apply a suite of common and commonly useful data preparation techniques to the raw data in parallel and combine the results of all of the transforms together into a single large dataset from which a model can be fit and evaluated.

This is an alternative philosophy for data preparation that treats data transforms as an approach to extract salient features from raw data to expose the structure of the problem to the learning algorithms. It requires learning algorithms that are capable of weighting input features and using those input features that are most relevant to the target being predicted.

This approach requires less expertise, is computationally cheaper than a full grid search of data preparation methods, and can aid in the discovery of unintuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use feature extraction for data preparation with tabular data.

After completing this tutorial, you will know:

- Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
- How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Feature Extraction Technique for Data Preparation
- Dataset and Performance Baseline
- Wine Classification Dataset
- Baseline Model Performance

- Feature Extraction Approach to Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective, but is also slow and can require deep expertise both with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration.

This too can be an effective approach exposing unintuitive solutions and requiring very little expertise, although it can be computationally expensive.

An approach that seeks a middle ground between these two approaches to data preparation is to treat the transformation of input data as a **feature engineering** or **feature extraction** procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, aggregating all of the features together to create one large dataset, and then fitting and evaluating a model on this data.

The philosophy of the approach treats each data preparation technique as a transform that extracts salient features from raw data to be presented to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the “**Feature Engineering Method**” or the “**Feature Extraction Method**” for configuring data preparation for a predictive modeling project.

It allows data analysis and algorithm expertise to be used in the selection of data preparation methods and allows unintuitive solutions to be found but at a much lower computational cost.

The explosion in the number of input features can also be explicitly addressed through the use of feature selection techniques that attempt to rank-order the importance or value of the vast number of extracted features and select only a small subset of the most relevant to predicting the target variable.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the feature extraction method of data preparation in the next section.

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
...

The example loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can perform minimum data preparation by ensuring the input variables are numeric and that the target variable is label encoded, as expected by the scikit-learn library.

...
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))

Next, we can define our predictive model.

...
# define the model
model = LogisticRegression(solver='liblinear')

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

Model performance will be evaluated using classification accuracy.

...
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

At the end of the run, we will report the mean and standard deviation of the accuracy scores collected across all repeats and evaluation folds.

...
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

In this case, we can see that the logistic regression model fit on the raw input data achieved a mean classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the feature extraction based approach to data preparation.

In this section, we can explore whether we can improve performance using the feature extraction approach to data preparation.

The first step is to select a suite of common and commonly useful data preparation techniques.

In this case, given that the input variables are numeric, we will use a range of transforms to change the scale of the input variables, such as MinMaxScaler, StandardScaler, and RobustScaler, as well as transforms for changing the distribution of the input variables, such as QuantileTransformer and KBinsDiscretizer. Finally, we will also use transforms that remove linear dependencies between the input variables, such as PCA and TruncatedSVD.

The FeatureUnion class can be used to define a list of transforms to perform, the results of which will be aggregated together, i.e. unioned. This will create a new dataset that has a vast number of columns.

An estimate of the number of columns would be 13 input variables times five transforms or 65 plus the 14 columns output from the PCA and SVD dimensionality reduction methods, to give a total of about 79 features.

...
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
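We can sanity-check that estimate by applying the feature union on its own; a small sketch that should report 79 columns for this dataset:

...
# apply the feature union standalone to confirm the number of extracted features
X_extracted = fu.fit_transform(X, y)
print(X_extracted.shape)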

We can then create a modeling Pipeline with the FeatureUnion as the first step and the logistic regression model as the final step.

...
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

The pipeline can then be evaluated using repeated stratified k-fold cross-validation as before.

Tying this together, the complete example is listed below.

# data preparation as feature engineering for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

In this case, we can see a lift in performance over the baseline performance, achieving a mean classification accuracy of about 96.8 percent as compared to 95.3 percent in the previous section.

Accuracy: 0.968 (0.037)

Try adding more data preparation methods to the FeatureUnion to see if you can improve the performance.

**Can you get better results?**

Let me know what you discover in the comments below.

We can also use feature selection to reduce the approximately 80 extracted features down to a subset of those that are most relevant to the model. In addition to reducing the complexity of the model, it can also result in a lift in performance by removing irrelevant and redundant input features.

In this case, we will use the Recursive Feature Elimination, or RFE, technique for feature selection and configure it to select the 15 most relevant features.

...
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)

We can then add the RFE feature selection to the modeling pipeline after the *FeatureUnion* and before the *LogisticRegression* algorithm.

...
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

Tying this together, the complete example of the feature extraction data preparation method with feature selection is listed below.

# data preparation as feature engineering with feature selection for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Again, we can see a further lift in performance from 96.8 percent with all extracted features to about 98.9 percent with feature selection used prior to modeling.

Accuracy: 0.989 (0.022)
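As a starting point for experimenting, a minimal sketch that compares different numbers of selected features, continuing on from the example above (the candidate values are illustrative):

...
# sketch: compare different numbers of selected features
for n in [5, 10, 15, 20, 30]:
    rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=n)
    pipeline = Pipeline(steps=[('fu', fu), ('rfe', rfe), ('m', LogisticRegression(solver='liblinear'))])
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('n=%d, Accuracy: %.3f (%.3f)' % (n, mean(scores), std(scores)))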

**Can you achieve better performance with a different feature selection technique or with more or fewer selected features?**

Let me know what you discover in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to use feature extraction for data preparation with tabular data.

Specifically, you learned:

- Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
- How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Feature Extraction on Tabular Data for Machine Learning appeared first on Machine Learning Mastery.

The post How to Choose Data Preparation Methods for Machine Learning appeared first on Machine Learning Mastery.

Correct application of data preparation will transform raw data into a representation that allows learning algorithms to get the most out of the data and make skillful predictions. The problem is choosing a transform or sequence of transforms that results in a useful representation is very challenging. So much so that it may be considered more of an art than a science.

In this tutorial, you will discover strategies that you can use to select data preparation techniques for your predictive modeling datasets.

After completing this tutorial, you will know:

- Data preparation techniques can be chosen based on detailed knowledge of the dataset and algorithm and this is the most common approach.
- Data preparation techniques can be grid searched as just another hyperparameter in the modeling pipeline.
- Data transforms can be applied to a training dataset in parallel to create many extracted features on which feature selection can be applied and a model trained.

Let’s get started.

This tutorial is divided into four parts; they are:

- Strategies for Choosing Data Preparation Techniques
- Approach 1: Manually Specify Data Preparation
- Approach 2: Grid Search Data Preparation Methods
- Approach 3: Apply Data Preparation Methods in Parallel

The performance of a machine learning model is only as good as the data used to train it.

This puts a heavy burden on the data and the techniques used to prepare it for modeling.

Data preparation refers to the techniques used to transform raw data into a form that best meets the expectations or requirements of a machine learning algorithm.

It is a challenge because we cannot know a representation of the raw data that will result in good or best performance of a predictive model.

However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, Feature Engineering and Selection, 2019.

Instead, we must use controlled experiments to systematically evaluate data transforms on a model in order to discover what works well or best.

As such, on a predictive modeling project, there are three main strategies we may decide to use in order to select a data preparation technique or sequences of techniques for a dataset; they are:

- Manually specify the data preparation to use for a given algorithm based on the deep knowledge of the data and the chosen algorithm.
- Test a suite of different data transforms and sequences of transforms and discover what works well or best on the dataset for one or range of models.
- Apply a suite of data transforms on the data in parallel to create a large number of engineered features that can be reduced using feature selection and used to train models.

Let’s take a closer look at each of these approaches in turn.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

This approach involves studying the data and the requirements of specific algorithms and selecting data transforms that change your data to best meet the requirements.

Many practitioners see this as the only possible approach to selecting data preparation techniques as it is often the only approach taught or described in textbooks.

This approach might involve first selecting an algorithm and preparing data specifically for it, or testing a suite of algorithms and ensuring the data preparation methods are tailored to each algorithm.

This approach requires having detailed knowledge about your data. This can be achieved by reviewing summary statistics for each variable, plots of data distributions, and possibly even statistical tests to see if the data matches a known distribution.
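For example, assuming a dataset already loaded into a Pandas DataFrame called df with numeric columns, a minimal sketch of this kind of review (using D'Agostino's K^2 normality test from SciPy) might look like this:

# sketch: summary statistics, distribution plots, and a per-variable normality test
from scipy.stats import normaltest
from matplotlib import pyplot
print(df.describe())
df.hist()
pyplot.show()
for column in df.columns:
    stat, p = normaltest(df[column])
    # small p-values suggest the variable is unlikely to be Gaussian
    print('column %s: p=%.3f' % (column, p))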

This approach also requires detailed knowledge of the algorithms you will be using. This can be achieved by reviewing textbooks that describe the algorithms.

From a high level, the data requirements of most algorithms are well known.

For example, the following algorithms will probably be sensitive to the scale and distribution of your numerical input variables, as well as the presence of irrelevant and redundant variables:

- Linear Regression (and extensions)
- Logistic Regression
- Linear Discriminant Analysis
- Gaussian Naive Bayes
- Neural Networks
- Support Vector Machines
- k-Nearest Neighbors

The following algorithms will probably not be sensitive to the scale and distribution of your numerical input variables and are reasonably insensitive to irrelevant and redundant variables:

- Decision Tree
- AdaBoost
- Bagged Decision Trees
- Random Forest
- Gradient Boosting

The benefit of this approach is that it gives you some confidence that your data has been tailored to the expectations and requirements of specific algorithms. This may result in good or even great performance.

The downside is that it can be a slow process requiring a lot of analysis, expertise, and, potentially, research. It also may result in a false sense of confidence that good or best results have already been achieved and that no or little further improvement is possible.

For more on this approach to data preparation, see the tutorial:

This approach acknowledges that algorithms may have expectations and requirements, and does ensure that transforms of the dataset are created to meet those requirements, although it does not assume that meeting them will result in the best performance.

It leaves the door open to non-obvious and unintuitive solutions.

This might be a data transform that “*should not work*” or “*should not be appropriate for the algorithm*” yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed “*absolutely required*” yet results in good or great performance.

This can be achieved by designing a grid search of data preparation techniques and/or sequences of data preparation techniques in pipelines. This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.
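For example, a minimal sketch of grid searching a handful of scaling transforms for a single model (the transform list and model are illustrative, and X and y are assumed to be already loaded):

# sketch: compare a suite of scaling transforms for one model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
transforms = [('none', None), ('mms', MinMaxScaler()), ('ss', StandardScaler()), ('rs', RobustScaler())]
for name, transform in transforms:
    # build the pipeline, skipping the transform step for the 'none' case
    steps = [('t', transform)] if transform is not None else []
    steps.append(('m', LogisticRegression(solver='liblinear')))
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring='accuracy', cv=10, n_jobs=-1)
    print('%s: %.3f' % (name, scores.mean()))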

The result will be a large number of outcomes that will clearly indicate those data transforms, transform sequences, and/or transforms coupled with models that result in good or best performance on the dataset.

These could be used directly, although more likely would provide the basis for further investigation by tuning data transforms and model hyperparameters to get the most out of the methods, and ablative studies to confirm all elements of a modeling pipeline contribute to the skillful predictions.

I generally use this approach myself and advocate it to beginners or practitioners looking to achieve good results on a project quickly.

The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth the non-obvious and unintuitive solutions to practitioners without the need for deep expertise.

The downside is the need for some programming aptitude to implement the grid search and the added computational cost of evaluating many different data preparation techniques and pipelines.

For more on this approach to data preparation, see the tutorial:

Like the previous approach, this approach assumes that algorithms have expectations and requirements, and it also allows good solutions to be found that violate those expectations, although it goes one step further.

This approach also acknowledges that a model fit on multiple perspectives on the same data may be beneficial over a model that is fit on a single perspective of the data.

This is achieved by performing multiple data transforms on the raw dataset in parallel, then gathering the results from all transforms together into one large dataset with hundreds or even thousands of input features (i.e. the FeatureUnion class in scikit-learn can be used to achieve this). It allows for good input features found from different transforms to be used in parallel.

The number of input features may explode dramatically for each transform that is used. Therefore, it is good to combine this approach with a feature selection method to select a subset of the features that is most relevant to the target variable. Again, this may involve the application of one, two, or more different feature selection techniques to provide a larger than normal subset of useful features.

Alternatively, a dimensionality reduction technique (e.g. PCA) can be used on the generated features, or an algorithm that performs automatic feature selection (e.g. random forest) can be trained on the generated features directly.
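A minimal sketch of this pattern, with an illustrative pair of transforms and SelectKBest for the selection step:

# sketch: parallel transforms unioned, then feature selection, then a simple model
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
fu = FeatureUnion([('mms', MinMaxScaler()), ('ss', StandardScaler())])
pipeline = Pipeline(steps=[('fu', fu), ('sel', SelectKBest(score_func=f_classif, k=10)), ('m', LogisticRegression(solver='liblinear'))])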

I like to think of it as an explicit feature engineering approach where we generate all the features we can possibly think of from the raw data, unpacking distributions and relationships in the data. We then select a subset of the most relevant features and fit a model. Because we are explicitly using data transforms to unpack the complexity of the problem into parallel features, it may allow the use of a much simpler predictive model, such as a linear model with a strong penalty to help it ignore less useful features.

A variation on this approach would be to fit a different model on each transform of the raw dataset and use an ensemble model to combine the predictions from each of the models.
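A minimal sketch of that variation, with illustrative transforms and a soft-voting ensemble:

# sketch: one model per transform of the data, combined with a voting ensemble
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer
from sklearn.linear_model import LogisticRegression
members = list()
for name, transform in [('mms', MinMaxScaler()), ('ss', StandardScaler()), ('qt', QuantileTransformer(output_distribution='normal'))]:
    # one pipeline per transform, each with its own copy of the model
    members.append((name, Pipeline(steps=[('t', transform), ('m', LogisticRegression(solver='liblinear'))])))
ensemble = VotingClassifier(estimators=members, voting='soft')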

A benefit of this general approach is that it allows a model to harness multiple different perspectives or views on the same raw data, a feature that the other two approaches discussed above lack. This may allow extra predictive skill to be squeezed from the dataset.

A downside of this approach is the increased computational cost and the need for careful choice of the feature selection technique and/or model used to interpret such a large number of input features.

For more on this approach to data preparation, see the tutorial:

This section provides more resources on the topic if you are looking to go deeper.

- What Is Data Preparation in a Machine Learning Project
- How to Grid Search Data Preparation Techniques
- How to Use Feature Extraction on Tabular Data for Machine Learning

In this tutorial, you discovered strategies that you can use to select data preparation techniques for your predictive modeling dataset.

Specifically, you learned:

- Data preparation techniques can be chosen based on detailed knowledge of the dataset and algorithm and this is the most common approach.
- Data preparation techniques can be grid searched as just another hyperparameter in the modeling pipeline.
- Data transforms can be applied to a training dataset in parallel to create many extracted features on which feature selection can be applied and a model trained.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Choose Data Preparation Methods for Machine Learning appeared first on Machine Learning Mastery.

The post 8 Top Books on Data Cleaning and Feature Engineering appeared first on Machine Learning Mastery.

Data preparation is a challenging topic to discuss, as data differs in form, type, and structure from project to project.

Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “*data cleaning*,” “*data wrangling*,” “*data preprocessing*,” “*feature engineering*,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.

Even though it is a challenging topic to discuss, there are a number of books on the topic.

In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.

Let’s get started.

The focus here is on data preparation for tabular data, e.g. data in the form of a table with rows and columns as it looks in an Excel spreadsheet.

Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.

Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.

I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups; they are:

- Data Cleaning
- Data Wrangling
- Feature Engineering

I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Data cleaning refers to identifying and fixing errors in the data prior to modeling, including, but not limited to, outliers, missing values, and much more.

The top books on data cleaning include:

- Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, 2012.
- Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, 2012.
- Data Cleaning, 2019.

Let’s take a closer look at each in turn.

The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan Mccallum and was published in 2012.

Bad data is described not only as corrupt data but any data that impairs the modeling process.

It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.

— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.

It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.

The complete table of contents for the book is listed below.

- Chapter 01: Setting the Pace: What Is Bad Data?
- Chapter 02: Is It Just Me, or Does This Data Smell Funny?
- Chapter 03: Data Intended for Human Consumption, Not Machine Consumption
- Chapter 04: Bad Data Lurking in Plain Text
- Chapter 05: (Re)Organizing the Web’s Data
- Chapter 06: Detecting Liars and the Confused in Contradictory Online Reviews
- Chapter 07: Will the Bad Data Please Stand Up?
- Chapter 08: Blood, Sweat, and Urine
- Chapter 09: When Data and Reality Don’t Match
- Chapter 10: Subtle Sources of Bias and Error
- Chapter 11: Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
- Chapter 12: When Databases Attack: A Guide for When to Stick to Files
- Chapter 13: Crouching Table, Hidden Network
- Chapter 14: Myths of Cloud Computing
- Chapter 15: The Dark Side of Data Science
- Chapter 16: How to Feed and Care for Your Machine-Learning Expert
- Chapter 17: Data Traceability
- Chapter 18: Social Media: Erasable Ink?
- Chapter 19: Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

Learn More:

The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.

This is a more general textbook on data preparation for computational-based social sciences rather than machine learning specifically. Nevertheless, it contains a ton of useful advice.

My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).

— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.

The complete table of contents for the book is listed below.

- Chapter 01: Why Data Cleaning Is Important: Debunking the Myth of Robustness
- Chapter 02: Power and Planning for Data Collection: Debunking the Myth of Adequate Power
- Chapter 03: Being True to the Target Population: Debunking the Myth of Representativeness
- Chapter 04: Using Large Data Sets With Probability Sampling Frameworks: Debunking the Myth of Equality
- Chapter 05: Screening Your Data for Potential Problems: Debunking the Myth of Perfect Data
- Chapter 06: Dealing With Missing or Incomplete Data: Debunking the Myth of Emptiness
- Chapter 07: Extreme and Influential Data Points: Debunking the Myth of Equality
- Chapter 08: Improving the Normality of Variables Through Box-Cox Transformation: Debunking the Myth of Distributional Irrelevance
- Chapter 09: Does Reliability Matter? Debunking the Myth of Perfect Measurement
- Chapter 10: Random Responding, Motivated Misresponding, and Response Sets: Debunking the Myth of the Motivated Participant
- Chapter 11: Why Dichotomizing Continuous Variables Is Rarely a Good Practice: Debunking the Myth of Categorization
- Chapter 12: The Special Challenge of Cleaning Repeated Measures Data: Lots of Pits in Which to Fall
- Chapter 13: Now That the Myths Are Debunked…: Visions of Rational Quantitative Methodology for the 21st Century

I think this is a great reference guide for general data preparation techniques, perhaps better coverage than most “machine learning” focused books given the stronger statistical focus.

Learn More:

The book “Data Cleaning” was written by Ihab Ilyas and Xu Chu, and published in 2019.

As the name suggests, the book is focused on data cleaning techniques that fix errors in raw data prior to modeling.

Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, in this book, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views.

— Page xix, “Data Cleaning,” 2019.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: Outlier Detection
- Chapter 03: Data Deduplication
- Chapter 04: Data Transformation
- Chapter 05: Data Quality Rule Definition and Discovery
- Chapter 06: Rule-Based Data Cleaning
- Chapter 07: Machine Learning and Probabilistic Data Cleaning
- Chapter 08: Conclusion and Future Thoughts

It is more of a textbook than a practical book and is a good fit for academics and researchers looking for both a review of methods and references to the original research papers.

Data wrangling is a more general or colloquial term for data preparation that might include some data cleaning and feature engineering.
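
For example, a typical wrangling pass in pandas might look like the short sketch below; the file name and column names are hypothetical, for illustration only.

```python
# a minimal data wrangling sketch with pandas
# (the file and column names are hypothetical)
import pandas as pd

# load the raw, messy data
df = pd.read_csv('raw_records.csv')
# remove duplicate rows
df = df.drop_duplicates()
# coerce a mistyped column to numeric, turning bad values into NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')
# drop rows where the coercion failed
df = df.dropna(subset=['age'])
# save the cleaned data, ready for modeling
df.to_csv('clean_records.csv', index=False)
```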

The top books on data wrangling include:

- Data Wrangling with Python: Tips and Tools to Make Your Life Easier, 2016.
- Principles of Data Wrangling: Practical Techniques for Data Preparation, 2017.
- Data Wrangling with R, 2016.

Let’s take a closer look at each in turn.

The book “Data Wrangling with Python: Tips and Tools to Make Your Life Easier” was written by Jacqueline Kazil and Katharine Jarmul and was published in 2016.

The focus of this book is the tools and methods that help you get raw data into a form ready for modeling.

Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

This is a beginner’s book for those taking their first steps into Python for data preparation and modeling, e.g. current Excel users.

This book is for folks who want to explore data wrangling beyond desktop tools. If you are great at Excel and want to take your data analysis to the next level, this book will help!

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction to Python
- Chapter 02: Python Basics
- Chapter 03: Data Meant to Be Read by Machines
- Chapter 04: Working with Excel Files
- Chapter 05: PDFs and Problem Solving in Python
- Chapter 06: Acquiring and Storing Data
- Chapter 07: Data Cleanup: Investigation, Matching, and Formatting
- Chapter 08: Data Cleanup: Standardizing and Scripting
- Chapter 09: Data Exploration and Analysis
- Chapter 10: Presenting Your Data
- Chapter 11: Web Scraping: Acquiring and Storing Data from the Web
- Chapter 12: Advanced Web Scraping: Screen Scrapers and Spiders
- Chapter 13: APIs
- Chapter 14: Automation and Scaling
- Chapter 15: Conclusion

This is the book to get if you are just starting out with Python for data loading and organization.

The book “Principles of Data Wrangling: Practical Techniques for Data Preparation” was written by Tye Rattenbury, et al. and was published in 2017.

Data wrangling is used to describe all of the tasks related to getting data ready for modeling.

The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data.

— Page ix, “Principles of Data Wrangling: Practical Techniques for Data Preparation,” 2017.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: A Data Workflow Framework
- Chapter 03: The Dynamics of Data Wrangling
- Chapter 04: Profiling
- Chapter 05: Transformation: Structuring
- Chapter 06: Transformation: Enriching
- Chapter 07: Using Transformation to Clean Data
- Chapter 08: Roles and Responsibilities
- Chapter 09: Data Wrangling Tools

It’s a good book, but very high level. Perhaps it is better suited to the manager than the practitioner. For example, I don’t think I saw a single line of code.

The book “Data Wrangling with R” was written by Bradley Boehmke and was published in 2016.

As its name suggests, this book is focused on data preparation with R.

In this book, I will help you learn the essentials of preprocessing data leveraging the R programming language to easily and quickly turn noisy data into usable pieces of information.

— Page v, Data Wrangling with R, 2016.

This is a practical book. It has lots of small, focused chapters with code examples on specific problems you will encounter during data preparation. It’s a welcome change compared to many of the other high-level books in this round-up.

The complete table of contents for the book is listed below.

- Chapter 01: The Role of Data Wrangling
- Chapter 02: Introduction to R
- Chapter 03: The Basics
- Chapter 04: Dealing with Numbers
- Chapter 05: Dealing with Character Strings
- Chapter 06: Dealing with Regular Expressions
- Chapter 07: Dealing with Factors
- Chapter 08: Dealing with Dates
- Chapter 09: Data Structure Basics
- Chapter 10: Managing Vectors
- Chapter 11: Managing Lists
- Chapter 12: Managing Matrices
- Chapter 13: Managing Data Frames
- Chapter 14: Dealing with Missing Values
- Chapter 15: Importing Data
- Chapter 16: Scraping Data
- Chapter 17: Exporting Data
- Chapter 18: Functions
- Chapter 19: Loop Control Statements
- Chapter 20: Simplify Your Code with %>%
- Chapter 21: Reshaping Your Data with tidyr
- Chapter 22: Transforming Your Data with dplyr

I’m a fan of this book, and if you are using R, you need a copy. A downside is that there is a little too much coverage of R basics in this book. I would rather these chapters be left out and the reader directed to an introductory R book, even if that raises the requirements on the reader slightly.

Feature engineering refers to creating new input variables from raw data, although it also refers to data preparation more generally.
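
As a concrete illustration, the short sketch below derives new input variables from raw columns using pandas; the DataFrame and column names are hypothetical.

```python
# a minimal feature engineering sketch with pandas
# (the DataFrame and column names are hypothetical)
import numpy as np
import pandas as pd

# raw data: purchase dates and amounts
raw = pd.DataFrame({
    'purchase_date': ['2019-01-03', '2019-02-14', '2019-03-21'],
    'amount': [10.0, 250.0, 40.0],
})

# parse the raw string column into a datetime type
raw['purchase_date'] = pd.to_datetime(raw['purchase_date'])
# new input variable: day of the week of each purchase
raw['day_of_week'] = raw['purchase_date'].dt.dayofweek
# new input variable: log transform of the skewed amount column
raw['log_amount'] = np.log1p(raw['amount'])
print(raw)
```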

Top books on feature engineering include:

- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.
- Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 2018.

Let’s take a closer look at each in turn.

The book “Feature Engineering and Selection: A Practical Approach for Predictive Models” was written by Max Kuhn and Kjell Johnson and was published in 2019.

This book describes the general process of preparing raw data for modeling as feature engineering.

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.

— Page xi, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

The examples in the book are demonstrated using R, which is important, as the author Max Kuhn is also the creator of the popular caret package.

An important perspective taken in the book is that data preparation is not just about meeting the expectations of modeling algorithms; it is needed to best expose the underlying structure of the problem, and doing so involves iterative trial and error. This is the same perspective that I take in general, and it’s refreshing to see in a modern book.

… we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

The complete table of contents for the book is listed below.

- Chapter 1. Introduction
- Chapter 2. Illustrative Example: Predicting Risk of Ischemic Stroke
- Chapter 3. A Review of the Predictive Modeling Process
- Chapter 4. Exploratory Visualizations
- Chapter 5. Encoding Categorical Predictors
- Chapter 6. Engineering Numeric Predictors
- Chapter 7. Detecting Interaction Effects
- Chapter 8. Handling Missing Data
- Chapter 9. Working with Profile Data
- Chapter 10. Feature Selection Overview
- Chapter 11. Greedy Search Methods
- Chapter 12. Global Search Methods

I think this is a must-own book, even if R is not your primary language. The breadth of the methods discussed is worth the sticker price alone.

The book “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists” was written by Alice Zheng and Amanda Casari and was published in 2018.

I think this book has the most direct definitions up front of all of the books I looked at, describing a feature as a numeric input to a model and feature engineering as the process of extracting useful numeric features from the raw data. Very crisp!

A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.

— Page vii, “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists,” 2018.

The examples are in Python and focus on using NumPy and Pandas, and there are lots of worked examples, which are great. I think this is a good sister book or Python equivalent to the above *Data Wrangling with R* or *Feature Engineering and Selection*, although perhaps with less coverage.

The complete table of contents for the book is listed below.

- Chapter 1: Machine Learning Pipeline
- Chapter 2: Fancy Tricks with Simple Numbers
- Chapter 3: Text Data: Flattening, Filtering, and Chunking
- Chapter 4: The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
- Chapter 5: Categorical Variables: Counting Eggs in the Age of Robotic Chickens
- Chapter 6: Dimensionality Reduction: Squashing the Data Pancake with PCA
- Chapter 7: Nonlinear Featurization via K-Means Model Stacking
- Chapter 8: Automating the Featurizer: Image Feature Extraction and Deep Learning
- Chapter 9: Back to the Future: Building an Academic Paper Recommender
- Appendix A: Linear Modeling and Linear Algebra Basics

I like the book.

I guess I would prefer to drop the math and direct the reader to a textbook. I would also prefer the examples to focus on the machine learning modeling pipeline rather than standalone transforms. But I’m being picky and pushing hard for directly useful code on a given project.

You have to pick the book that is right for you, based on your needs, e.g. code or textbook, Python or R.

I own all of these books, but the two I recommend are:

- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.
- Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 2018.

The reason is that I like practical books, and I like having both the R and Python perspectives when I’m figuring out what to try.

A close follow-up would be:

- Data Wrangling with R, 2016.
- Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, 2012.

The first is super practical; the second is full of super helpful (yet super specific) advice.

For textbooks, needed by most researchers for their reviews of methods and references, I’d probably recommend:

- Data Cleaning, 2019.
- Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, 2012.

In this post, you discovered the top books on data cleaning, data preparation, feature engineering and related topics.

**Did I miss a good book on data preparation?**

Let me know in the comments below.

**Have you read any of the books listed?**

Let me know what you think of them in the comments.

The post 8 Top Books on Data Cleaning and Feature Engineering appeared first on Machine Learning Mastery.
