Archive | Data Preparation

Bar Chart of the Input Features (x) vs. the Mutual Information Feature Importance (y)

How to Perform Feature Selection With Numerical Input Data

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. Feature selection is often straightforward when working with real-valued input and output data, such as using the Pearson’s correlation coefficient, but can be challenging when working with numerical input data and a categorical […]

Continue Reading 6
How to Use Polynomial Features Transforms for Machine Learning

How to Use Polynomial Feature Transforms for Machine Learning

Often, the input features for a predictive modeling task interact in unexpected and often nonlinear ways. These interactions can be identified and modeled by a learning algorithm. Another approach is to engineer new features that expose these interactions and see if they improve model performance. Additionally, transforms like raising input variables to a power can […]

Continue Reading 8
Histogram Plots of Robust Scaler Transformed Input Variables for the Sonar Dataset

How to Scale Data With Outliers for Machine Learning

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors. Standardizing is a popular scaling technique that subtracts the mean from values and divides by the […]

Continue Reading 11
Histogram of Data With a Gaussian Distribution

How to Use Discretization Transforms for Machine Learning

Numerical input variables may have a highly skewed or non-standard distribution. This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more. Many machine learning algorithms prefer or perform better when numerical input variables have a standard probability distribution. The discretization transform provides an automatic way to change a numeric […]

Continue Reading 10
Histogram of Skewed Gaussian Data After Quantile Transform

How to Use Quantile Transforms for Machine Learning

Numerical input variables may have a highly skewed or non-standard distribution. This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more. Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a standard probability distribution, such as […]

Continue Reading 10
Histogram of Skewed Gaussian Data After Power Transform

How to Use Power Transforms for Machine Learning

Machine learning algorithms like Linear Regression and Gaussian Naive Bayes assume the numerical variables have a Gaussian probability distribution. Your data may not have a Gaussian distribution and instead may have a Gaussian-like distribution (e.g. nearly Gaussian but with outliers or a skew) or a totally different distribution (e.g. exponential). As such, you may be […]

Continue Reading 15