Archive | Data Preparation

Histogram of Skewed Gaussian Data After Power Transform

How to Use Power Transforms for Machine Learning

By Jason Brownlee on August 28, 2020 in Data Preparation

Machine learning algorithms like Linear Regression and Gaussian Naive Bayes assume the numerical variables have a Gaussian probability distribution. Your data may not have a Gaussian distribution and instead may have a Gaussian-like distribution (e.g. nearly Gaussian but with outliers or a skew) or a totally different distribution (e.g. exponential). As such, you may be […]

Box and Whisker Plot of Statistical Imputation Strategies Applied to the Horse Colic Dataset

Statistical Imputation for Missing Values in Machine Learning

By Jason Brownlee on August 18, 2020 in Data Preparation

Datasets may have missing values, and this can cause problems for many machine learning algorithms. As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short. A popular approach for data […]

Box Plot of LDA Number of Components vs. Classification Accuracy

Linear Discriminant Analysis for Dimensionality Reduction in Python

By Jason Brownlee on August 18, 2020 in Data Preparation

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also […]

Singular Value Decomposition for Dimensionality Reduction in Python

By Jason Brownlee on August 18, 2020 in Data Preparation

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. One of the more popular techniques for dimensionality reduction in machine learning is Singular Value Decomposition, or SVD for […]

Box Plot of PCA Number of Components vs. Classification Accuracy

Principal Component Analysis for Dimensionality Reduction in Python

By Jason Brownlee on August 18, 2020 in Data Preparation

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for […]

Introduction to Dimensionality Reduction for Machine Learning

By Jason Brownlee on June 30, 2020 in Data Preparation

The number of input variables or features for a dataset is referred to as its dimensionality. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. A larger number of input features often makes a predictive modeling task more challenging, a problem generally referred to as the curse of dimensionality. High-dimensionality statistics […]

Bar Chart of XGBClassifier Feature Importance Scores

How to Calculate Feature Importance With Python

By Jason Brownlee on August 20, 2020 in Data Preparation

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable. There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores. Feature importance […]

Line Plot of Variance Threshold (X) Versus Number of Selected Features (Y)

How to Perform Data Cleaning for Machine Learning with Python

By Jason Brownlee on June 30, 2020 in Data Preparation

Data cleaning is a critically important step in any machine learning project. In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. Before jumping to the sophisticated methods, there are some very basic […]

How to Use the ColumnTransformer for Data Preparation

By Jason Brownlee on December 31, 2020 in Data Preparation

You must prepare your raw data using data transforms prior to fitting a machine learning model. This is required to ensure that you best expose the structure of your predictive modeling problem to the learning algorithms. Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. […]

How to Transform Target Variables for Regression in Python

By Jason Brownlee on October 1, 2020 in Data Preparation

Data preparation is a big part of applied machine learning. Correctly preparing your training data can mean the difference between mediocre and extraordinary results, even with very simple linear algorithms. Performing data preparation operations, such as scaling, is relatively straightforward for input variables and has been made routine in Python via the Pipeline scikit-learn class. […]
