Archive | Data Preparation

Box Plot of LDA Number of Components vs. Classification Accuracy

Linear Discriminant Analysis for Dimensionality Reduction in Python

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It can also […]


Singular Value Decomposition for Dimensionality Reduction in Python

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Perhaps one of the more popular techniques for dimensionality reduction in machine learning is Singular Value Decomposition, or SVD for […]

Box Plot of PCA Number of Components vs. Classification Accuracy

Principal Component Analysis for Dimensionality Reduction in Python

Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. Fewer input variables can result in a simpler predictive model that may have better performance when making predictions on new data. Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for […]


Introduction to Dimensionality Reduction for Machine Learning

The number of input variables or features for a dataset is referred to as its dimensionality. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. More input features often make a predictive modeling task more challenging, a problem generally referred to as the curse of dimensionality. High-dimensionality statistics […]

Bar Chart of XGBClassifier Feature Importance Scores

How to Calculate Feature Importance With Python

Feature importance refers to techniques that assign a score to each input feature based on how useful it is at predicting a target variable. There are many types and sources of feature importance scores; popular examples include statistical correlation scores, coefficients calculated as part of linear models, scores derived from decision trees, and permutation importance scores. Feature importance […]
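
As a rough illustration of the first kind of score mentioned above, the sketch below ranks features by the absolute value of their Pearson correlation with the target. It uses only the Python standard library; the helper names (`pearson`, `correlation_scores`) and the toy data are hypothetical and are not taken from the linked post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def correlation_scores(columns, target):
    """Score each named feature column by |Pearson r| against the target."""
    return {name: abs(pearson(values, target)) for name, values in columns.items()}


# Hypothetical toy data: "informative" tracks the target, "noise" does not.
features = {
    "informative": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

scores = correlation_scores(features, target)
# Feature names ordered from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A score like this only captures linear relationships with a numeric target, which is one reason model-based and permutation-based scores are also used.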


How to Transform Target Variables for Regression in Python

Data preparation is a big part of applied machine learning. Correctly preparing your training data can mean the difference between mediocre and extraordinary results, even with very simple linear algorithms. Performing data preparation operations, such as scaling, is relatively straightforward for input variables and has been made routine in Python via the Pipeline scikit-learn class. […]
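
The excerpt is cut off before it reaches target variables, so the following is only a sketch of the general idea the title points to: transform the target before fitting, then invert the transform on the predictions. It is written in plain Python instead of scikit-learn, and the helper names (`fit_line`, `fit_with_transformed_target`), the log transform, and the toy data are all assumptions chosen for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, mean_y - slope * mean_x


def fit_with_transformed_target(xs, ys, transform, inverse):
    """Fit on transform(y) and return a predictor that maps back to y's scale."""
    slope, intercept = fit_line(xs, [transform(y) for y in ys])

    def predict(x):
        # Predict in the transformed space, then undo the transform.
        return inverse(slope * x + intercept)

    return predict


# Hypothetical data: the target grows exponentially with the input,
# so a straight line fits log(y) far better than it fits y.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]

predict = fit_with_transformed_target(xs, ys, math.log, math.exp)
prediction = predict(4.0)
```

The key design point is that the transform and its inverse travel together with the model, so callers always receive predictions on the original scale of the target.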


How to Choose a Feature Selection Method For Machine Learning

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Statistical-based feature selection methods involve evaluating the relationship between […]

Bar Chart of the Input Features (x) vs The Chi Squared Feature Importance (y)

How to Perform Feature Selection with Categorical Data

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. Feature selection is often straightforward when working with real-valued data, for example by using Pearson’s correlation coefficient, but can be challenging when working with categorical data. The two most commonly used feature selection […]
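
The excerpt breaks off before naming the methods, so the choice below is an assumption: it uses the chi-squared statistic, which the figure caption for this post mentions, to score a categorical feature against a categorical target. The function name `chi_squared` and the toy data are hypothetical, and the sketch uses only the Python standard library.

```python
from collections import Counter


def chi_squared(feature, target):
    """Chi-squared statistic of independence between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # contingency table cell counts
    feature_totals = Counter(feature)         # row totals
    target_totals = Counter(target)           # column totals
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count expected in this cell if feature and target were independent.
            expected = f_count * t_count / n
            stat += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: "related" determines the target, "unrelated" does not.
target = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
related = ["a", "a", "a", "a", "b", "b", "b", "b"]
unrelated = ["x", "y", "x", "y", "x", "y", "x", "y"]

related_score = chi_squared(related, target)
unrelated_score = chi_squared(unrelated, target)
```

A larger statistic suggests a stronger dependence between the feature and the target, so features can be ranked by this score and the top ones kept.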
