Archive | Data Preparation

How to Choose a Feature Selection Method For Machine Learning

By Jason Brownlee on August 20, 2020 in Data Preparation 284

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Statistical-based feature selection methods involve evaluating the relationship between […]

How to Perform Feature Selection with Categorical Data

By Jason Brownlee on August 18, 2020 in Data Preparation 113

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. Feature selection is often straightforward when working with real-valued data, for example by using Pearson’s correlation coefficient, but can be challenging when working with categorical data. The two most commonly used feature selection […]

How to Save and Reuse Data Preparation Objects in Scikit-Learn

By Jason Brownlee on June 30, 2020 in Data Preparation 48

It is critical that any data preparation performed on a training dataset is also performed on a new dataset in the future. This may include a test dataset when evaluating a model or new data from the domain when using a model to make predictions. Typically, the model fit on the training dataset is saved […]

How to Remove Outliers for Machine Learning

By Jason Brownlee on August 18, 2020 in Data Preparation 117

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem. Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers and often machine learning modeling and model skill in general can […]

How to Get the Most From Your Machine Learning Data

By Jason Brownlee on June 30, 2020 in Data Preparation 3

The data that you use, and how you use it, will likely define the success of your predictive modeling problem. Data and the framing of your problem may be the point of biggest leverage on your project. Choosing the wrong data or the wrong framing for your problem may lead to a model with poor […]

Why One-Hot Encode Data in Machine Learning?

By Jason Brownlee on June 30, 2020 in Data Preparation 272

Getting started in applied machine learning can be difficult, especially when working with real-world data. Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model. One good example is to use a one-hot encoding on categorical data. Why is a one-hot encoding required? […]

How to Handle Missing Data with Python

By Jason Brownlee on November 28, 2023 in Data Preparation 141

Real-world data often has missing values. Data can have missing values due to unrecorded observations, incorrect or inconsistent data entry, and more. Many machine learning algorithms do not support data with missing values. So handling missing data is important for accurate data analysis and building robust models. In this tutorial, you will learn how to […]

Data Leakage in Machine Learning

By Jason Brownlee on August 15, 2020 in Data Preparation 98

Data leakage is a big problem in machine learning when developing predictive models. Data leakage is when information from outside the training dataset is used to create the model. In this post you will discover the problem of data leakage in predictive modeling. After reading this post you will know: What data leakage is […]

An Introduction to Feature Selection

By Jason Brownlee on June 29, 2021 in Data Preparation 224

Which features should you use to create a predictive model? This is a difficult question that may require deep knowledge of the problem domain. It is possible to automatically select those features in your data that are most useful or most relevant for the problem you are working on. This is a process called feature […]

Discover Feature Engineering, How to Engineer Features and How to Get Good at It

By Jason Brownlee on August 15, 2020 in Data Preparation 140

Feature engineering is an informal topic, but one that is widely agreed to be key to success in applied machine learning. In creating this guide I went wide and deep and synthesized all of the material I could. You will discover what feature engineering is, what problem it solves, why it matters, how […]
