Last Updated on August 7, 2019
You cannot go straight from raw text to fitting a machine learning or deep learning model.
You must clean your text first, which means splitting it into words and handling punctuation and case.
In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.
In this tutorial, you will discover how you can clean and prepare your text ready for modeling with machine learning.
After completing this tutorial, you will know:
- How to get started by developing your own very simple text cleaning tools.
- How to take a step up and use the more sophisticated methods in the NLTK library.
- How to prepare text when using modern text representation methods like word embeddings.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Update Nov/2017: Fixed a code typo in the ‘split into words’ section, thanks David Comfort.

How to Clean Text for Machine Learning with Python
Photo by Bureau of Land Management, some rights reserved.
Tutorial Overview
This tutorial is divided into 6 parts; they are:
- Metamorphosis by Franz Kafka
- Text Cleaning is Task Specific
- Manual Tokenization
- Tokenization and Cleaning with NLTK
- Additional Text Cleaning Considerations
- Tips for Cleaning Text for Word Embedding
Need help with Deep Learning for Text Data?
Take my free 7-day email crash course now (with code).
Click to sign-up and also get a free PDF Ebook version of the course.
Metamorphosis by Franz Kafka
Let’s start off by selecting a dataset.
In this tutorial, we will use the text from the book Metamorphosis by Franz Kafka. No specific reason, other than it’s short, I like it, and you may like it too. I expect it’s one of those classics that most students have to read in school.
The full text for Metamorphosis is available for free from Project Gutenberg.
You can download the ASCII text version of the text here:
- Metamorphosis by Franz Kafka Plain Text UTF-8 (may need to load the page twice).
Download the file and place it in your current working directory with the file name “metamorphosis.txt“.
The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file and delete the header and footer information and save the file as “metamorphosis_clean.txt“.
The start of the clean file should look like:
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
The file should end with:
And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.
Poor Gregor…
Text Cleaning Is Task Specific
After actually getting a hold of your text data, the first step in cleaning up text data is to have a strong idea about what you’re trying to achieve, and in that context review your text to see what exactly might help.
Take a moment to look at the text. What do you notice?
Here’s what I see:
- It’s plain text so there is no markup to parse (yay!).
- The translation of the original German uses UK English (e.g. “travelling“).
- The lines are artificially wrapped with new lines at about 70 characters (meh).
- There are no obvious typos or spelling mistakes.
- There’s punctuation like commas, apostrophes, quotes, question marks, and more.
- There’s hyphenated descriptions like “armour-like”.
- There's a lot of use of the em dash ("—") to continue sentences (maybe replace with commas?).
- There are names (e.g. "Mr. Samsa").
- There do not appear to be numbers that require handling (e.g. 1999).
- There are section markers (e.g. “II” and “III”), and we have removed the first “I”.
I’m sure there is a lot more going on to the trained eye.
We are going to look at general text cleaning steps in this tutorial.
Nevertheless, consider some possible objectives we may have when working with this text document.
For example:
- If we were interested in developing a Kafkaesque language model, we may want to keep all of the case, quotes, and other punctuation in place.
- If we were interested in classifying documents as “Kafka” and “Not Kafka,” maybe we would want to strip case, punctuation, and even trim words back to their stem.
Use your task as the lens by which to choose how to ready your text data.
Manual Tokenization
Text cleaning is hard, but the text we have chosen to work with is pretty clean already.
We could just write some Python code to clean it up manually, and this is a good exercise for those simple problems that you encounter. Tools like regular expressions and splitting strings can get you a long way.
1. Load Data
Let’s load the text data so that we can work with it.
The text is small and will load quickly and easily fit into memory. This will not always be the case and you may need to write code to memory map the file. Tools like NLTK (covered in the next section) will make working with large files much easier.
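If you do hit that situation, processing the file line by line is often enough. Below is a minimal sketch of that idea; the process_line() helper is a hypothetical placeholder for whatever per-line cleaning you choose.

# process a large text file one line at a time instead of loading it all into memory
def process_line(line):
    # placeholder cleaning step: split the line into whitespace-delimited tokens
    return line.split()

filename = 'metamorphosis_clean.txt'
with open(filename, 'rt') as handle:
    for line in handle:
        tokens = process_line(line)
        # work with the tokens for this line, e.g. append them to a list or write them out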
We can load the entire “metamorphosis_clean.txt” into memory as follows:
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
Running the example loads the whole file into memory ready to work with.
2. Split by Whitespace
Clean text often means a list of words or tokens that we can work with in our machine learning models.
This means converting the raw text into a list of words and saving it again.
A very simple way to do this would be to split the document by white space, including ” “, new lines, tabs and more. We can do this in Python with the split() function on the loaded string.
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
print(words[:100])
Running the example splits the document into a long list of words and prints the first 100 for us to review.
We can see that punctuation is preserved (e.g. “wasn’t” and “armour-like“), which is nice. We can also see that end of sentence punctuation is kept with the last word (e.g. “thought.”), which is not great.
['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human'] |
3. Select Words
Another approach might be to use the regex module (re) and split the document into words by selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and '_').
For example:
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])
Again, running the example, we can see that we get our list of words. This time, we can see that "armour-like" is now two words, "armour" and "like" (fine), but a contraction like "What's" is also two words, "What" and "s" (not great).
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room'] |
4. Split by Whitespace and Remove Punctuation
Note: This example was written for Python 3.
We may want the words, but without the punctuation like commas and quotes. We also want to keep contractions together.
One way would be to split the document into words by white space (as in “2. Split by Whitespace“), then use string translation to replace all punctuation with nothing (e.g. remove it).
Python provides a constant called string.punctuation that provides a great list of punctuation characters. For example:
import string
print(string.punctuation)
Results in:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Python offers a function called translate() that will map one set of characters to another.
We can use the function maketrans() to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example:
table = str.maketrans('', '', string.punctuation)
We can put all of this together, load the text file, split it into words by white space, then translate each word to remove the punctuation.
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])
We can see that this has had the desired effect, mostly.
Contractions like “What’s” have become “Whats” but “armour-like” has become “armourlike“.
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human'] |
If you know anything about regex, then you know things can get complex from here.
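As a taste, instead of translating word by word, a single regular expression can strip punctuation from the whole document in one pass. A quick sketch, assuming text has already been loaded as above; like the translate() approach, this glues contractions and hyphenated words together:

# remove punctuation with a regular expression applied to the whole text
import re
# keep word characters and whitespace, drop everything else
cleaned = re.sub(r'[^\w\s]', '', text)
words = cleaned.split()
print(words[:100])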
5. Normalizing Case
It is common to convert all words to one case.
This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. “Apple” the company vs “apple” the fruit is a commonly used example).
We can convert all words to lowercase by calling the lower() function on each word.
For example:
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])
Running the example, we can see that all words are now lowercase.
['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human'] |
Note
Cleaning text is really hard, problem specific, and full of tradeoffs.
Remember, simple is better.
Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.
Next, we’ll look at some of the tools in the NLTK library that offer more than simple string splitting.
Tokenization and Cleaning with NLTK
The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text.
It provides good tools for loading and cleaning text that we can use to get our data ready for working with machine learning and deep learning algorithms.
1. Install NLTK
You can install NLTK using your favorite package manager, such as pip:
sudo pip install -U nltk
After installation, you will need to install the data used with the library, including a great set of documents that you can use later for testing other tools in NLTK.
There are a few ways to do this, such as from within a script:
import nltk
nltk.download()
Or from the command line:
python -m nltk.downloader all
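Downloading everything can take a while. If you prefer, you can fetch only the resources this tutorial uses, namely the punkt tokenizer models and the stopwords corpus (add wordnet if you later try lemmatization):

import nltk
nltk.download('punkt')
nltk.download('stopwords')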
For more help installing and setting up NLTK, see:
2. Split into Sentences
A useful first step is to split the text into sentences.
Some modeling tasks prefer input to be in the form of paragraphs or sentences, such as word2vec. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.
NLTK provides the sent_tokenize() function to split text into sentences.
The example below loads the “metamorphosis_clean.txt” file into memory, splits it into sentences, and prints the first sentence.
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])
Running the example, we can see that although the document is split into sentences, each sentence still preserves the newlines from the artificial wrapping of the lines in the original document.
One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
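Building on this, here is a rough sketch of the save-one-sentence-per-line idea mentioned earlier, assuming text has already been loaded as above: tokenize each sentence into words, rejoin them with single spaces (which also removes the artificial line wraps), and write the result to a new file (the output filename is just an example):

# split into sentences and save one cleaned sentence per line
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
sentences = sent_tokenize(text)
lines = [' '.join(word_tokenize(sentence)) for sentence in sentences]
with open('metamorphosis_sentences.txt', 'w') as out_file:
    out_file.write('\n'.join(lines))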
3. Split into Words
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).
It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens. Contractions are split apart (e.g. “What’s” becomes “What” “‘s“). Quotes are kept, and so on.
For example:
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])
Running the code, we can see that punctuation marks are now tokens that we could then decide to specifically filter out.
['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to'] |
4. Filter Out Punctuation
We can filter out all tokens that we are not interested in, such as all standalone punctuation.
This can be done by iterating over all tokens and only keeping those tokens that are all alphabetic. Python has the function isalpha() that can be used. For example:
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
Running the example, you can see that not only the punctuation tokens but also examples like "armour-like" and "'s" were filtered out.
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room'] |
5. Filter out Stop Words (and Pipeline)
Stop words are those words that do not contribute to the deeper meaning of the phrase.
They are the most common words such as: “the“, “a“, and “is“.
For some applications, like document classification, it may make sense to remove stop words.
NLTK provides a list of commonly agreed upon stop words for a variety of languages, such as English. They can be loaded as follows:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
You can see the full list as follows:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'] |
You can see that they are all lower case and have punctuation removed.
You could compare your tokens to the stop words and filter them out, but you must ensure that your text is prepared the same way.
Let’s demonstrate this with a small pipeline of text preparation including:
- Load the raw text.
- Split into tokens.
- Convert to lowercase.
- Remove punctuation from each token.
- Filter out remaining tokens that are not alphabetic.
- Filter out tokens that are stop words.
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])
Running this example, we can see that in addition to all of the other transforms, stop words like “a” and “to” have been removed.
I note that we are still left with tokens like “nt“. The rabbit hole is deep; there’s always more we can do.
['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer'] |
6. Stem Words
Stemming refers to the process of reducing each word to its root or base.
For example “fishing,” “fished,” “fisher” all reduce to the stem “fish.”
Some applications, like document classification, may benefit from stemming in order to both reduce the vocabulary and to focus on the sense or sentiment of a document rather than deeper meaning.
There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the PorterStemmer class.
For example:
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])
Running the example, you can see that words have been reduced to their stems; for example, "trouble" has become "troubl". You can also see that the stemming implementation has reduced the tokens to lowercase, likely for internal look-ups in word tables.
['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s", 'happen', 'to' |
There is a nice suite of stemming and lemmatization algorithms to choose from in NLTK, if reducing words to their root is something you need for your project.
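For example, NLTK's WordNetLemmatizer reduces words to dictionary forms rather than crude stems. A small sketch (it needs the wordnet data, and the optional part-of-speech hint improves the result):

# lemmatize a few example words with WordNet (requires: nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('troubled', pos='v'))  # trouble
print(lemmatizer.lemmatize('dreams'))             # dream
print(lemmatizer.lemmatize('waved', pos='v'))     # wave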
Additional Text Cleaning Considerations
We are only getting started.
Because the source text for this tutorial was reasonably clean to begin with, we skipped many concerns of text cleaning that you may need to deal with in your own project.
Here is a short list of additional considerations when cleaning text:
- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding Unicode characters into a normalized form, such as UTF8 (see the sketch after this list).
- Handling of domain specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
- …
The list could go on.
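As a small illustration of the Unicode point above, one common trick is to normalize the text and strip accented characters down to their closest ASCII form. The sketch below shows one lossy approach, not a universal recipe; characters with no ASCII equivalent are simply dropped.

# normalize unicode text and strip accents to plain ASCII (lossy)
import unicodedata
sample = 'Gregor’s café — “armour-like”'
normalized = unicodedata.normalize('NFKD', sample)
ascii_only = normalized.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)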
Hopefully, you can see that getting truly clean text is impossible; we are really just doing the best we can with the time, resources, and knowledge we have.
The idea of “clean” is really defined by the specific task or concern of your project.
A pro tip is to continually review your tokens after every transform. I have tried to show that in this tutorial and I hope you take that to heart.
Ideally, you would save a new file after each transform so that you can spend time with all of the data in the new form. Things always jump out at you when you take the time to review your data.
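For example, a tiny helper like the hypothetical save_tokens() below writes the current tokens to a file, one per line, so you can open and review them after each step:

# save the current list of tokens to a file, one token per line, for review
def save_tokens(tokens, filename):
    with open(filename, 'w') as handle:
        handle.write('\n'.join(tokens))

# e.g. save_tokens(words, 'tokens_no_stopwords.txt') after filtering stop words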
Have you done some text cleaning before? What is your preferred pipeline of transforms?
Let me know in the comments below.
Tips for Cleaning Text for Word Embedding
Recently, the field of natural language processing has been moving away from bag-of-words models and word encoding toward word embeddings.
The benefit of word embeddings is that they encode each word into a dense vector that captures something about its relative meaning within the training text.
This means that variations of words like case, spelling, punctuation, and so on will automatically be learned to be similar in the embedding space. In turn, this can mean that the amount of cleaning required from your text may be less and perhaps quite different to classical text cleaning.
For example, it may no longer make sense to stem words or remove punctuation for contractions.
Tomas Mikolov is one of the developers of word2vec, a popular word embedding method. He suggests only very minimal text cleaning is required when learning a word embedding model.
Below is his response when pressed with the question about how to best prepare text data for word2vec.
There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with some single token such as .
All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not be true when you lowercase certain words, ie. ‘Bush’ is different than ‘bush’, while ‘Another’ has usually the same sense as ‘another’). The smaller the vocabulary is, the lower is the memory complexity, and the more robustly are the parameters for the words estimated. You also have to pre-process the test data in the same way.
…
In short, you will understand all this much better if you will run experiments.
Read the full thread on Google Groups.
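Drawing on those suggestions, here is a minimal sketch of embedding-oriented cleaning, assuming text has been loaded as before: lowercase the text, put spaces around punctuation so it becomes separate tokens rather than being deleted, and collapse digit runs into a single placeholder (the '<num>' token name is just an example):

# minimal cleaning often used before training word embeddings
import re
text = text.lower()
# separate punctuation from words instead of removing it
text = re.sub(r'([^\w\s])', r' \1 ', text)
# replace runs of digits with a single placeholder token
text = re.sub(r'\d+', ' <num> ', text)
tokens = text.split()
print(tokens[:100])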
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Metamorphosis by Franz Kafka on Project Gutenberg
- nltk.tokenize package API
- nltk.stem package API
- Chapter 3: Processing Raw Text, Natural Language Processing with Python
Summary
In this tutorial, you discovered how to clean text for machine learning in Python.
Specifically, you learned:
- How to get started by developing your own very simple text cleaning tools.
- How to take a step up and use the more sophisticated methods in the NLTK library.
- How to prepare text when using modern text representation methods like word embeddings.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Do you have experience with cleaning text?
Please share your experiences in the comments below.
Thank you, Jason. Very interesting work.
I’m glad it helps.
runfile(‘C:/Users/barnabas/.spyder-py3/temp.py’, wdir=’C:/Users/barnabas/.spyder-py3′)
theano: 1.0.3
tensorflow: 2.0.0-alpha0
keras: 2.2.4
Using TensorFlow backend.
it helps
Well done!
thank you
hi Jason it would be helpful if u try to clear my doubt
If Categorical or numerical data is missing we can create dummy variables with mean or mode
how to clean text data column, suppose any few data is missing or NaN
Hi Karthik…You may find the following resources beneficial:
https://machinelearningmastery.mystagingwebsite.com/handle-missing-data-python/
https://machinelearningmastery.mystagingwebsite.com/knn-imputation-for-missing-values-in-machine-learning/
Lemmatization is also something useful in NLTK. I recommend the course “Applied Text Mining in Python” from Coursera. Anyway, this is a good intro, thanks for it Jason
Thanks for the note Marc.
Excellent Jason thanks a lot
Thanks.
Complete list. I also face trouble with cleaning unicodes etc at times.
Sometimes the problem is with extract keywords from sentences. Other times it is with replacing keywords with standardised names in text. I wrote a very fast library for that called FlashText. It’s way faster than compiled regex. Here is a link to the article: https://medium.com/@vi3k6i5/search-millions-of-documents-for-thousands-of-keywords-in-a-flash-b39e5d1e126a
And thanks again for this.
Thanks for the link.
Thank you very much. Visually, clear it is also very useful.
Thanks
Thanks!
Very clear and comprehensive!
For certain applications like slot tagging tokenizing on punctuation but keeping the punctuation as tokens can be useful too.
Another common thing to do is to trim the resulting vocabulary by just taking the top K words or removing words with low document frequency. And finally dealing with numbers! You can convert all numbers to the same token, have one token for each number of digits or drop them altogether.
Like you said, this is all very application specific so it takes a fair amount of experimentation. Thanks again!
Great tips Carlos, cheers!
It helped me a lot especially that it not many who write in very good teaching way like yourself.
I have question, if I want to make bag of words of text I scraped from many websites,
I made pre processing the data and now I can get all words I need in one file but how to make the vector and assign the numbers to each web site?
Good question, this post can show you how to encode your text:
https://machinelearningmastery.mystagingwebsite.com/prepare-text-data-machine-learning-scikit-learn/
This is really great, thanks for this
Thanks, I’m glad it helps.
Jason Brownlee do you have any tutor how can I save vector into desk, every time I parse my document I have to process .
Good question, you could use pickle or save the numpy arrays directly.
Jason, it looks like you have a typo in the lines
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(tokens[:100])
It should be:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
Thanks David, fixed!
Thanks Jason for this great article,
I’d be happy to know how I can remove quoted texts from a sample since these quoted texts are not originally written by my students.
Do you also have any suggestions for removing Tables and other graphical representations?
I’ll appreciate it,
Perhaps some custom regex targeted to those examples?
Hi Jason,
Very interesting work indeed. I was wondering if there is maybe some work you (or anyone read this) can refer me to that places punctuation in unpunctuated text.
You could learn that via a deep LSTM.
Hi Jason,
I have been researching deep LSTM and Matlab, and I haven’t found much useful papers/articles on punctuation insertion. Do you maybe have some papers/exercises you recommend on building a punctuation system?
No, I’d recommend starting building one directly. Start by defining a small dataset for training a model.
Could you maybe elaborate?
What I have now is the following:
– Ebooks, a lot. These are test and training data (dataset).
– Python script to remove all punctuation and capital letters. The punctuation marks with corresponding index number are
stored in a table. This table will be used to evaluate the punctuation of unpunctuated text. I will create a new table when
the unpunctuated text has been punctuated, and compare the two created tables.
You said that I have to build my own deep LSTM. How would I start with this?
I'm eager to help, but I don't have the capacity to develop this model for you or spec one.
I have found some code that might do the trick. Will try to get it to work.
Let me know how you go.
Thanks. Its really helpful
You’re welcome.
And also how to to word embedding like in sentance?Is there any example of codes.. Like I tried BERT..
This will hep with embeddings:
https://machinelearningmastery.mystagingwebsite.com/develop-word-embeddings-python-gensim/
Your website is extremely helpful in providing a launchpad of different ML skills. Thanks for that Jason, I have a question : when using NLTK it will eliminate the stopwords using their own corpus. If we were to do the same manually, we go about building the set of stopwords, have it in a pickle file and then eliminate them?
NLTK provides a pickle of them.
Good help Jason!
I want develop image captioning by using my mother tongue languages,but my language is resource scarce,how can you help me?
Perhaps start by finding images and their captions in your language that you can use as a training dataset?
Great article. Can you give me an example code of how to remove names from the corpus? Thanks!
Sorry, I do not have an example. Perhaps you can make a list of names and remove them from the text.
Thank you so much for great article.
now can i apply algorithm.
You’re welcome.
See here:
https://machinelearningmastery.mystagingwebsite.com/start-here/#nlp
How to treat with the shortcut words like bcz,u, thr etc in text mining?
It is tough.
Two brutal approaches are:
If you have a lot of data, model them. If you don’t remove them.
Thank you Jason.
I start to understand for cleaning text data.
Well done.
thank you jason for a good information…
but i have one question…can i preprocessing (tokenize,stopword,stemming) from a file (example CSV)?..because i have a thousand document and it is saved in excel (CSV)…
thank you
Sure.
You say “Stop words are those words that do not contribute to the deeper meaning of the phrase.” so (1) and (2) mean the same thing?
(1) I ordered a pizza FOR John
(2) I ordered a pizza WITH John
and these also mean the same:
(1) Mary will not take the job UNLESS it is in NY
(2) Mary will not take the job BECAUSE it is in NY
You know, too much “data-driven” and “machine learning” NLP is not good for you!!!!
Reading some LOGICAL semantics – that stuff that was worked on for centuries is lacking, is my diagnosis (or, “little knowledge is dangerous”)
Thanks.
Hey jason ,
Do you have any clue for creating meaningful sentence from tokenize words.
Please let me know ASAP .
Thanks in Advance.
What do you mean exactly?
thank you Jason for a good information
I’m glad it helped.
The tutorial is very helpful. Are there any online free reference pdf’s for word2vec. Please let me know. Thanks for publishing this.
You can learn more about word2vec here:
https://machinelearningmastery.mystagingwebsite.com/develop-word-embeddings-python-gensim/
hi Jason,
Link was very helpful. Thanks for sharing. I had a question, what is the best algorithm to find if certain keywords are present in the sentence? Meaning I know what are the keywords I am looking at, problem statement here is to know whether it is present or not.
Thanks,
Sunil
Perhaps a for-loop and check each word?
Hi Jason
Very nice tutorial. A couple of things that I have used furthermore are the pattern library’s spelling module as well as autocorrect. Furthermore depending on the problem statement you have, an NER filtering also can be applied (using spacy or other packages that are out there) ..
I had a question to get your input on Topic Modelling, was curious if you recommend more steps for cleaning.
One request I had was potentially a tutorial from you on unsupervised text for topic modelling (either for dimension reduction or for clustering using techniques like LDA etc) please 🙂
Nice tips!
Thanks for the suggestion.
i have a problem with the Stem Words part
i can’t find a way that usa and u.s.a will be recognized as the same word
You might need to remove punctuation first.
sir why we remove punctuations in text cleaning..what if we use punctuations ?
Good question.
It makes the text much simpler to model. E.g. we can focus on just the consequential words.
Does word embedding models like word2vec and gloVe deal with slang words that commonly occur in texts scraped from twitter and other messaging platforms? If not is there a way to easily make a lookup table of these slang words or is there some other method to deal with these words?
If they are in the source text and used in the same way.
Hi DR. Jason,
Thank you for this post it is very helpful.
I have a question, I am learning NLP on Machine Learning Mastery posts and I am trying to practice on binary classification and I have 116 negative class files and 4,396 positive class files. The doubt is, should I reduce the 4,396 positive class files to 116 in order to match the 116 negative class files? to equilibrate the number of negative class files with the number of positive class files? Or should It is not necessary to match the number of negative and positive class files?
I hope DR. Jason may help me on this if you can.
Best Regards
I have some suggestions here:
https://machinelearningmastery.mystagingwebsite.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Thank you DR. Jason I will study it.
Best Regards.
Hi Jason,
Thanks for the post. I feel like if we are preprocessing a large batch of text inputs, running each string in the batch of strings through this whole process could be time consuming, esp. in a production product. Is there a faster way to do all of these steps in terms of computational speed?
Thanks for the consideration.
Yes, once you have defined the vocab and the transforms, you can process new text in parallel. E.g. sentence-wise, paragraph-wise, document-wise, etc.
Jason,
Thanks for this post! It is very helpful.
I am looking to build a NLP network to group/connect scientific papers in a library that I have been compiling based on content similarity. I have achieved this already with relatively simple word parsers and Jaccard Similarity metrics. However, I think I could make my output a bit more accurate with the help of WordNet (VerbNet). That is, if these packages can handle “non-words” (i.e. industry-specific jargon) — which several of these papers contain (e.g. optogenetics, nanoparticle, etc.).
Can these packages handle “non-words” in a way that will continue to give these words the weight/context that is reflected in the original text?
Thanks!
Yes.
I have to eliminate similar texts from a Twitter dataset.. What to do?
Perhaps select a text similarity metric, then use it to find pairs of text that are similar and remove some.
Or check the literature to see how other people address the same problem.
Please remove that cock crotch picture. It is disturbing
thanks
Sorry. Done.
Hi Jason,
Could you please provide the text file “metamorphosis_clean.txt’.
You can create it from the raw text data.
What problem are you having exactly?
thank you so much…..
You’re welcome.
Hey Jason,nice post!
Can you tell me how to treat short cut words (like bcz,u,thr) in Python? What code to use?
Perhaps try using a pre-trained word embedding that includes them?
Perhaps remove them?
Perhaps develop your own embedding if you have enough data?
You are awesome!! I can rate 5 out of 5 for your explanation
Thanks.
hi such good tutorial, i have question i have text data one big row holding these patteren data,1 product 60 values or 70 values and 100 values. after the values there is 8 empty spaces, then there is integer and text data of 10 rows. how to pivot this rows into coulmns .
Perhaps load it into memory, transform it, then save in the new format?
Numpy can transpose using the .T attribute on the array.
HI Jason
it was very helpful, I have a question please. I have French text and when editing it, I have the text written correctly with the accents (you know, 'é' and 'ô' for instance), but when I import the text and make the first split based on space I obtain strange characters instead of these French letters, do you know what the problem is please?
cordially
walid
You might need to update clean text procedures in the post to correctly support unicode characters.
How will you treat text having short cut words (like bcz u thr etc…) in text mining?
how we can treat the above problem in R and python.
If you have enough data, you can learn how they relate to each other in the distributed representation.
If not, perhaps a manual mapping, or drop them.
use contractions library to fix slang words. or define your own dictionary
for sentimental analysis data cleaning was required??? i have a data set of different comments
how can i clean my data?
First think about the types of data cleaning that might be useful for your dataset. Then apply them.
If you’re unsure, perhaps test a few approaches, review the output perhaps even model the output and compare the results.
This course is incredible. I signed up for the 7 day course, and i am going to buy the book. I love this.
Thanks!
+1
Thanks.
The large ad for starting the course on ML that appears on every printed page is horrible advertising! I’ve referenced your posts many times on my twitter feed, but no more unless this is corrected! 🙁
What large ad are you referring to exactly?
For printing pages, perhaps this will help:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/how-can-i-print-a-tutorial-page-without-the-sign-up-form
how do we extract just the important keywords from a given paragrapf?
I think this is a very hard problem.
How do you define important?
Thanks for post it.
Thanks, I’m glad it helped.
Hi all,
i have one question
i am writing script in pyhton
with codecs.open(path_inc, 'r', 'utf-8', errors='ignore') as file_handle:
    for lineinc in file_handle:
        if flag1 == 0:
            flag1 = 1
            continue
        #print (lineinc)
        word1 = lineinc.split()
        if first == 0: #to store the current date we are doing first == 0
            #print(word1[0])
            current_date = word1[0]
            #print(current_date)
            first = 1
        if current_date == word1[0]:
            #print(word1[:0])
            #print(lineinc)
            for item in Incidentnumberlist:
                if word1[5] == item:
                    f1.write("\nIncident " + item + " present in .inc file")
                else:
                    break
above i mentioned my code
this will print the output
Incident 11-5171 present in .inc file
Incident 11-5171 present in .inc file
Incident 10-0210 present in .inc file
Incident 10-0210 present in .inc file
Incident 10-0210 present in .inc file
Incident 10-0210 present in .inc file
Incident 10-0210 present in .inc file
it is printing repeated lines
i dont want repeated lines how to remove them
This is a common question that I answer here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/can-you-read-review-or-debug-my-code
Wouldn’t it be way faster to use translate on the entire text at once, rather than word-by-word?
Perhaps.
This is helpful, thanks for sharing!
I’m pretty new to python, but this made it easy to understand. Is there another step to export the new file after cleaning? Thanks in advance.
Thanks, I’m happy it helped!
You can save the array to file directly, e.g. savetext:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
‘\ufeffOne morning, when Gregor Samsa woke from troubled dreams, he found\r\nhimself transformed in his bed into a horrible vermin.’
how to remove feff from the text , even replace aint working neither encoding.please help
Perhaps filter out non ascii chars from the text.
this is a super easy-to-follow and useful tutorial–many thanks.
Thanks!
how can i find unique sms templates using ML
Sorry, I don’t understand. What is an SMS template and how does machine learning help exactly?
how would you recommend handling large Text documents?
Like too large to fit into memory? You can get large machines these days in ec2…
I would use progressive loading to walk the doc process line by line then output the processed data.
Thank you Jason, excellent post! I have a quick question: is it always a good practice to start text preprocessing from tokenization? I am assuming yes but need validation because the articles that I have been reading about mining social media text many of them seem to start with text normalization (e.g. convert into lowercases, remove slangs, users ids, whitespaces, convert urls with the term url , create ad hoc dictionary to replace medical jargons), and then proceed to tokenization. If a I was about to collect data from an online forum would be better to start with tokenization then proceed with the above text normalization techniques? Thank you in advance!
Start with choosing a vocab/cleaning, then tokenize.
Hi Jason, My name is Isak and I from Indonesia. So, sorry for my bad English.
I am currently working on a thesis about usage of text mining in the complaint management system application. The method I use is SVM with the One vs Rest (One vs All) approach because the number of output classes is 6.
The aim is to classify the students’ aspirational texts. Classification results are used to determine the unit or department on campus that can follow up on a aspiration or a complaint.
What I want to ask is about the stages of text normalization in the preprocessing process.
For example for the word “slow” in the text of aspirations about internet connection.
In Indonesian the word “slow” in comments about the internet can be translated as “lamban”, “lambat”. People in Indonesia also use words like “lag”.
Is it necessary to change / uniform the form of words that have the same meaning?
For example: changing the words “lamban”, “lambat”, “lag” to 1 word (just “lambat”)
The case example above is a real example that occurred in my dataset. The use of the words: “lamban”, “lambat”, and “lag” makes me confused when doing preprocessing.
Thank you in advance!!
No, with enough examples the model will have sufficient context to tell the difference between different word usages.
Hand crafted fixes like this are very fragile in general.
Hi, Jason!
Thanks for your labor! What do you think, does it generally make sense to replace rare specific symbols i.e. fi ’ ○ (\ufb01 \u2019 \u25cb) with simpler analogs i.e. fi, ‘, *? Any recommendations on this?
Kind regards.
Perhaps try it vs removing them completely, fit a model on each and see which performs better.
The link to download text file has died (I guess, coz i can not access) Metamorphosis by Franz Kafka Plain Text UTF-8. Could you provide another way to obtain the text file. Thank you!
Here is the direct link from the post, it works fine:
http://www.gutenberg.org/cache/epub/5200/pg5200.txt
Jason, thanks for your sharing.
I need your advice:
What is the best way to remove header and\or footer, while extracting text from PDF?
Note: First, I converted all pdf files to text using XpdfReader, then imported them to python.
Best.
No idea, perhaps experiment with a few methods.
Hi Jason,
Thank you for this very informative page. I have conducted this exercise, however how will I go about saving the output in a csv file? Let’s say I have loaded a CSV file in python with pandas, and applied these techniques to one of the columns – how will now go about to save these changes and export to CSV? Thank you
This will help you save an array to a file:
https://machinelearningmastery.mystagingwebsite.com/how-to-save-a-numpy-array-to-file-for-machine-learning/
Is there any post to help with Transliteration of characters from other languages into English as I’ve attempted to use googletrans to translate it but it is unreliable as it sometimes doesn’t actually translate and other times gives and error. As well as transdetect which I used to detect the language and delete if it doesn’t equal to “en”. Neither of these methods worded. Please let me know if there is any reliable way
Perhaps start here:
https://machinelearningmastery.mystagingwebsite.com/introduction-neural-machine-translation/
And here:
https://machinelearningmastery.mystagingwebsite.com/develop-neural-machine-translation-system-keras/
Hi Jason
I am trying to execute the below code,but getting error as “no display name and no $DISPLAY environment variable”.Can you plz tell me what I am missing.The error is in the last line “named_entity.draw()”.
Here is the code:
text= “the taj mahal was built by emperor shah jahan”
words=nltk.word_tokenize(text)
tagged_word_name=nltk.pos_tag(words)
named_entity=nltk.ne_chunk(tagged_word_name)
named_entity.draw()
Sorry to hear that, perhaps try posting your code and error on stackoverflow.com.
I’ve only tried using Regex to extract Unicode characters from text file and storing them separately in a list to be used for some specific task. For someone starting out his journey in NLP just like myself, I would say, this is amazing.
Thanks Jason!
Well done!
You’re very welcome.