Text data preparation is different for each problem.
Preparation starts with simple steps, like loading data, but quickly gets difficult with cleaning tasks that are very specific to the data you are working with. It helps to know where to begin and in what order to work through the steps, from raw data to data ready for modeling.
In this tutorial, you will discover how to prepare movie review text data for sentiment analysis, step-by-step.
After completing this tutorial, you will know:
- How to load text data and clean it to remove punctuation and other non-words.
- How to develop a vocabulary, tailor it, and save it to file.
- How to prepare movie reviews using cleaning and a pre-defined vocabulary and save them to new files ready for modeling.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Update Oct/2017: Fixed a small bug when skipping non-matching files, thanks Jan Zett.
- Update Dec/2017: Fixed a small typo in full example, thanks Ray and Zain.
- Update Aug/2020: Updated link to movie review dataset.

How to Prepare Movie Review Data for Sentiment Analysis
Photo by Kenneth Lu, some rights reserved.
Tutorial Overview
This tutorial is divided into 5 parts; they are:
- Movie Review Dataset
- Load Text Data
- Clean Text Data
- Develop Vocabulary
- Save Prepared Data
1. Movie Review Dataset
The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.
The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as “v2.0“.
The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the “polarity dataset“.
Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.
— A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.
The data has been cleaned up somewhat, for example:
- The dataset is comprised of only English reviews.
- All text has been converted to lowercase.
- There is white space around punctuation like periods, commas, and brackets.
- Text has been split into one sentence per line.
The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70s to low 80s percent accuracy (e.g. 78% to 82%).
More sophisticated data preparation may see results as high as 86% accuracy with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments with modern methods.
… depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)
— A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.
You can download the dataset from here:
- Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB)
After unzipping the file, you will have a directory called “txt_sentoken” with two sub-directories, “neg” and “pos”, containing the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention from cv000 to cv999 for each of neg and pos.
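If you prefer to unpack the archive from Python rather than with a desktop tool, a minimal sketch using the standard library tarfile module might look like the following. It assumes the downloaded archive is saved as review_polarity.tar.gz in the current working directory; adjust the path if you saved it elsewhere.

import tarfile

# unpack the movie review archive into the current working directory
# (assumes review_polarity.tar.gz has already been downloaded)
with tarfile.open('review_polarity.tar.gz', 'r:gz') as archive:
    archive.extractall('.')
# the reviews are now available under txt_sentoken/neg and txt_sentoken/pos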
Next, let’s look at loading the text data.
2. Load Text Data
In this section, we will look at loading individual text files, then processing the directories of files.
We will assume that the review data is downloaded and available in the current working directory in the folder “txt_sentoken“.
We can load an individual text file by opening it, reading in the ASCII text, and closing the file. This is standard file handling stuff. For example, we can load the first negative review file “cv000_29416.txt” as follows:
# load one file
filename = 'txt_sentoken/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
This loads the document as ASCII and preserves any white space, like new lines.
We can turn this into a function called load_doc() that takes a filename of the document to load and returns the text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
We have two directories, each with 1,000 documents. We can process each directory in turn by first getting a list of files in the directory using the listdir() function, then loading each file in turn.
For example, we can load each document in the negative directory using the load_doc() function to do the actual loading.
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# specify directory to load
directory = 'txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
    # skip files that do not have the right extension
    if not filename.endswith(".txt"):
        continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load document
    doc = load_doc(path)
    print('Loaded %s' % filename)
Running this example prints the filename of each review after it is loaded.
...
Loaded cv995_23113.txt
Loaded cv996_12447.txt
Loaded cv997_5152.txt
Loaded cv998_15691.txt
Loaded cv999_14636.txt
We can turn the processing of the documents into a function as well and use it as a template later for developing a function to clean all documents in a folder. For example, below we define a process_docs() function to do the same thing.
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)

# specify directory to load
directory = 'txt_sentoken/neg'
process_docs(directory)
Now that we know how to load the movie review text data, let’s look at cleaning it.
3. Clean Text Data
In this section, we will look at what data cleaning we might want to do to the movie review data.
We will assume that we will be using a bag-of-words model or perhaps a word embedding that does not require too much preparation.
Split into Tokens
First, let’s load one document and look at the raw tokens split by white space. We will use the load_doc() function developed in the previous section. We can use the split() function to split the loaded document into tokens separated by white space.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)
Running the example gives a nice long list of raw tokens from the document.
...
'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento', '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']
Just looking at the raw tokens can give us a lot of ideas of things to try, such as:
- Remove punctuation from words (e.g. ‘what’s’).
- Remove tokens that are just punctuation (e.g. ‘-’).
- Remove tokens that contain numbers (e.g. ‘10/10’).
- Remove tokens that have one character (e.g. ‘a’).
- Remove tokens that don’t have much meaning (e.g. ‘and’).
Some ideas:
- We can filter out punctuation from tokens using the string translate() function.
- We can remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
- We can remove English stop words using the list loaded using NLTK.
- We can filter out short tokens by checking their length.
Below is an updated version of cleaning this review.
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)
Running the example gives a much cleaner-looking list of tokens.
...
'explanation', 'craziness', 'came', 'oh', 'way', 'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'wheres', 'joblo', 'coming', 'nightmare', 'elm', 'street', 'blair', 'witch', 'crow', 'crow', 'salvation', 'lost', 'highway', 'memento', 'others', 'stir', 'echoes']
We can put this into a function called clean_doc() and test it on another review, this time a positive review.
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)
Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut.
...
'comic', 'oscar', 'winner', 'martin', 'childs', 'shakespeare', 'love', 'production', 'design', 'turns', 'original', 'prague', 'surroundings', 'one', 'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']
There are many more cleaning steps we could take and I leave them to your imagination.
Next, let’s look at how we can manage a preferred vocabulary of tokens.
4. Develop Vocabulary
When working with predictive models of text, like a bag-of-words model, there is a pressure to reduce the size of the vocabulary.
The larger the vocabulary, the more sparse the representation of each word or document.
A part of preparing text for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model.
We can do this by loading all of the documents in the dataset and building a set of words. We may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to file for later use, such as filtering words in new documents in the future.
We can keep track of the vocabulary in a Counter, which is a dictionary of words and their count with some additional convenience functions.
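As a quick, toy illustration of how a Counter behaves (the words below are made up and are not drawn from the dataset):

from collections import Counter

# a Counter accumulates word counts across repeated updates
vocab = Counter()
vocab.update(['film', 'one', 'film'])
vocab.update(['movie', 'film'])
print(vocab)                 # Counter({'film': 3, 'one': 1, 'movie': 1})
print(vocab.most_common(1))  # [('film', 3)]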
We need to develop a new function to process a document and add it to the vocabulary. The function needs to load a document by calling the previously developed load_doc() function. It needs to clean the loaded document using the previously developed clean_doc() function, then it needs to add all the tokens to the Counter, and update counts. We can do this last step by calling the update() function on the counter object.
Below is a function called add_doc_to_vocab() that takes as arguments a document filename and a Counter vocabulary.
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)
Finally, we can take our template from above for processing all documents in a directory, process_docs(), and update it to call add_doc_to_vocab().
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)
We can put all of this together and develop a full vocabulary from all documents in the dataset.
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
Running the example creates a vocabulary with all documents in the dataset, including positive and negative reviews.
We can see that there are a little over 46,000 unique words across all reviews and the top 3 words are ‘film‘, ‘one‘, and ‘movie‘.
46557
[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]
Perhaps the least common words, those that only appear once across all reviews, are not predictive. Perhaps some of the most common words are not useful either.
These are good questions and really should be tested with a specific predictive model.
Generally, words that only appear once or a few times across 2,000 reviews are probably not predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model.
We can do this by stepping through words and their counts and only keeping those with a count at or above a chosen threshold. Here, we will use a minimum of 5 occurrences.
# keep tokens with a min occurrence of 5
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
This reduces the vocabulary from 46,557 to 14,803 words, a huge drop. Perhaps a minimum of 5 occurrences is too aggressive; you can experiment with different values.
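If you want to see how sensitive the vocabulary size is to this choice before committing, a small sketch like the one below (run after the vocab Counter has been built, as above) prints the size for a few candidate thresholds; the threshold values are arbitrary examples.

# compare vocabulary sizes for a few candidate minimum-occurrence thresholds
for min_occurrence in [2, 5, 10, 20]:
    size = len([k for k, c in vocab.items() if c >= min_occurrence])
    print('min_occurrence=%d -> %d tokens' % (min_occurrence, size))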
We can then save the chosen vocabulary of words to a new file. I like to save the vocabulary as ASCII with one word per line.
Below, we define a function called save_list() to save a list of items (in this case, tokens) to file, one per line.
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
The complete example for defining and saving the vocabulary is listed below.
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens with a min occurrence of 5
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
Running this final snippet after creating the vocabulary will save the chosen words to file.
It is a good idea to take a look at, and even study, your chosen vocabulary in order to get ideas for better preparing this data, or text data in the future.
hasnt
updating
figuratively
symphony
civilians
might
fisherman
hokum
witch
buffoons
...
Next, we can look at using the vocabulary to create a prepared version of the movie review dataset.
5. Save Prepared Data
We can use the data cleaning and chosen vocabulary to prepare each movie review and save the prepared versions of the reviews ready for modeling.
This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data prep if you have new ideas.
We can start off by loading the vocabulary from ‘vocab.txt‘.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
Next, we can clean the reviews, use the loaded vocab to filter out unwanted tokens, and save the clean reviews in a new file.
One approach could be to save all the positive reviews in one file and all the negative reviews in another file, with the filtered tokens separated by white space for each review on separate lines.
First, we can define a function to process a document, clean it, filter it, and return it as a single line that could be saved in a file. Below defines the doc_to_line() function to do just that, taking a filename and vocabulary (as a set) as arguments.
It calls the previously defined load_doc() function to load the document and clean_doc() to tokenize the document.
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)
Next, we can define a new version of process_docs() to step through all reviews in a folder and convert them to lines by calling doc_to_line() for each document. A list of lines is then returned.
# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
We can then call process_docs() for both the directories of positive and negative reviews, then call save_list() from the previous section to save each list of processed reviews to a file.
The complete code listing is provided below.
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs('txt_sentoken/neg', vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
save_list(positive_lines, 'positive.txt')
Running the example saves two new files, ‘negative.txt‘ and ‘positive.txt‘, that contain the prepared negative and positive reviews respectively.
The data is ready for use in a bag-of-words or even word embedding model.
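As a taste of that next step (not covered in this tutorial), a minimal sketch of loading the two prepared files and encoding them as bag-of-words vectors with the Keras Tokenizer might look like the code below. The file names match those saved above; everything else is illustrative, and the import path may differ depending on your Keras version (e.g. tensorflow.keras).

from keras.preprocessing.text import Tokenizer

# each line in the prepared files is one cleaned review
neg_lines = open('negative.txt', 'r').read().splitlines()
pos_lines = open('positive.txt', 'r').read().splitlines()
docs = neg_lines + pos_lines
# class labels: 0 for negative, 1 for positive
labels = [0] * len(neg_lines) + [1] * len(pos_lines)

# encode the reviews as bag-of-words frequency vectors
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
X = tokenizer.texts_to_matrix(docs, mode='freq')
print(X.shape)  # (2000, vocabulary size + 1)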
Extensions
This section lists some extensions that you may wish to explore.
- Stemming. We could reduce each word in the documents to its stem using a stemming algorithm like the Porter stemmer (a brief sketch of this and the next idea follows this list).
- N-Grams. Instead of working with individual words, we could work with a vocabulary of word pairs, called bigrams. We could also investigate the use of larger groups, such as triplets (trigrams) and more (n-grams).
- Encode Words. Instead of saving tokens as-is, we could save the integer encoding of the words, where the index of the word in the vocabulary represents a unique integer number for the word. This will make it easier to work with the data when modeling.
- Encode Documents. Instead of saving tokens in documents, we could encode the documents using a bag-of-words model and encode each word as a boolean present/absent flag or use more sophisticated scoring, such as TF-IDF.
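A minimal sketch of the first two ideas is given below, using NLTK's PorterStemmer and simple adjacent-pair bigrams on a short, made-up token list standing in for the output of clean_doc(); it is illustrative only.

from nltk.stem.porter import PorterStemmer

# a made-up cleaned token list for illustration
tokens = ['films', 'acting', 'performances', 'great']

# Stemming: reduce each token to its stem
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed)   # e.g. ['film', 'act', 'perform', 'great']

# N-Grams: build bigrams from pairs of adjacent tokens
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)   # ['films acting', 'acting performances', 'performances great']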
If you try any of these extensions, I’d love to know.
Share your results in the comments below.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.
APIs
- nltk.tokenize package API
- Chapter 2, Accessing Text Corpora and Lexical Resources
- os API – Miscellaneous operating system interfaces
- collections API – Container datatypes
Summary
In this tutorial, you discovered how to prepare movie review text data for sentiment analysis, step-by-step.
Specifically, you learned:
- How to load text data and clean it to remove punctuation and other non-words.
- How to develop a vocabulary, tailor it, and save it to file.
- How to prepare movie reviews using cleaning and a predefined vocabulary and save them to new files ready for modeling.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Thank you, Jason. Very interesting work. Tell me please, how can we implement the N-Grams extension? Can we use some pre-trained models here, like GloVe?
I hope to have an example on the blog soon.
Thank you.
Thank you, Dr.Jason.
I’ve used the built-in function in Keras to load the IMDB dataset. That is “(X_train, y_train), (X_test, y_test) = imdb.load_data()” via “from keras.datasets import imdb”. I’m confused about the difference between the IMDB dataset I’ve loaded with “imdb.load_data()” and the IMDB dataset you used in this post. The former contains 25,000 highly polar movie reviews, and the latter contains only 2,000 reviews. So what is the IMDB dataset exactly? If I want to construct a deep learning model to do sentiment analysis, which dataset should I use?
I’m looking forward to your reply. Thanks!
They are different datasets, both intended for educational purposes only – e.g. learning how to develop models.
I would recommend collecting data that is representative of the problem that you are trying to solve.
Thank you for your reply!
When I use the built-in function to load the IMDB dataset, it has already been pre-processed and the words have been represented as integer indexes. Is there any way to get the raw data?
Perhaps here:
http://ai.stanford.edu/~amaas/data/sentiment/
Jason, help me please.
If we develop LSTM RNN with Embedding layer, can the network learn the relationships between words?
Yes. The embedding itself will learn representations about how words are used.
An LSTM can learn about the importance of words in different positions, depending on the application.
Do you have a specific domain in mind?
Hey Jason, thank you for your great work. I really like your blog and already learned a lot!
I’m not sure if you noticed, but there is a tiny bug in your code. It’s not that important, but when you’re trying to skip files in your directory which do not end in .txt, you use next instead of continue, which doesn’t have the desired effect in this context.
Thanks Jan, fixed! I guess I was thinking in Ruby or something…
Thanks for the feedback, Jason. It is very interesting. I am trying to understand, and to project these ideas onto different domains… thanks for the inspiration.
Thanks Alexander.
Thanks a ton for such a post; it will help a lot of people who are reskilling into data science.
Thanks Debendra.
You have simply inspired many growing machine learning professionals to reach their career goals.
I’m glad to hear that.
Hi Jason, your work and examples are always detailed and useful. You are a one-in-a-thousand teacher. I have found your examples thorough, useful and transferable.
Note that LINE 74 in the complete code is missing a colon….just something to note for those copying and pasting to run locally.
Thanks Ray!
You mean this line:
If so, how is it missing a colon?
Ray is actually referring to the missing quote before ‘txt_sentoken/pos’.
Fixed, thanks guys!
and line 46 should be
tokens = [w for w in tokens if w not in vocab]
Thanks for putting up these great tutorials.. they really help!
I don’t think so. We are trying to only keep words from doc that are in vocab.
I want this code for Python 2.7.
Try it.
Hey Jason, all I want to do is raw_input a review and have the code return a single word saying whether it’s negative or positive. I searched the whole internet and can’t find it. Can this code do that with a little editing? If not, where can I find this kind of code?
You could adapt it to do that. See this post:
https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
Hi Dr. Jason, I’m kind of a newbie in data science. Currently, I’m doing a project in RapidMiner using Twitter search and sentiment analysis. I’m trying to find a way to prove that Marvel movies are better than DC movies, and also trying to extract new attributes from the data that has been collected. For example, what kinds of words (common words) are used to describe the Avengers, and what words are used to describe the positive, negative, and neutral tweets. So far, I have no idea how to do that. I have already collected the data using the Twitter search and sentiment analysis, but the later part is a puzzler. Can you please help me?
Sounds like fun.
Sorry, I don’t have good suggestions for collecting twitter data.
Hey Jason Brownlee, thank you for your great work. I’m thankful.
Thanks.
Hello Jason, thanks for your great work. I have a question: I am going to create a project which involves automatic text classification for documents. My problem is that the project will have dynamically defined categories, which means if I have a dataset for 5 categories now, then when new categories are added I have to add another dataset for them. What I want is for my project to automatically adopt new categories without adding an additional dataset for them. Thank you.
I’m not sure off hand, that may require some very careful design.
Hi Jason,
amazing work as always. I’m surprised no-one has commented on this but once you change your process_docs method to load a pre-made doc, you lose the opportunity to create a new vocabulary. I guess that’s why the code from the end of your tutorial works for me but vocab size is all 0 (unless I have some other problem).
Anyway I added a function get_corpus_vocab which is basically your version of process_docs from back when it could still be used to build a new vocabulary. Maybe it would be good to add? I can add/send the full code if you like.
Katherine
Very cool, thanks for sharing!
Thanks for the article.
Which tools/methods/models can be used to infer useful information for an event organizer based on the customer reviews?
Perhaps look into text mining tools to extract more than just sentiment?
What do we do with the test dataset? The dataset is split into two parts, train and test. If I load the train dataset and further split it into two sets for training the model, then how do I use the test dataset? Can you please explain and help?
Do you mean in general, or do you mean in this tutorial specifically?
In this tutorial, I show exactly how to load and handle the data.
Thank you Jason for this amazing tutorial. I am working on a movie system based on sentiment in the comments. Do you have other tutorials for training, classifying (Naive Bayes based), and predicting data?
That would be really helpful.
Cheers
Many, perhaps start here:
https://machinelearningmastery.com/start-here/
I am sorry, but could you be more specific, please? I am so confused. I have to make an online movie-based sentiment system and I am stuck after data pre-processing.
Perhaps the above tutorial would provide a good template for your project?
So I have 2 files now, positive.txt and negative.txt. What next?
Perhaps try loading them into memory.
Very useful report. Thanks for sharing!
Thanks!
Thank you very much, it is a great explanation.