Deep Convolutional Neural Network for Sentiment Analysis (Text Classification)

By Jason Brownlee on September 3, 2020 in Deep Learning for Natural Language Processing

Develop a Deep Learning Model to Automatically Classify Movie Reviews
as Positive or Negative in Python with Keras, Step-by-Step.

Word embeddings are a technique for representing text where different words with similar meaning have a similar real-valued vector representation.

They are a key breakthrough that has led to great performance of neural network models on a suite of challenging natural language processing problems.

In this tutorial, you will discover how to develop word embedding models for neural networks to classify movie reviews.

After completing this tutorial, you will know:

How to prepare movie review text data for classification with deep learning methods.
How to learn a word embedding as part of fitting a deep learning model.
How to learn a standalone word embedding and how to use a pre-trained embedding in a neural network model.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Nov/2019: Fixed code typo when preparing training dataset (thanks HSA).
Update Aug/2020: Updated link to movie review dataset.

How to Develop a Word Embedding Model for Predicting Movie Review Sentiment
Photo by Katrina Br*?#*!@nd, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

Movie Review Dataset
Data Preparation
Train Embedding Layer
Train word2vec Embedding
Use Pre-trained Embedding

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.2 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this tutorial:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

A GPU is not required for this tutorial; nevertheless, you can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

How to Setup Amazon AWS EC2 GPUs to Train Keras Deep Learning Models (step-by-step)

Let’s dive in.

1. Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as “v2.0”.

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at imdb.com. The authors refer to this dataset as the “polarity dataset.”

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

— A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat, for example:

The dataset is comprised of only English reviews.
All text has been converted to lowercase.
There is white space around punctuation like periods, commas, and brackets.
Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of machine learning models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments of modern methods.

… depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

— A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB)

After unzipping the file, you will have a directory called “txt_sentoken” with two sub-directories, “neg” and “pos”, containing the negative and positive reviews respectively. Reviews are stored one per file with a naming convention of cv000 to cv999 for each of neg and pos.

Next, let’s look at loading and preparing the text data.

2. Data Preparation

In this section, we will look at 3 things:

Separation of data into training and test sets.
Loading and cleaning the data to remove punctuation and numbers.
Defining a vocabulary of preferred words.

Split into Train and Test Sets

We are pretending that we are developing a system that can predict the sentiment of a textual movie review as either positive or negative.

This means that after the model is developed, we will need to make predictions on new textual reviews. This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the model.

We will ensure that this constraint is built into the evaluation of our models by splitting the training and test datasets prior to any data preparation. This means that any knowledge in the test set that could help us better prepare the data (e.g. the words used) is unavailable when preparing the data used for training the model.

That being said, we will use the last 100 positive reviews and the last 100 negative reviews as a test set (200 reviews) and the remaining 1,800 reviews as the training dataset.

This is a 90% train, 10% test split of the data.

The split can be imposed easily by using the filenames of the reviews where reviews named 000 to 899 are for training data and reviews named 900 onwards are for test.
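
The filename-based split can be sketched in a few lines. This is only a minimal illustration, assuming the cvNNN_ naming convention described above; the helper name is_test_review() and the second and third filenames are made up for the example.

```python
# Hypothetical helper: decide whether a review file belongs to the test set
# based on its numeric prefix (cv900 to cv999 are held out for testing).
def is_test_review(filename):
    # filenames look like 'cv000_29590.txt'; characters 2-4 hold the review number
    number = int(filename[2:5])
    return number >= 900

# illustrative filenames (only the first one appears in this tutorial)
for name in ['cv000_29590.txt', 'cv899_00000.txt', 'cv900_00000.txt']:
    split = 'test' if is_test_review(name) else 'train'
    print(name, '->', split)
```

Checking the prefix 'cv9' with startswith(), as the listings in this tutorial do, gives the same result as long as every file follows the three-digit convention.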

Loading and Cleaning Reviews

The text data is already pretty clean; not much preparation is required.

If you are new to cleaning text data, see this post:

How to Clean Text for Machine Learning with Python

Without getting bogged down too much in the details, we will prepare the data in the following way:

Split tokens on white space.
Remove all punctuation from words.
Remove all words that are not purely comprised of alphabetical characters.
Remove all words that are known stop words.
Remove all words that have a length <= 1 character.

We can put all of these steps into a function called clean_doc() that takes as an argument the raw text loaded from a file and returns a list of cleaned tokens. We can also define a function load_doc() that loads a document from file ready for use with the clean_doc() function.

An example of cleaning the first positive review is listed below.

from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

Running the example prints a long list of clean tokens.

There are many more cleaning steps we may want to explore and I leave them as further exercises.

I’d love to see what you can come up with.
Post your approaches and findings in the comments at the end.

...
'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']
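
Tokens such as 'wasnt' and 'violencegore' in the output are a side effect of deleting punctuation inside words rather than splitting on it. The short sketch below reproduces that effect on two input strings that are assumed for illustration; they are not quoted from the review file.

```python
import string

# translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

# illustrative inputs, not taken verbatim from the dataset
examples = ["wasn't", "violence/gore"]
cleaned = [w.translate(table) for w in examples]
print(cleaned)
```

Replacing punctuation with a space before splitting would be one alternative to explore, at the cost of producing fragments such as a lone 't'.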

Define a Vocabulary

It is important to define a vocabulary of known words when using a bag-of-words or embedding model.

The more words, the larger the representation of documents, therefore it is important to constrain the words to only those believed to be predictive. This is difficult to know beforehand and often it is important to test different hypotheses about how to construct a useful vocabulary.

We have already seen how we can remove punctuation and numbers from the vocabulary in the previous section. We can repeat this for all documents and build a set of all known words.

We can develop a vocabulary as a Counter, which is a dictionary mapping of words to their counts that allows us to easily update and query.
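
As a small sketch of how a Counter behaves, the snippet below updates one with two toy token lists. The lists are made-up stand-ins for cleaned reviews, not real data.

```python
from collections import Counter

# toy token lists standing in for two cleaned reviews (made-up data)
doc1 = ['film', 'good', 'film', 'plot']
doc2 = ['film', 'bad', 'plot']

vocab = Counter()
vocab.update(doc1)  # add counts from the first document
vocab.update(doc2)  # counts accumulate across documents

print(vocab['film'])         # query the count of a single word
print(vocab.most_common(2))  # most frequent words first
```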

Each document can be added to the counter (a new function called add_doc_to_vocab()) and we can step over all of the reviews in the negative directory and then the positive directory (a new function called process_docs()).

The complete example is listed below.

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	# walk through all files in the folder
	for filename in listdir(directory):
		# when building the training set, skip the held-out test reviews
		if is_train and filename.startswith('cv9'):
			continue
		# when building the test set, skip the training reviews
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

Running the example shows that we have a vocabulary of 44,276 words.

We also can see a sample of the top 50 most used words in the movie reviews.

Note that this vocabulary was constructed based only on the reviews in the training dataset.

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]

We can step through the vocabulary and remove all words that have a low occurrence, such as only being used once or twice in all reviews.

For example, the following snippet will retrieve only the tokens that appear 2 or more times across all reviews.

# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

Running the above example with this addition shows that the vocabulary size drops to a little more than half its original size, from 44,276 to 25,767 words.

25767
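
As a quick check on the two vocabulary sizes quoted above, the arithmetic can be worked through directly. The numbers are the ones reported in this tutorial; your own counts may differ slightly depending on the stop word list installed.

```python
# vocabulary sizes reported in the text above
full_size = 44276
filtered_size = 25767

removed = full_size - filtered_size
print('kept:    %.1f%%' % (100.0 * filtered_size / full_size))
print('removed: %.1f%% (%d words)' % (100.0 * removed / full_size, removed))
```

Roughly 58% of the words are kept, so about 42% of the vocabulary consisted of words that appear only once.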

Finally, the vocabulary can be saved to a new file called vocab.txt that we can later load and use to filter movie reviews prior to encoding them for modeling. We define a new function called save_list() that saves the vocabulary to file, with one word per line.

For example:

# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open file
	file = open(filename, 'w')
	# write text
	file.write(data)
	# close file
	file.close()

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

Running the min occurrence filter on the vocabulary and saving it to file, you should now have a new file called vocab.txt with only the words we are interested in.

The order of words in your file will differ, but should look something like the following:

aberdeen
dupe
burt
libido
hamlet
arlene
available
corners
web
columbia
...
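
To confirm that a vocabulary survives the trip to disk and back, a round trip can be sketched as follows. This is an illustration under stated assumptions: load_vocab() is a hypothetical helper name, the three tokens are copied from the sample above, and a temporary file is used so that a real vocab.txt is not overwritten.

```python
import os
import tempfile

# write one token per line (same idea as save_list() above)
def save_list(lines, filename):
    with open(filename, 'w') as file:
        file.write('\n'.join(lines))

# hypothetical helper: read the file back into a set for fast membership tests
def load_vocab(filename):
    with open(filename, 'r') as file:
        return set(file.read().split())

# a few sample tokens and a temporary path, used only for this demonstration
tokens = ['aberdeen', 'dupe', 'burt']
path = os.path.join(tempfile.mkdtemp(), 'vocab_demo.txt')
save_list(tokens, path)
loaded = load_vocab(path)
print(loaded == set(tokens))
```

Using a with block closes the file automatically, which is a common alternative to calling close() by hand.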

We are now ready to look at learning features from the reviews.

3. Train Embedding Layer

In this section, we will learn a word embedding while training a neural network on the classification problem.

A word embedding is a way of representing text where each word in the vocabulary is represented by a real valued vector in a high-dimensional space. The vectors are learned in such a way that words that have similar meanings will have similar representation in the vector space (close in the vector space). This is a more expressive representation for text than more classical methods like bag-of-words, where relationships between words or tokens are ignored, or forced in bigram and trigram approaches.
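
The idea of words being "close in the vector space" can be made concrete with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are invented purely for illustration; they are not learned from the review data, and a real embedding would typically have many more dimensions.

```python
from math import sqrt

# cosine similarity: 1.0 means same direction, 0.0 means orthogonal
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical vectors, chosen so that 'good' and 'great' point the same way
vectors = {
    'good':  [0.9, 0.1, 0.3],
    'great': [0.8, 0.2, 0.4],
    'awful': [-0.7, 0.1, 0.2],
}

print(cosine(vectors['good'], vectors['great']))
print(cosine(vectors['good'], vectors['awful']))
```

With these toy values the first similarity is close to 1 and the second is negative, which is the kind of structure a learned embedding is hoped to capture.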

The real valued vector representation for words can be learned while training the neural network. We can do this in the Keras deep learning library using the Embedding layer.

If you are new to word embeddings, see the post:

What Are Word Embeddings for Text?

If you are new to word embedding layers in Keras, see the post:

How to Use Word Embedding Layers for Deep Learning with Keras

The first step is to load the vocabulary. We will use it to filter out words from movie reviews that we are not interested in.

If you have worked through the previous section, you should have a local file called ‘vocab.txt’ with one word per line. We can load that file and build a vocabulary as a set for checking the validity of tokens.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

Next, we need to load all of the training data movie reviews. For that we can adapt the process_docs() from the previous section to load the documents, clean them, and return them as a list of strings, with one document per string. We want each document to be a string for easy encoding as a sequence of integers later.

Cleaning the document involves splitting each review based on white space, removing punctuation, and then filtering out all tokens not in the vocabulary.

The updated clean_doc() function is listed below.

# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

The updated process_docs() can then call clean_doc() for each document in the ‘pos’ and ‘neg’ directories that belongs to our training dataset.

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# when building the training set, skip the held-out test reviews
		if is_train and filename.startswith('cv9'):
			continue
		# when building the test set, skip the training reviews
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

The next step is to encode each document as a sequence of integers.

The Keras Embedding layer requires integer inputs where each integer maps to a single token that has a specific real-valued vector representation within the embedding. These vectors are random at the beginning of training, but during training become meaningful to the network.

We can encode the training documents as sequences of integers using the Tokenizer class in the Keras API.

First, we must construct an instance of the class then train it on all documents in the training dataset. In this case, it develops a vocabulary of all tokens in the training dataset and develops a consistent mapping from words in the vocabulary to unique integers. We could just as easily develop this mapping ourselves using our vocabulary file.

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)
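
The remark that the mapping could be built by hand can be sketched in plain Python. This is a rough illustration, not a replacement for the Tokenizer: the vocabulary and document are made up, and the integers are assigned in alphabetical order for reproducibility, whereas the Keras Tokenizer assigns lower integers to more frequent words.

```python
# made-up vocabulary and document for illustration
vocab = {'plot', 'film', 'good', 'bad'}
doc = 'good film bad plot film'

# assign integers starting at 1 so that 0 stays free for padding;
# sorting only makes the mapping reproducible
word_index = {word: i for i, word in enumerate(sorted(vocab), start=1)}

# encode the document, skipping any word that is not in the mapping
encoded = [word_index[w] for w in doc.split() if w in word_index]
print(word_index)
print(encoded)
```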

Now that the mapping of words to integers has been prepared, we can use it to encode the reviews in the training dataset. We can do that by calling the texts_to_sequences() function on the Tokenizer.

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)

We also need to ensure that all documents have the same length.

This is a requirement of Keras for efficient computation. We could truncate reviews to the smallest size or zero-pad (pad with the value ‘0’) reviews to the maximum length, or some hybrid. In this case, we will pad all reviews to the length of the longest review in the training dataset.

First, we can find the longest review using the max() function on the training dataset and take its length. We can then call the Keras function pad_sequences() to pad the sequences to the maximum length by adding 0 values on the end.

# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
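
To see what padding at the end does to the data, here is a small pure-Python sketch. The helper pad_post() is hypothetical and only covers the case used in this tutorial, where no sequence is longer than max_length, so truncation is not handled.

```python
# hypothetical helper mimicking padding='post' for sequences that are
# no longer than max_length (truncation is deliberately left out)
def pad_post(sequences, max_length, value=0):
    return [seq + [value] * (max_length - len(seq)) for seq in sequences]

# made-up integer-encoded reviews of different lengths
encoded_docs = [[3, 2, 1], [4], [2, 2, 1, 4, 3]]
max_length = max(len(seq) for seq in encoded_docs)
padded = pad_post(encoded_docs, max_length)
for row in padded:
    print(row)
```

Every row ends up with the same length, which is what allows the reviews to be stacked into a single 2D array.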

Finally, we can define the class labels for the training dataset, needed to fit the supervised neural network model to predict the sentiment of reviews.

# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

We can then encode and pad the test dataset, needed later to evaluate the model after we train it.

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

We are now ready to define our neural network model.

The model will use an Embedding layer as the first hidden layer. The Embedding requires the specification of the vocabulary size, the size of the real-valued vector space, and the maximum length of input documents.

The vocabulary size is the total number of words in our vocabulary, plus one because the Tokenizer assigns integers starting at 1 and reserves 0, which we use here as the padding value. This could be based on the vocab set length or the size of the vocab within the tokenizer used to integer encode the documents, for example:

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
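
The reason for the + 1 can be checked on a toy mapping. The dictionary below is made up, but it has the same shape as tokenizer.word_index, where integers start at 1.

```python
# made-up mapping in the same shape as tokenizer.word_index
word_index = {'film': 1, 'one': 2, 'movie': 3}

vocab_size = len(word_index) + 1   # rows needed for indices 0..3
largest_index = max(word_index.values())

# every index, including the 0 used for padding, must be a valid row number
valid_rows = range(vocab_size)
print(vocab_size, largest_index)
print(0 in valid_rows and largest_index in valid_rows)
```

Without the extra row, the largest word index would fall outside the embedding table.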

We will use a 100-dimensional vector space, but you could try other values, such as 50 or 150. Finally, the maximum document length was calculated above in the max_length variable used during padding.

The complete model definition is listed below including the Embedding layer.

We use a Convolutional Neural Network (CNN) as they have proven to be successful at document classification problems. A conservative CNN configuration is used with 32 filters (parallel fields for processing words) and a kernel size of 8 with a rectified linear (‘relu’) activation function. This is followed by a pooling layer that reduces the output of the convolutional layer by half.

Next, the 2D output from the CNN part of the model is flattened to one long 1D vector to represent the ‘features’ extracted by the CNN. The back-end of the model is a standard Multilayer Perceptron that interprets the CNN features. The output layer uses a sigmoid activation function to output a value between 0 and 1 for the negative and positive sentiment in the review.

For more advice on effective deep learning model configuration for text classification, see the post:

Best Practices for Document Classification with Deep Learning

# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

Running just this piece provides a summary of the defined network.

We can see that the Embedding layer expects documents with a length of 442 words as input and encodes each word in the document as a 100 element vector.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 442, 100)          2576800
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 435, 32)           25632
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 217, 32)           0
_________________________________________________________________
flatten_1 (Flatten)          (None, 6944)              0
_________________________________________________________________
dense_1 (Dense)              (None, 10)                69450
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11
=================================================================
Total params: 2,671,893
Trainable params: 2,671,893
Non-trainable params: 0
_________________________________________________________________
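
The shapes and parameter counts in the summary can be reproduced by hand, which is a useful sanity check when changing the configuration. The sketch below uses only numbers stated in this tutorial; the vocabulary size of 25,768 is the 25,767 filtered words plus one, which matches the 2,576,800 embedding parameters divided by 100.

```python
# values taken from the text: 25,767 vocabulary words, 442-word inputs,
# 100-dimensional embedding, 32 filters, kernel size 8, pool size 2
vocab_size = 25767 + 1
max_length, embed_dim = 442, 100
filters, kernel_size, pool_size = 32, 8, 2

conv_len = max_length - kernel_size + 1   # convolution without padding
pool_len = conv_len // pool_size          # incomplete final window is dropped
flat_len = pool_len * filters

params = {
    'embedding': vocab_size * embed_dim,
    'conv1d': kernel_size * embed_dim * filters + filters,  # weights + biases
    'dense_1': flat_len * 10 + 10,
    'dense_2': 10 * 1 + 1,
}
print(conv_len, pool_len, flat_len)
print(params, sum(params.values()))
```

Almost all of the parameters sit in the embedding layer, which is one reason trimming the vocabulary matters.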

Next, we fit the network on the training data.

We use a binary cross entropy loss function because the problem we are learning is a binary classification problem. The efficient Adam implementation of stochastic gradient descent is used and we keep track of accuracy in addition to loss during training. The model is trained for 10 epochs, or 10 passes through the training data.

The network configuration and training schedule were found with a little trial and error, but are by no means optimal for this problem. If you can get better results with a different configuration, let me know.

# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
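To make the binary cross entropy loss mentioned above more concrete, the sketch below computes it by hand for a few made-up predictions. The function name, the toy labels, and the clipping constant eps are assumptions for illustration only; this is not the Keras implementation.

```python
from math import log

# mean binary cross entropy over a list of labels and predicted probabilities
def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # keep probabilities away from 0 and 1 so log() is always defined
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

# toy labels and predictions (made up for illustration)
labels = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.9, 0.8]
unsure = [0.5, 0.5, 0.5, 0.5]
print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, unsure))
```

Predictions close to the true labels give a loss near zero, whereas predicting 0.5 for every review gives a loss of log(2), about 0.693.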

After the model is fit, it is evaluated on the test dataset. This dataset contains words that we have not seen before and reviews not seen during training.

# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
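As a rough sketch of what the reported accuracy means, the sigmoid output for each review can be turned into a class label with a threshold and compared to the true label. The helper name, the 0.5 threshold, and the toy values are assumptions for illustration.

```python
# fraction of predictions that match the true labels after thresholding
def accuracy(y_true, y_prob, threshold=0.5):
    predicted = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, yhat in zip(y_true, predicted) if y == yhat)
    return correct / len(y_true)

# toy labels and sigmoid outputs (made up for illustration)
y_true = [0, 0, 1, 1]
y_prob = [0.2, 0.7, 0.9, 0.4]
print('Test Accuracy: %f' % (accuracy(y_true, y_prob) * 100))
```

Two of the four toy predictions fall on the correct side of the threshold, so this prints an accuracy of 50 percent.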

We can tie all of this together.

The complete code listing is provided below.

from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# files starting with 'cv9' are held out as the test set;
		# skip any file that does not belong to the requested split
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Running the example prints the loss and accuracy at the end of each training epoch.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
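One simple way to compare the average outcome is to collect the test accuracy from each run and report the mean and standard deviation. The sketch below assumes the scores have already been collected in a list. The helper name and the scores are placeholders, not results from this tutorial.

```python
from statistics import mean, stdev

# summarize the accuracy scores collected from repeated runs
def summarize_runs(scores):
    return mean(scores), stdev(scores)

# placeholder accuracies in percent (made up, not measured)
example_scores = [80.0, 82.0, 84.0]
avg, spread = summarize_runs(example_scores)
print('Accuracy: %.1f%% (+/- %.1f)' % (avg, spread))
```

Reporting the spread alongside the mean gives a sense of how much a single run can be trusted.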

We can see that the model very quickly achieves 100% accuracy on the training dataset. At the end of the run, the model achieves an accuracy of 84.5% on the test dataset, which is a great score.

...
Epoch 6/10
2s - loss: 0.0013 - acc: 1.0000
Epoch 7/10
2s - loss: 8.4573e-04 - acc: 1.0000
Epoch 8/10
2s - loss: 5.8323e-04 - acc: 1.0000
Epoch 9/10
2s - loss: 4.3155e-04 - acc: 1.0000
Epoch 10/10
2s - loss: 3.3083e-04 - acc: 1.0000
Test Accuracy: 84.500000

We have just seen an example of how we can learn a word embedding as part of fitting a neural network model.

Next, let’s look at how we can efficiently learn a standalone embedding that we could later use in our neural network.

4. Train word2vec Embedding

In this section, we will discover how to learn a standalone word embedding using an efficient algorithm called word2vec.

A downside of learning a word embedding as part of the network is that it can be very slow, especially for very large text datasets.

The word2vec algorithm is an approach to learning a word embedding from a text corpus in a standalone way. The benefit of the method is that it can produce high-quality word embeddings very efficiently, in terms of space and time complexity.

The first step is to prepare the documents for learning the embedding.

This involves the same data cleaning steps from the previous section, namely splitting documents by their white space, removing punctuation, and filtering out tokens not in the vocabulary.

The word2vec algorithm processes documents sentence by sentence. This means we will preserve the sentence-based structure during cleaning.

We start by loading the vocabulary, as before.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

Next, we define a function named doc_to_clean_lines() to clean a loaded document line by line and return a list of the cleaned lines.

# turn a doc into clean tokens
def doc_to_clean_lines(doc, vocab):
	clean_lines = list()
	lines = doc.splitlines()
	for line in lines:
		# split into tokens by white space
		tokens = line.split()
		# remove punctuation from each token
		table = str.maketrans('', '', punctuation)
		tokens = [w.translate(table) for w in tokens]
		# filter out tokens not in vocab
		tokens = [w for w in tokens if w in vocab]
		clean_lines.append(tokens)
	return clean_lines
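To see what the function returns, we can run it on a tiny made-up document. The sketch below repeats the function in a slightly more compact form so that it runs on its own. The toy vocabulary and the toy document are assumptions for illustration.

```python
from string import punctuation

# turn a doc into a list of cleaned lines, each a list of tokens
def doc_to_clean_lines(doc, vocab):
    clean_lines = list()
    # translation table that deletes all punctuation characters
    table = str.maketrans('', '', punctuation)
    for line in doc.splitlines():
        # split on white space and strip punctuation from each token
        tokens = [w.translate(table) for w in line.split()]
        # keep only tokens in the vocabulary
        tokens = [w for w in tokens if w in vocab]
        clean_lines.append(tokens)
    return clean_lines

# toy vocabulary and two-line document (made up for illustration)
toy_vocab = {'film', 'great', 'plot', 'weak'}
toy_doc = "the film was great !\nbut the plot , sadly , was weak ."
print(doc_to_clean_lines(toy_doc, toy_vocab))
```

Each line of the document becomes its own list of tokens, here [['film', 'great'], ['plot', 'weak']], which is the list-of-sentences structure used to train the embedding.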

Next, we adapt the process_docs() function to load and clean all of the documents in a folder and return a list of all document lines.

The results from this function will be the training data for the word2vec model.

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# files starting with 'cv9' are held out as the test set;
		# skip any file that does not belong to the requested split
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		doc = load_doc(path)
		doc_lines = doc_to_clean_lines(doc, vocab)
		# add lines to list
		lines += doc_lines
	return lines

We can then load all of the training data and convert it into a long list of ‘sentences’ (lists of tokens) ready for fitting the word2vec model.

# load training data
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs
print('Total training sentences: %d' % len(sentences))

We will use the word2vec implementation provided in the Gensim Python library, specifically the Word2Vec class.

For more on training a standalone word embedding with Gensim, see the post:

How to Develop Word Embeddings in Python with Gensim

The model is fit when constructing the class. We pass in the list of clean sentences from the training data, then specify four arguments: the size of the embedding vector space (we use 100 again); the number of neighboring words to look at when learning how to embed each word in the training sentences (we use 5 neighbors); the number of threads to use when fitting the model (we use 8, but change this if you have more or fewer CPU cores); and the minimum occurrence count for words to be included in the vocabulary (we set this to 1 as we have already prepared the vocabulary).

After the model is fit, we print the size of the learned vocabulary, which should match the size of our vocabulary in vocab.txt of 25,767 tokens.

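# train word2vec model
# note: the names below follow Gensim 3.x; Gensim 4.0 and later use vector_size
# instead of size, and model.wv.key_to_index instead of model.wv.vocab
model = Word2Vec(sentences, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))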
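To give an intuition for the window argument, the sketch below lists the neighboring words that fall inside the window for each word in a toy sentence. It only illustrates the idea of a context window and does not reproduce how Gensim trains the model internally. The function name and the toy sentence are assumptions.

```python
# list the words within `window` positions on either side of each target word
def context_windows(sentence, window=5):
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        # words before the target plus words after it, within the window
        context = sentence[start:i] + sentence[i + 1:i + 1 + window]
        pairs.append((target, context))
    return pairs

# toy sentence of already cleaned tokens (made up for illustration)
toy_sentence = ['film', 'great', 'plot', 'weak', 'ending']
for target, context in context_windows(toy_sentence, window=2):
    print(target, context)
```

With a window of 2, the word 'plot' in the middle of the sentence is paired with 'film', 'great', 'weak' and 'ending', while words at the edges of the sentence have fewer neighbors.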

Finally, we save the learned embedding vectors to file using the save_word2vec_format() function on the model's 'wv' (word vector) attribute. The embedding is saved in ASCII format with one word and vector per line.

The complete example is listed below.

from string import punctuation
from os import listdir
from gensim.models import Word2Vec

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def doc_to_clean_lines(doc, vocab):
	clean_lines = list()
	lines = doc.splitlines()
	for line in lines:
		# split into tokens by white space
		tokens = line.split()
		# remove punctuation from each token
		table = str.maketrans('', '', punctuation)
		tokens = [w.translate(table) for w in tokens]
		# filter out tokens not in vocab
		tokens = [w for w in tokens if w in vocab]
		clean_lines.append(tokens)
	return clean_lines

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# files starting with 'cv9' are held out as the test set;
		# skip any file that does not belong to the requested split
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		doc = load_doc(path)
		doc_lines = doc_to_clean_lines(doc, vocab)
		# add lines to list
		lines += doc_lines
	return lines

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load training data
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs
print('Total training sentences: %d' % len(sentences))

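# train word2vec model
# note: the names below follow Gensim 3.x; Gensim 4.0 and later use vector_size
# instead of size, and model.wv.key_to_index instead of model.wv.vocab
model = Word2Vec(sentences, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))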

# save model in ASCII (word2vec) format
filename = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(filename, binary=False)

Running the example loads 58,109 sentences from the training data and creates an embedding for a vocabulary of 25,767 words.

You should now have a file 'embedding_word2vec.txt' with the learned vectors in your current working directory.

Total training sentences: 58109
Vocabulary size: 25767

Next, let’s look at using these learned vectors in our model.

5. Use Pre-trained Embedding

In this section, we will use a pre-trained word embedding prepared on a very large text corpus.

We can use the pre-trained word embedding developed in the previous section and the CNN model developed in the section before that.

The first step is to load the word embedding as a dictionary of words to vectors. The word embedding was saved in so-called 'word2vec' format, which contains a header line. We will skip this header line when loading the embedding.

The function below named load_embedding() loads the embedding and returns a dictionary of words mapped to the vectors in NumPy format.

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory, skip first line
	file = open(filename,'r')
	lines = file.readlines()[1:]
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding
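To make the file layout concrete, the sketch below writes a tiny file in the same text format (a header line with the number of words and the vector size, then one word and its vector per line) and loads it back. It is a standard-library variant of load_embedding() that stores plain lists of floats rather than NumPy arrays. The toy words, values, and temporary file are assumptions for illustration.

```python
import os
import tempfile

# load a word2vec text file as a dict of word -> list of floats
def load_embedding(filename):
    with open(filename, 'r') as file:
        # skip the header line holding the word count and vector size
        lines = file.readlines()[1:]
    embedding = dict()
    for line in lines:
        parts = line.split()
        # first item is the word, the rest are the vector values
        embedding[parts[0]] = [float(x) for x in parts[1:]]
    return embedding

# toy file: 2 words, 3 dimensions (made up for illustration)
toy_text = "2 3\nfilm 0.1 0.2 0.3\nplot -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
    handle.write(toy_text)
    toy_path = handle.name

toy_embedding = load_embedding(toy_path)
os.remove(toy_path)
print(sorted(toy_embedding))
print(toy_embedding['film'])
```

The header line is not treated as a word, so the loaded dictionary contains only 'film' and 'plot'.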

Now that we have all of the vectors in memory, we can order them in such a way as to match the integer encoding prepared by the Keras Tokenizer.

Recall that we integer encode the review documents prior to passing them to the Embedding layer. The integer maps to the index of a specific vector in the embedding layer. Therefore, it is important that we lay the vectors out in the Embedding layer such that the encoded words map to the correct vector.

The get_weight_matrix() function below takes the loaded embedding and the tokenizer.word_index vocabulary as arguments and returns a matrix with the word vectors in the correct locations.

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		vector = embedding.get(word)
		# words without a learned vector keep their all-zero row
		if vector is not None:
			weight_matrix[i] = vector
	return weight_matrix
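The sketch below shows the row alignment on a toy example. It is a standard-library variant that uses nested lists instead of a NumPy matrix, takes the vector size as an argument, and leaves an all-zero row for any word that has no vector. The toy embedding and toy word index are assumptions for illustration.

```python
# build a weight matrix whose row i holds the vector for the word encoded as i
def get_weight_matrix(embedding, word_index, dim):
    # one extra row because index 0 is reserved and never assigned to a word
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embedding.get(word)
        # words without a learned vector keep their all-zero row
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# toy embedding and integer mapping (made up for illustration)
toy_embedding = {'film': [0.1, 0.2], 'plot': [0.3, 0.4]}
toy_word_index = {'plot': 1, 'film': 2, 'unseen': 3}
for row in get_weight_matrix(toy_embedding, toy_word_index, dim=2):
    print(row)
```

Row 1 holds the vector for 'plot' and row 2 the vector for 'film', matching their integer codes, while rows 0 and 3 stay at zero.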

Now we can use these functions to create our new Embedding layer for our model.

...
# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)

Note that the prepared weight matrix embedding_vectors is passed to the new Embedding layer as an argument and that we set the 'trainable' argument to 'False' to ensure that the network does not try to adapt the pre-learned vectors as part of training the network.

We can now add this layer to our model. We also have a slightly different model configuration with a lot more filters (128) in the CNN model and a kernel that matches the 5 words used as neighbors when developing the word2vec embedding. Finally, the back-end of the model was simplified.

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
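Because the Embedding layer is frozen, only the convolutional and output layers are trained. The sketch below works out the expected counts by hand. It assumes the same vocabulary size (25,768) and padded length (442) as the earlier model summary, and 'valid' padding with a stride of 1 in the Conv1D layer.

```python
# Hand-check of trainable and frozen parameter counts for the model above.
# vocab_size and max_length are assumed to match the earlier model summary.
vocab_size, embed_dim, max_length = 25768, 100, 442
filters, kernel_size, pool_size = 128, 5, 2

# the pre-trained embedding weights are not updated during training
frozen_params = vocab_size * embed_dim
# assuming 'valid' padding and stride 1
conv_steps = max_length - kernel_size + 1
# each filter has kernel_size * embed_dim weights plus one bias
conv_params = kernel_size * embed_dim * filters + filters
# pooling halves the sequence length before flattening
flat_size = (conv_steps // pool_size) * filters
# single sigmoid output: one weight per input plus one bias
dense_params = flat_size * 1 + 1
trainable_params = conv_params + dense_params
print(conv_steps, flat_size)
print(trainable_params, frozen_params)
```

Under these assumptions, 92,161 weights are trained while the 2,576,800 embedding weights stay fixed.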

These changes were found with a little trial and error.

The complete code listing is provided below.

from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# files starting with 'cv9' are held out as the test set;
		# skip any file that does not belong to the requested split
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory, skip first line
	file = open(filename,'r')
	lines = file.readlines()[1:]
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		vector = embedding.get(word)
		# words without a learned vector keep their all-zero row
		if vector is not None:
			weight_matrix[i] = vector
	return weight_matrix

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141


Running the example shows that performance did not improve.

In fact, performance was a lot worse. The model learned the training dataset successfully, but accuracy on the test dataset was very poor at just above 50%, which is little better than random guessing on this balanced dataset.

The poor test performance may be caused by the chosen word2vec configuration or by the chosen neural network configuration.

...
Epoch 6/10
2s - loss: 0.3306 - acc: 0.8778
Epoch 7/10
2s - loss: 0.2888 - acc: 0.8917
Epoch 8/10
2s - loss: 0.1878 - acc: 0.9439
Epoch 9/10
2s - loss: 0.1255 - acc: 0.9750
Epoch 10/10
2s - loss: 0.0812 - acc: 0.9928
Test Accuracy: 53.000000

Alternatively, the weights in the embedding layer can be used as a starting point for the network and adapted during training. We can do this by setting 'trainable=True' (the default) when creating the embedding layer.

Repeating the experiment with this change shows slightly better results, but they are still poor.

I would encourage you to explore alternate configurations of the embedding and network to see if you can do better. Let me know how you do.

...
Epoch 6/10
4s - loss: 0.0950 - acc: 0.9917
Epoch 7/10
4s - loss: 0.0355 - acc: 0.9983
Epoch 8/10
4s - loss: 0.0158 - acc: 1.0000
Epoch 9/10
4s - loss: 0.0080 - acc: 1.0000
Epoch 10/10
4s - loss: 0.0050 - acc: 1.0000
Test Accuracy: 57.500000

It is possible to use pre-trained word vectors prepared on very large corpora of text data.

For example, both Google and Stanford provide pre-trained word vectors that you can download, trained with the efficient word2vec and GloVe methods respectively.

Let’s try to use pre-trained vectors in our model.

You can download pre-trained GloVe vectors from the Stanford webpage. Specifically, vectors trained on Wikipedia data:

glove.6B.zip (822 Megabyte download)

Unzipping the file, you will find pre-trained embeddings for several different dimensions. We will load the 100-dimension version in the file 'glove.6B.100d.txt'.

The GloVe file does not contain a header line, so we do not need to skip the first line when loading the embedding into memory. The updated load_embedding() function is listed below.

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory (the GloVe file has no header line to skip)
	file = open(filename, 'r', encoding='utf-8')
	lines = file.readlines()
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding
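If you switch back and forth between the word2vec file saved earlier (which starts with a header line) and the GloVe file (which does not), one option is to detect the header instead of keeping two versions of the function. The sketch below is not from the tutorial: the name load_embedding_any() is made up for illustration, and it assumes that a header is a first line made of exactly two integers (vocabulary size and vector size). It also reads the file line by line as UTF-8, which may help with the large GloVe file.

```python
from numpy import asarray

# hypothetical helper (not part of the tutorial): load a text embedding file
# with or without a word2vec-style "count dimensions" header line
def load_embedding_any(filename):
    embedding = dict()
    with open(filename, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file):
            parts = line.split()
            # ignore blank lines
            if not parts:
                continue
            # assumption: a header is a first line of exactly two integers
            if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            # key is string word, value is numpy array for vector
            embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding
```

Under these assumptions, the same call could be used for both 'embedding_word2vec.txt' and 'glove.6B.100d.txt'.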

It is possible that the loaded embedding does not contain all of the words in our chosen vocabulary. As such, when creating the Embedding weight matrix, we need to skip words that do not have a corresponding vector in the loaded GloVe data. Below is the updated, more defensive version of the get_weight_matrix() function.

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		vector = embedding.get(word)
		if vector is not None:
			weight_matrix[i] = vector
	return weight_matrix
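Because every word without a pre-trained vector is left as a row of zeros, it can be worth checking how much of the vocabulary the loaded embedding actually covers. A minimal sketch, where embedding_coverage() is a hypothetical helper and the small dictionaries are toy stand-ins for raw_embedding and tokenizer.word_index:

```python
# hypothetical helper (not part of the tutorial): report the fraction of words
# in the Tokenizer's word index that have a vector in the loaded embedding
def embedding_coverage(embedding, word_index):
    missing = [word for word in word_index if word not in embedding]
    found = len(word_index) - len(missing)
    coverage = found / len(word_index) if word_index else 0.0
    return coverage, missing

# toy stand-ins for raw_embedding and tokenizer.word_index
toy_embedding = {'good': [0.1, 0.2], 'bad': [0.3, 0.4]}
toy_word_index = {'good': 1, 'bad': 2, 'meh': 3, 'film': 4}

coverage, missing = embedding_coverage(toy_embedding, toy_word_index)
print('Coverage: %.1f%%' % (coverage * 100))
print('Words without a vector:', missing)
```

With the real data, the call would be embedding_coverage(raw_embedding, tokenizer.word_index). Low coverage would suggest that the vocabulary and the pre-trained vectors are a poor match.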

We can now load the GloVe embedding and create the Embedding layer as before.

# load embedding from file
raw_embedding = load_embedding('glove.6B.100d.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)

We will use the same model as before.

The complete example is listed below.

from string import punctuation
from os import listdir
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import Conv1D
from keras.layers import MaxPooling1D

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# when loading training data, skip reviews held out for the test set (cv9*)
		if is_train and filename.startswith('cv9'):
			continue
		# when loading test data, skip reviews that belong to the training set
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# load embedding as a dict
def load_embedding(filename):
	# load embedding into memory (the GloVe file has no header line to skip)
	file = open(filename, 'r', encoding='utf-8')
	lines = file.readlines()
	file.close()
	# create a map of words to vectors
	embedding = dict()
	for line in lines:
		parts = line.split()
		# key is string word, value is numpy array for vector
		embedding[parts[0]] = asarray(parts[1:], dtype='float32')
	return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
	# total vocabulary size plus 0 for unknown words
	vocab_size = len(vocab) + 1
	# define weight matrix dimensions with all 0
	weight_matrix = zeros((vocab_size, 100))
	# step vocab, store vectors using the Tokenizer's integer mapping
	for word, i in vocab.items():
		vector = embedding.get(word)
		if vector is not None:
			weight_matrix[i] = vector
	return weight_matrix

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs

# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# load embedding from file
raw_embedding = load_embedding('glove.6B.100d.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Running the example shows better performance.

Again, the training dataset is easily learned, and the model achieves 76% accuracy on the test dataset. This is good, but not as good as using a learned Embedding layer.

The improvement over our own word2vec embedding may be because the GloVe vectors are of higher quality, having been trained on more data and/or with a slightly different training process.

...
Epoch 6/10
2s - loss: 0.0278 - acc: 1.0000
Epoch 7/10
2s - loss: 0.0174 - acc: 1.0000
Epoch 8/10
2s - loss: 0.0117 - acc: 1.0000
Epoch 9/10
2s - loss: 0.0086 - acc: 1.0000
Epoch 10/10
2s - loss: 0.0068 - acc: 1.0000
Test Accuracy: 76.000000

In this case, it seems that learning the embedding as part of the learning task may be a better direction than using a specifically trained embedding or a more general pre-trained embedding.

Summary

In this tutorial, you discovered how to develop word embeddings for the classification of movie reviews.

Specifically, you learned:

How to prepare movie review text data for classification with deep learning methods.
How to learn a word embedding as part of fitting a deep learning model.
How to learn a standalone word embedding and how to use a pre-trained embedding in a neural network model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Note: This post is an excerpt chapter from: “Deep Learning for Natural Language Processing”. Take a look if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.

258 Responses to Deep Convolutional Neural Network for Sentiment Analysis (Text Classification)

Alexander October 30, 2017 at 7:11 pm #

Thank you for the interesting work.
Jason, help me please.
Imagine that we have a dataset which contains reviews with very different lengths (from just two words, “good film”, to a long description, “I remember the first work of this director…”).
Which length should we choose when we talk about “pad_sequences”?

Reply
- Jason Brownlee October 31, 2017 at 5:32 am #
  
  All reviews must be padded to the same length.
  
  You can use the longest review, you can use the average length, etc. Try a few and see what works best.
  
  Reply
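One way to compare the padding lengths suggested in the reply above is to count how many reviews each choice would cut short. This is only a sketch: the token counts are toy values standing in for [len(s.split()) for s in train_docs], and the 95th percentile is an extra candidate that is not mentioned in the thread.

```python
from numpy import array, mean, percentile

# toy token counts standing in for [len(s.split()) for s in train_docs]
lengths = array([2, 5, 8, 12, 40, 300])

# candidate values for max_length
candidates = {
    'longest': int(lengths.max()),
    'mean': int(round(mean(lengths))),
    '95th percentile': int(percentile(lengths, 95)),
}

# reviews longer than max_length would be truncated by pad_sequences()
for name, max_length in candidates.items():
    truncated = int((lengths > max_length).sum())
    print('%s: max_length=%d, reviews truncated=%d' % (name, max_length, truncated))
```

A shorter max_length gives smaller inputs and faster training at the cost of discarding part of the longest reviews, so it is worth trying a few values and comparing test accuracy.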
JingChunzhen October 31, 2017 at 2:37 am #

Hi Jason
I think it’s difficult to distinguish “good” and “bad” using word2vec, but it’s important in sentiment analysis anyway.
Is there a good method to solve this? Thank you and sorry for my poor English.

Reply
- Jason Brownlee October 31, 2017 at 5:35 am #
  
  Generally, the model will minimize a loss function. We want the loss to be low.
  
  I agree, it feels like clustering to some degree.
  
  Reply
Alexander October 31, 2017 at 10:38 pm #

Thanks for the feedback.
Jason, sorry, I don’t understand one thing. Help me please.
I read your work on the “Word Embeddings” technique, and I saw that when you use an Embedding layer with a Convolutional layer as a feature extractor, you always use the 1D version.
Since the machine sees the sentence in the form of a matrix, why don’t we use Conv2D?

Reply
- Jason Brownlee November 1, 2017 at 5:47 am #
  
  Good question.
  
  A sequence of words is a one-dimensional vector of integers. The embedding maps each integer onto a real-valued vector, so now we have a sequence of vectors. It is still a one-dimensional sequence; it just so happens that each item in the sequence has many features.
  
  It is not like an image that has spatial relationships in two dimensions.
  
  Does that help?
  
  Reply
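The point made in the reply above can be illustrated with plain NumPy. The sketch below is a simplified stand-in for what a Conv1D layer computes (no bias, no activation, stride of 1, no padding), with toy sizes chosen only for illustration: each filter covers a few consecutive words and all embedding dimensions at once, and slides along the word axis only.

```python
from numpy import zeros
from numpy.random import rand

# toy sizes: 7 words per document, 4-dimensional word vectors
sequence_length, embedding_dim = 7, 4
kernel_size, n_filters = 3, 2

# output of an embedding layer for one document: one vector per word
embedded = rand(sequence_length, embedding_dim)
# each filter spans kernel_size words and ALL embedding dimensions
filters = rand(kernel_size, embedding_dim, n_filters)

# slide the filters along the word axis only
n_steps = sequence_length - kernel_size + 1
feature_maps = zeros((n_steps, n_filters))
for t in range(n_steps):
    # window of consecutive word vectors, shape (kernel_size, embedding_dim)
    window = embedded[t:t + kernel_size]
    for f in range(n_filters):
        feature_maps[t, f] = (window * filters[:, :, f]).sum()

print(embedded.shape)      # (7, 4)
print(feature_maps.shape)  # (5, 2)
```

A 2D convolution would also slide across the embedding dimensions, but neighbouring embedding dimensions have no spatial meaning, which is why the 1D version is the natural fit here.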
Alexander November 1, 2017 at 6:19 pm #

Thanks for the clear explanation, Jason.
I am trying to find the best way to use the knowledge we have in the Embedding layer.

Reply
- Jason Brownlee November 2, 2017 at 5:08 am #
  
  I’m glad it helped Alexander.
  
  Reply
Ramesh Subrahmanyam November 5, 2017 at 9:12 am #

Excellent post. Thanks.

Reply
- Jason Brownlee November 6, 2017 at 4:47 am #
  
  Thanks.
  
  Reply
Amit Adesara November 15, 2017 at 9:54 pm #

Thank you for the amazing posts. Eagerly waiting for your new book launch. I had one query on the test/train split. Rather than defining X and Y test/train sets manually, can’t we just use the train/test split function from sklearn? Doesn’t it automatically split the entire dataset into train and test? Thank you.

Reply
- Jason Brownlee November 16, 2017 at 10:30 am #
  
  Thanks Amit!
  
  Yes, you can use sklearn if you wish.
  
  Reply
Abdur Rehman Nadeem December 15, 2017 at 8:50 am #

Hi jason,

Sorry, once again I am asking the same question that I mentioned on another post: what if I am using my own dataset? How can I make my dataset compatible with the pre-built dataset above? I have a dataset of tweets, so how can I split it into positive and negative tweets and then use them as test and train sets?

Reply
- Jason Brownlee December 15, 2017 at 3:32 pm #
  
  The post above shows how to prepare data.
  
  Also, this post shows how to clean text:
  https://machinelearningmastery.com/clean-text-machine-learning-python/
  
  Reply
Vladimir January 23, 2018 at 1:14 am #

Thank you Jason for a valuable article! It is an exploration on its own.
I also get better results if the Embedding layer is learned as part of the model, rather than using predefined embeddings.

At first I was very excited about using GloVe embeddings, yet they give quite poor results. I think it is because real-world texts contain different word forms (for example: cat, cats), while GloVe has embeddings only for a single form of each word (for example: cat). Even a small deviation is considered a different word, and thus ignored. Probably if this “word-forms” problem is solved, GloVe could show tremendous results…

Reply
- Jason Brownlee January 23, 2018 at 8:04 am #
  
  You may be right in the general case. It is also critical that the vocab/word usage/etc is a good fit for the problem being solved.
  
  Reply
Michelle February 27, 2018 at 2:05 pm #

Hi,

I am following this example to classify malicious urls. I am able to filter out words, encode and pad my training and testing data. However, when I apply the training data to the model, I get errors.

Do you have the code up on github? I’d like to run through the code to debug.

Thanks!

Reply
- Jason Brownlee February 27, 2018 at 2:57 pm #
  
  Sorry, I cannot help you debug your code.
  
  Reply
A.S March 13, 2018 at 7:56 pm #

Hi, This is very helpful thanks! Is there a java version of this please?
Thanks

Reply
- Jason Brownlee March 14, 2018 at 6:17 am #
  
  Sorry, I do not have Java versions at this stage.
  
  Reply
pANTHER March 24, 2018 at 9:48 pm #

How does this function

“encoded_docs = tokenizer.texts_to_sequences(test_docs)”

index the test vocab when we didn’t call

“tokenizer.fit_on_texts(train_docs)”

on test_docs?

Reply
- Jason Brownlee March 25, 2018 at 6:29 am #
  
  It uses the vocab learned from the training data to index words in test. If there are new words, they are skipped, unless the Tokenizer was configured with an oov_token.
  
  Reply
  - pANTHER April 1, 2018 at 1:31 am #
    
    Thanks, really appreciate your help.
    
    Reply
pANTHER March 24, 2018 at 11:35 pm #

can we use CNN with multiclass problem?

Reply
- Jason Brownlee March 25, 2018 at 6:30 am #
  
  Yes.
  
  Reply
  - Harini V January 26, 2021 at 11:24 pm #
    
    Why use a CNN when there are lots of other deep learning algorithms?
    
    Reply
    - Jason Brownlee January 27, 2021 at 6:07 am #
      
      CNN is effective for some problems, e.g. works very well for text classification tasks compared to other methods (on average).
      
      Reply
Jin April 23, 2018 at 3:32 am #

Hi Jason, thanks for your sharing, For word embedding layer in keras, what is the window size? For example, in word2vec, we can set the window size to 5, but in keras embedding layer, there is no parameter for that.

Reply
- Jason Brownlee April 23, 2018 at 6:23 am #
  
  The window size is the number of words around each word to consider, e.g. the context, when building the model.
  
  Keras does not need this, as the embedding is built via back-propagating error during training, ideally using BPTT which builds up error over input time steps.
  
  From my experiments, I find the Keras approach results in better skill than using a prebuilt model. It should be the other way around according to intuition, but not in my experience.
  
  Reply
  - Jin Zhou April 23, 2018 at 10:45 am #
    
    I read your blog about building a language model with LSTM, and I tried working with pre-trained embedding weights, which actually resulted in lower perplexity but a slightly lower BLEU. So, it is hard for me to say which one is better.
    
    Thanks for your blogs! I learned a lot from them.
    
    Reply
    - Jason Brownlee April 23, 2018 at 2:53 pm #
      
      Nice work!
      
      Reply
Anam May 8, 2018 at 4:15 pm #

Dear Jason,
In the above post, you used a CNN model after the data preparation and embedding layer training tasks. Could you kindly help me with how I can perform
(i) Data Preparation
(ii) Train Embedding Layer
for use in LSTM Recurrent Neural Networks instead of a CNN? Or how can I use LSTM Recurrent Neural Networks instead of a CNN after these two tasks?
Thanks for your time!

Reply
- Jason Brownlee May 9, 2018 at 6:14 am #
  
  Here is an example:
  https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
  
  Reply
  - Anam May 9, 2018 at 1:14 pm #
    
    Thanks Sir, but I have an issue: in the provided link, training of the embedding layer is not performed. Could you kindly guide me on how to train an embedding layer for an LSTM RNN model?
    
    Reply
    - Jason Brownlee May 9, 2018 at 2:57 pm #
      
      Sorry, I cannot prepare a specific case for you.
      
      You can use a code example that shows how to use an embedding on the front end and combine it with the tutorial.
      
      Reply
Anam May 9, 2018 at 2:59 pm #

Thanks Sir for your guidance.

Reply
- Jason Brownlee May 10, 2018 at 6:26 am #
  
  You’re welcome.
  
  Reply
Maryam June 13, 2018 at 8:04 am #

Hi Jason,
I am so grateful for the practical tutorial. I am a beginner at Keras and deep learning. I applied it to a dataset about 2 diseases. I want to train an RNN with GloVe for classification, but the output result is as follows:
precision 0.0
FPR 0.9999999999807433
TPR 0.0
FNR 0.999999999980609
specificity 0.0
accuracy 0.0
F_score 0.0

I think this is just because GloVe was not pre-trained on a disease dataset?
Am I right? Or does it need to train for many epochs?
Thank you for guiding me.
Best
Maram

Reply
- Jason Brownlee June 13, 2018 at 3:04 pm #
  
  There are many possible reasons. Perhaps this framework will help you debug your model:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
Sarah June 21, 2018 at 9:51 am #

Hi Jason,
That was awesome, like the other tutorials. I do not have separate train and test datasets, as I want to select the test dataset randomly each time. I have already done that with this statement:
‘from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_datasetpad,y_datasetpad,stratify=y_datasetpad,test_size=0.25)’
According to your tutorial, you set ytest in this way:
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

I have already created x_dataset, but I do not know how I can create ydataset.
Please guide me Jason, I really need it.

Sarah

Reply
- Jason Brownlee June 21, 2018 at 4:50 pm #
  
  The output will be the sentiment, or an integer representing the sentiment classes for each input review.
  
  Reply
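As a sketch of the reply above, the label array can be built from the document lists themselves, so that documents and labels stay aligned before any random split. The toy lists below are stand-ins for the cleaned reviews, and the 0 = negative, 1 = positive encoding follows the tutorial.

```python
from numpy import array

# toy stand-ins for the cleaned negative and positive reviews
negative_docs = ['bad film', 'boring plot', 'weak acting']
positive_docs = ['good film', 'great cast']

# keep documents and labels in the same order: 0 = negative, 1 = positive
docs = negative_docs + positive_docs
labels = array([0] * len(negative_docs) + [1] * len(positive_docs))

print(len(docs), labels)
```

Deriving the counts from the lists avoids hard-coding values such as 900 or 100. After encoding and padding docs, the padded array and labels could then be passed together to a splitting function such as the train_test_split call quoted in the question.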
Hegg June 21, 2018 at 11:59 pm #

Hello Jason,

I was following your awesome tutorial but in the end I was asking myself: How can I predict a new sentence represented as a string for its sentiment? I think there will be some more people who would like to test their trained network for some example inputs. Could you maybe add such a section to this tutorial or give a short explanation on how to do this exactly? In such a way:

exampleSentence = ‘The weather is very good today’
prediction = trainedModel.predict(exampleSentence)

print(sentimentForPrediction(prediction))

Kind regards

Hegg

Reply
- Jason Brownlee June 22, 2018 at 6:10 am #
  
  Yes, you can use:
  
  X = ...
  yhat = model.predict(X)
  
  Reply
  - Srijan Verma July 18, 2018 at 12:35 am #
    
    Hi Jason,
    I tried implementing one of your codes for predicting the sentiment for a sentence.
    First i trained the model the way you have. I am getting an accuracy of 86% on test data.
    This is the function that i used for predicting on a new sentence.
    
    # classify a review as negative (0) or positive (1)
    def predict_sentiment(review, vocab, tokenizer, model):
    # clean
    tokens = clean_doc(review)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    # convert to line
    line = ‘ ‘.join(tokens)
    #print(line)
    # encode
    tokenizer.fit_on_texts(line)
    encoded = tokenizer.texts_to_sequences(line)
    #print(encoded)
    # prediction
    max_length = max([len(s.split()) for s in line])
    pred = pad_sequences(encoded, maxlen=1317, padding=’post’)
    #print(pred)
    yhat = model.predict(pred, verbose=0)
    return round(yhat[0,0])
    
    # test positive text
    text = ‘this is a good movie’
    print(predict_sentiment(text, vocab, tokenizer, model))
    # test negative text
    text = ‘This is a bad movie.’
    print(predict_sentiment(text, vocab, tokenizer, model))
    
    somehow, I am getting a ‘0’ in both the cases. I tried using other test cases as well, but it is always giving a zero.
    
    I cant find where I am going wrong.
    
    Any insights would be really helpful.
    
    Thanks!
    
    Reply
    - Jason Brownlee July 18, 2018 at 6:37 am #
      
      Sorry, I don’t have the capacity to debug your code. Perhaps try stackoverflow?
      
      Reply
      - Srijan July 18, 2018 at 7:08 pm #
        
        Let us say I have raw text for which I want a prediction from model.predict().
        The first step would be to assign each word in the raw text the same index value that the model was trained on.
        Then we would have to convert it to sequences (using texts_to_sequences()).
        Then comes the padding (max_length would be the same as that of the training set).
        And then this is given to model.predict().
        
        Is that correct?
      - Jason Brownlee July 19, 2018 at 7:48 am #
        
        Yes. All preparation performed on the training data must also be performed on any new data when you want to make a prediction with the trained model.
      - Nanda Kishor M pai April 11, 2020 at 4:13 pm #
        
        I tried to predict the same way as mentioned above, and irrespective of the review I am getting a value around 0.5100. What should I do?
      - Jason Brownlee April 12, 2020 at 6:14 am #
        
        Perhaps try re-fitting the model from scratch?
        Perhaps try alternate configurations of the model or training?
        Perhaps try different data when making predictions?
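
The preparation steps confirmed in the thread above (reuse the training word index, convert to a sequence, pad to the training length) can be sketched without Keras. This is only an illustration under stated assumptions: `fit_word_index`, `encode`, and `pad_post` are hypothetical stand-ins for the Keras `Tokenizer` and `pad_sequences`, and the three training sentences are made up. It also shows what can go wrong when a bare string is passed where a list of documents is expected, which is one possible reason for constant predictions.

```python
from collections import Counter

def fit_word_index(train_texts):
    """Map each training word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in train_texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def encode(text, word_index):
    """Encode one document; words unseen during training are dropped."""
    return [word_index[w] for w in text.split() if w in word_index]

def pad_post(seq, max_length):
    """Pad with zeros at the end (or truncate) to a fixed length."""
    return (seq + [0] * max_length)[:max_length]

# Hypothetical training data; the index is fitted ONCE, on training data only.
train_texts = ["good movie great acting", "bad movie poor plot", "great plot good acting"]
word_index = fit_word_index(train_texts)
max_length = max(len(t.split()) for t in train_texts)

new_review = "this is a good movie"
encoded = encode(new_review, word_index)   # reuse the training index, do not re-fit
row = pad_post(encoded, max_length)
batch = [row]                              # one row per document goes to the model

# Iterating over a bare string yields characters, so every "document" is a single letter.
as_characters = [encode(ch, word_index) for ch in new_review]
```

Here `batch` holds a single padded row for the whole review, whereas `as_characters` holds one empty sequence per character, which would pad to all zeros.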
Sara June 26, 2018 at 12:25 am #

Hi Jason,

Thank you so much for your great post. I always enjoy your work and it helps me a lot.

My question is how I can handle other features together with the text feature in my classification. For example, if I have other features for each review, such as the number of words, the amount of punctuation, a category, etc., how can I use them in my classification?

Best

Reply
- Jason Brownlee June 26, 2018 at 6:39 am #
  
  Great question, perhaps a two-headed model would be appropriate. One head for text data and one with a vector summarizing the document.
  
  This tutorial will help:
  https://machinelearningmastery.com/keras-functional-api-deep-learning/
  
  Reply
Anjali Batra July 31, 2018 at 2:18 am #

Hi Jason, awesome article. Can you please tell me which Python version I can use to run the code?
As far as I can tell, nltk won’t work on Python 2.7 and Keras is not available for Python 3.6.

Looking forward to your response.

Thanks,
Anjali Batra

Reply
- Jason Brownlee July 31, 2018 at 6:10 am #
  
  Python 3.5 or 3.6. I believe Python 2.7 would work too, perhaps with minor tweaks.
  
  Reply
Shreyas Jeet July 31, 2018 at 7:28 pm #

Hi Jason,
Great tutorial, helped a lot.
I am working on a multi-class classification problem. Could you help me with how I would have to tweak the network to make it an 8-class classifier?

Thank You

Reply
- Jason Brownlee August 1, 2018 at 7:42 am #
  
  Change the number of nodes in the output layer to 8, the transfer function to softmax and the loss function to categorical cross entropy.
  
  Reply
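
To make the three changes in the reply above concrete, here is a small NumPy sketch of what an 8-node softmax output and a categorical cross-entropy loss compute. It is not the tutorial's Keras model: the helper names, the three example labels, and the random scores standing in for the final layer's output are all assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Integer class labels -> one row of 0/1 targets per example."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + 1e-12), axis=1)))

num_classes = 8
labels = np.array([0, 3, 7])                 # hypothetical class ids in 0..7
y = one_hot(labels, num_classes)             # targets must be one-hot encoded

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, num_classes))   # stand-in for the 8 output nodes
probs = softmax(logits)
loss = categorical_crossentropy(y, probs)
predicted = probs.argmax(axis=1)             # most probable class per example
```

The point to notice is that the targets change shape too: each label becomes a vector of 8 values rather than a single 0 or 1.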
Alberto August 27, 2018 at 1:15 am #

Hi Jason,
I have a question. To develop the vocab, do I have to use both the training and testing reviews?
Thanks a lot.

Reply
- Jason Brownlee August 27, 2018 at 6:11 am #
  
  Ideally, you would just use training data to develop your vocab when evaluating the model.
  
  You would use all data when preparing the final model.
  
  It really comes down to your project goals.
  
  Reply
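
As a rough illustration of the first option in the reply above, the sketch below builds the vocabulary from training documents only and then applies it to test documents. The function names, the `min_occurrence=2` cut-off, and the example sentences are assumptions made for this sketch.

```python
from collections import Counter

def build_vocab(docs, min_occurrence=2):
    """Keep words that appear at least min_occurrence times in the given docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, c in counts.items() if c >= min_occurrence}

def filter_doc(doc, vocab):
    """Drop every token that is not in the vocabulary."""
    return " ".join(w for w in doc.split() if w in vocab)

# Hypothetical data: the vocab sees only the training split.
train_docs = ["a great film with a great cast", "a dull film with a weak plot"]
test_docs = ["a stunning film with a great plot"]

vocab = build_vocab(train_docs)
clean_test = [filter_doc(d, vocab) for d in test_docs]
```

Words that occur only in the test split, such as "stunning" here, are simply removed, which mirrors what the model will face with genuinely new reviews.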
Arghyadeep Giri September 20, 2018 at 1:38 pm #

Dear Jason,

Absolutely loved the tutorial. Nevertheless, I had a question regarding the convolution. You already answered a question regarding that, but I still have a doubt. You mentioned that the output of the embedding layer is a one-dimensional sequence.
But the output shape of the embedding layer is (None, 1317, 100). I know that “None” stands for the batch size. But how is convolution possible on a (1317, 100) input with a one-dimensional filter of size 8? I want to understand this in depth. How are these values multiplied? Since the output shape of the convolutional layer is (1310, 32), the only way I can make sense of it is that every filter has 100 values that are multiplied with the vector of each word in the sequence. Kindly correct me if I am wrong.

Regards,
Arghyadeep

Reply
- Jason Brownlee September 20, 2018 at 2:28 pm #
  
  No, the input is a sequence of integers. The output is a sequence of vectors.
  
  Reply
  - Arghyadeep Giri September 20, 2018 at 3:04 pm #
    
    Yes, absolutely! But I believe that is for the embedding layer.
    
    But my question is about the convolution layer. How is the input shape (None, 1317, 100) getting converted to (None, 1310, 32)?
    
    The 32 comes from the number of filters.
    1317 becomes 1310 because 1317 - 8 + 1 = 1310, where 8 is the kernel size.
    What about the number 100?
    
    Reply
    - Jason Brownlee September 21, 2018 at 6:22 am #
      
      The 100 is the length of each word vector.
      
      Reply
      - Arghyadeep Giri September 21, 2018 at 10:05 am #
        
        Thanks 🙂
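
The shape question in this thread can be checked with a small NumPy sketch of a 1D convolution with no padding and a stride of 1. It is not the Keras implementation, and the random input and weights are placeholders; the ReLU activation is left out because it does not change shapes. Each filter is assumed to cover its 8-word window across all 100 embedding dimensions, which is why the 100 does not appear in the output.

```python
import numpy as np

seq_len, embed_dim = 1317, 100      # output of the embedding layer for one review
kernel_size, n_filters = 8, 32      # Conv1D(filters=32, kernel_size=8)

rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, embed_dim))                    # placeholder word vectors
kernels = rng.normal(size=(kernel_size, embed_dim, n_filters))
bias = np.zeros(n_filters)

out_len = seq_len - kernel_size + 1                          # 1317 - 8 + 1
out = np.empty((out_len, n_filters))
for t in range(out_len):
    window = x[t:t + kernel_size]                            # shape (8, 100)
    # Each filter multiplies all 8 x 100 values in the window and sums them to one number.
    out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias

n_params = kernels.size + bias.size                          # 8 * 100 * 32 + 32
```

So every output position holds 32 numbers, one per filter, and each of those numbers is a sum over 800 products.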
Ikib Kilam November 16, 2018 at 6:28 pm #

Jason,

In the ‘Training Embedding Layer’ code, why did you add a 1 to the size of vocabulary, before passing the size of the vocabulary to the Embedding? You say, ‘….plus one for unknown words…’ – but by definition there are no unknown words in the training data. When I run your code without adding the one, it croaks at the point when it is fitting the training data. It runs fine when I add the 1 to the vocabulary. I am not certain why, and hoping you can elaborate.

Reply
- Jason Brownlee November 17, 2018 at 5:44 am #
  
  There can be unknown words when using the model, e.g. test data or new data.
  
  Reply
  - ikib November 17, 2018 at 9:46 am #
    
    Thanks Jason. If I understood your response correctly…Unknown words in the test data or any new data, are not ascribed an integer encoding, because we are using the same tokenizer that was built for the training data, when working with these new data sets as well. So unknown words will not show up when we run these new datasets. I am still unsure why we need to add a 1, since again by definition the training data has no unknown words. I am missing something, esp. since when 1 is not added the code croaks at the fitting of training data stage, and well before we work with test/other datasets. One difference from your code is that I am using a MLP instead of CNN to train. I shall attempt to dig into this more, however if you have additional insight, please do reply.
    
    Reply
    - Jason Brownlee November 18, 2018 at 6:35 am #
      
      Yes. We add one because known words start at 1 not 0, e.g. 1-offset.
      
      Reply
Ikib Kilam November 18, 2018 at 2:00 pm #

Thank you. Of course. I should have seen the offset.

Reply
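
The 1-offset from this thread can be reproduced with a plain NumPy lookup table. This is a simplified stand-in for the embedding layer's weight matrix, with a made-up three-word index: word indexes start at 1 and 0 is used for padding, so the largest index equals the number of words and the table needs one extra row.

```python
import numpy as np

word_index = {"good": 1, "movie": 2, "bad": 3}   # hypothetical index, starts at 1
n_words = len(word_index)
encoded = [1, 2, 3, 0, 0]                        # a review padded with zeros

# A table with only n_words rows has valid row numbers 0..2, so index 3 fails.
too_small = np.zeros((n_words, 4))
try:
    too_small[encoded]
    failed = False
except IndexError:
    failed = True

# With n_words + 1 rows, rows 0..3 exist and the lookup works.
right_size = np.zeros((n_words + 1, 4))
vectors = right_size[encoded]
```

This would also explain why the error shows up while fitting on training data: the most rarely used training word already has the out-of-range index.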
Johnny November 23, 2018 at 2:50 pm #

Hi Jason,

Thanks for sharing. I have trained a CNN model with pre-trained GloVe vectors to do text classification on resume data. I have the result in vector form and I wonder how we could convert it back to text so that we can interpret its actual meaning. Or are there better ways to do it?

Thanks,

Reply
- Jason Brownlee November 24, 2018 at 6:28 am #
  
  I’m not sure I follow. You can plot words from the embedding, but it won’t tell you much about the resumes.
  
  Reply
chakib arsalan November 27, 2018 at 2:05 am #

Hello, I have two questions:
– You explain (3. Train Embedding Layer) and (4. Train word2vec Embedding).
Must we choose one of those methods, or does one complement the other?

– Also, if I want to give a new sentence and classify it (positive or negative), how can I do that?

Reply
- Jason Brownlee November 27, 2018 at 6:36 am #
  
  You can use a final model to make a prediction as follows:
  
  yhat = model.predict(X)
  
  Reply
  - chakib arsalan November 29, 2018 at 10:17 am #
    
    When I run a prediction for a new sentence, for example:
    
    phrase = "very bad feeling"
    tokens = tokenizer.texts_to_sequences([phrase])
    model.predict(np.array(tokens))
    
    it gives me:
    array([[0.99999934]], dtype=float32)
    
    How can I know whether this means positive or negative sentiment?
    
    Reply
    - Jason Brownlee November 29, 2018 at 2:39 pm #
      
      You can use predict_classes() to get the 0/1 value or you can interpret the probability directly.
      
      I think values close to 0 are negative and values close to 1 are positive.
      
      Reply
      - chakib arsalan December 11, 2018 at 12:17 am #
        
        Is word2vec better than a simple Embedding layer?
      - Jason Brownlee December 11, 2018 at 7:45 am #
        
        They are different, perhaps try both and see what works best for your specific dataset.
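
Interpreting the probability directly, as suggested earlier in this thread, could look like the sketch below. It assumes the convention used in the comments here (0 = negative, 1 = positive) and a 0.5 threshold; `to_label` and `describe` are hypothetical helper names.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class label: 0 = negative, 1 = positive."""
    return 1 if probability >= threshold else 0

def describe(probability):
    """Return the class name and how strongly the model leans towards it."""
    label = to_label(probability)
    name = "positive" if label == 1 else "negative"
    confidence = probability if label == 1 else 1.0 - probability
    return name, round(confidence, 4)

# The value reported above, in the nested shape that model.predict() returns.
yhat = [[0.99999934]]
result = describe(yhat[0][0])
```

A value this close to 1 maps to the positive class under these assumptions, so for a phrase like "very bad feeling" it is worth checking that the input was prepared the same way as the training data.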
Avir December 30, 2018 at 8:16 pm #

Hi,
Really good tutorial, but I need some clarifications.
In this case you use a word2vec model with the rnn model. That rnn model is used to predict the sentiment of the text. What is the exact purpose of the word2vec model that you created? Why are these word embeddings helpful for this analysis?

Reply
- Jason Brownlee December 31, 2018 at 6:10 am #
  
  It creates a distributed representation of words, more here:
  https://machinelearningmastery.com/what-are-word-embeddings/
  
  Reply
Sam Donaldson January 18, 2019 at 9:39 am #

Hi Jason, great writeup. I’m trying to get an intuitive understanding of how these higher dimensional representations effectively cluster. It seems slightly magical at the moment. Is it because of the constant training to minimize the loss? And would backprop work downwards toward the embedding layer and modify the weights in those word vectors accordingly? If this is correct, I’m still trying to understand how it leads to clustering.

Reply
- Jason Brownlee January 18, 2019 at 10:17 am #
  
  The result is that similar inputs map to similar vectors or points in the n-dimensional space.
  
  Training pushes similar things together as an efficiency: it costs the loss function less to group the vectors.
  
  It is amazing. Truly!
  
  Reply
  - Sam Donaldson January 19, 2019 at 7:01 am #
    
    Super helpful. Last few questions just to get a deeper understanding of what’s going on.
    
    Question 1: In order for this magic to really work, the training examples must contain data related to one another? And some of this data needs to be shared across other training examples? I can envision a graph of data where similar data are already connected to each other and the algorithm is telling us how they’re clustered and their strengths by filling in the vector values by fitting a function. Am I on the right track?
    
    Question 2: At a higher level, should we think of using embeddings when we want to find similarities or create clusters / groupings around data? I’m trying to understand when best to think of converting categorical fields to embeddings. For example, how would you treat dates or categories in structured data using deep learning?
    
    Thanks Jason.
    
    Reply
    - Jason Brownlee January 19, 2019 at 8:16 am #
      
      Yes, generally richer training data is desired, more for the model to figure out. This is hard, so we solve it with volume, thousands or millions of examples.
      
      Yes, when you want the model to learn how the categories should be grouped/best-grouped in the context of the problem that you’re solving. Days, months, stores, etc. These are really good examples where embeddings can help. It’s a new field, and few papers other than random kaggle write-ups. I hope to go deep on this later in the year.
      
      Reply
      - Sam Donaldson January 25, 2019 at 6:21 am #
        
        Thanks Jason. Few more queries as I study more.
        
        — The examples I’ve seen thus far to create embeddings have been in two forms:
        
        1) A collaborative filtering model (Netflix etc.) that takes two embeddings, takes their dot product, and passes the result through some layers to produce an output. The output is then compared to the label, and backprop works to fit the output by adjusting the embedding weights.
        
        2) In this case, the task is to compute say some sales projection but the model takes in as input some embeddings that represent some categorical variables, like day of month or year etc..
        
        I need some help understanding how meaningful weights are learnt in 2) for the embeddings of these categorical variables. I ask because the loss function in 2) is directly related to the sales projection, as opposed to say 1) in which the objective is to get the dot product of the embeddings as close as possible to the label.
        
        Can you shed some light on 2) and how these embeddings of categorical variables get learnt in more structured neural net problems? How do the weights become meaningful?
      - Jason Brownlee January 25, 2019 at 8:49 am #
        
        It learns how to best group the days or months based on the loss for the specific task. There’s not a lot to it.
        
        Perhaps prototype an example and see how it goes?
Ratha January 21, 2019 at 10:37 pm #

Is it possible to use text in Indian languages to create a word2vec model?

Reply
- Jason Brownlee January 22, 2019 at 6:22 am #
  
  Sure.
  
  Reply
  - Ratha January 22, 2019 at 2:42 pm #
    
    Thank you. Could you please share the code to create word2vec for other languages? What would I need to change in the Word2Vec function?
    
    Reply
    - Jason Brownlee January 23, 2019 at 8:43 am #
      
      It is the same code, just new training data.
      
      Perhaps start here:
      https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
      
      Reply
Radifan January 27, 2019 at 6:47 pm #

Hello Mr. Jason

Your article is very useful for my studies, so I would like to ask a question:

In the section that uses a CNN with a Conv1D layer after the embedding layer, what is the result of the Conv1D layer? I mean, is the result of this layer a value, a pattern, a sentiment, or something else?

Reply
- Jason Brownlee January 28, 2019 at 7:13 am #
  
  It will be a convolution of the distributed representation of the input.
  
  Perhaps I don’t fully understand your question?
  
  Reply
  - Radifan January 29, 2019 at 3:18 pm #
    
    I mean, does the convolution of the distributed representation of the input represent a sentiment feature of the input tweet, or something else?
    
    Reply
    - Radifan January 29, 2019 at 4:23 pm #
      
      Oh sorry, in my question I meant the input text review, not an input tweet.
      
      Reply
    - Jason Brownlee January 30, 2019 at 8:05 am #
      
      Yes, the aggregate or concat of the distributed representation of each word would together be the “text”, on which the conv is applied. This does not really mean anything though.
      
      Reply
congmin January 28, 2019 at 7:04 pm #

Hi Jason, is the IMDB data set used in this tutorial different from the IMDB data set from Stanford University?

Reply
- Jason Brownlee January 29, 2019 at 6:09 am #
  
  I believe it is the same data.
  
  Reply
congmin January 29, 2019 at 6:41 pm #

But Stanford data has 50000 reviews, and this dataset has only 2000.

Reply
- Jason Brownlee January 30, 2019 at 8:06 am #
  
  Perhaps it is a subset, you can learn more here:
  https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification
  
  Reply
  - congmin January 30, 2019 at 1:49 pm #
    
    “Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer “3” encodes the 3rd most frequent word in the data.” Regarding this, I am wondering how words with the same frequency are encoded, because in practice many words in a large corpus will occur the same number of times. But an array index doesn’t allow duplicates. Any idea how this is dealt with in Keras?
    
    Reply
    - Jason Brownlee January 30, 2019 at 2:43 pm #
      
      They are still rank ordered and assigned unique integers.
      
      Reply
      - congmin January 30, 2019 at 5:18 pm #
        
        @Jason, for instance, if two words ‘dog’ and ‘cat’ both occur 3 times, they will each get a different index? But the doc says “word are indexed by overall frequency”. How can the same freq count ‘3’ index two different words? Still puzzled…
      - Jason Brownlee January 31, 2019 at 5:27 am #
        
        No, all instances of dog would have the same index in the vocab.
congmin January 29, 2019 at 6:43 pm #

With the Glove embedding, I got 86% accuracy with slightly complex model and smaller batch size=8

# define model
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(12, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, batch_size=8, verbose=2)

Reply
- Jason Brownlee January 30, 2019 at 8:06 am #
  
  Nice work!
  
  Reply
- Congmin January 31, 2019 at 6:36 am #
  
  “No, all instances of dog would have the same index in the vocab.” What about dog’s index? it also occurs 3 times.
  
  Reply
  - Jason Brownlee January 31, 2019 at 2:13 pm #
    
    I don’t follow, can you elaborate on your question?
    
    Reply
    - congmin January 31, 2019 at 3:01 pm #
      
      Here: if two words ‘dog’ and ‘cat’ both occur 3 times, they will each get a different index? But the doc says “word are indexed by overall frequency”. My question is, how are two different words with a same frequency indexed?
      
      Reply
      - Jason Brownlee February 1, 2019 at 5:31 am #
        
        Sorry for the confusion.
        
        The frequency is calculated for each word, then words are rank ordered and assigned an index based on their rank ordering.
        
        dog = 1
        cat = 2
        …
    - congmin February 1, 2019 at 2:10 pm #
      
      dog = 1
      cat = 2
      
      If both words have the same frequency, e.g. 100, how are they ranked?
      
      Reply
      - Jason Brownlee February 2, 2019 at 6:06 am #
        
        One before the other, but the specific order chosen by the sorting algorithm does not matter. Just like when sorting any list of things and two items have the same value. Some order must be chosen in the end and it’s usually whichever the sort algorithm encountered first.
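
The tie-breaking described in the last reply can be seen in a few lines of Python. This sketch makes no claim about how Keras orders ties internally; it only shows that a stable sort over word counts gives every word a unique rank, with tied words kept in first-encountered order. `rank_index` and the token list are made up for illustration.

```python
from collections import Counter

def rank_index(tokens):
    """Assign 1-based indexes by descending frequency; ties keep first-seen order."""
    counts = Counter(tokens)  # remembers the order in which words were first seen
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# 'dog' and 'cat' both occur 3 times; 'dog' is seen first.
tokens = ["the"] * 5 + ["dog", "cat"] * 3 + ["bird"]
index = rank_index(tokens)
```

The index is a rank, not a count, so two words with the same count still end up with different integers.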
congmin January 30, 2019 at 5:21 pm #

“ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])”, @Jason, is it ‘0’ for positive and ‘1’ for negative? Usually it’s the other way. Just curious

Reply
- Jason Brownlee January 31, 2019 at 5:29 am #
  
  Yes in this case.
  
  Reply
Ratha February 18, 2019 at 4:13 pm #

How can I use word2vec and the number of adjectives as features in a CNN model? How can I add the number of adjectives as a layer?

Reply
- Jason Brownlee February 19, 2019 at 7:20 am #
  
  Perhaps it would be a separate input that is merged with the word2vec output?
  
  Reply
Ratha February 18, 2019 at 4:37 pm #

Hi Jason,
Thanks for this wonderful tutorial.
I tried your word2vec model for my dataset.
I have a doubt in this line:

# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)

Instead of weights=[embedding_vectors] I used a fasttext model. I got the same accuracy for both. Why is that? Obviously something went wrong. I used the same code as yours.

Moreover, how can I use the number of adjectives as a feature in addition to word2vec in the CNN?

Reply
- Jason Brownlee February 19, 2019 at 7:22 am #
  
  Not sure why you’re getting the same results.
  
  You can provide static variables as a separate input to the model, this tutorial will show you how to develop a multiple input model:
  https://machinelearningmastery.com/keras-functional-api-deep-learning/
  
  Reply
Radifan February 20, 2019 at 11:04 pm #

Hi Mr. Jason
Thanks for this great explanation.

I’m currently working on a sentiment analysis project that extracts positive and negative sentiment from tweets on Twitter.
I tried to use the tutorial above as a guide for my project, i.e. using a word embedding and a convolutional neural network for tweet sentiment analysis.
But when I apply this to several datasets, the accuracy is low, i.e. about 66.6%, 67.6%, and 69%.
My question is: what should I do to improve accuracy for my project? What changes must be made?

Reply
- Jason Brownlee February 21, 2019 at 8:11 am #
  
  I have some suggestions here:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
Steve March 3, 2019 at 2:12 pm #

Hi Mr. Jason,

Thanks! This is a great tutorial.
I need a little help with what I’m working on right now.

I’m trying to build a neural network that predicts a score of 1 to 5 for a given review.
The dataset I’m currently using has two fields: review text | rating.
The rating ranges from 1 to 5 (as floats).

I followed your steps but the y values are 1s and 0s in your instance. But my dataset has a range of 1 to 5 for the rating.
My model should predict a single value from range of 1 to 5

Do you know the correct way that I should pass the y values to the model ?

I’m looking forward to hear from you.

Thanks.

Reply
- Jason Brownlee March 4, 2019 at 6:57 am #
  
  Perhaps try modeling it as regression and then try multi-class classification and compare performance?
  
  Reply

Steve March 4, 2019 at 9:39 am #

Hi Mr. Jason,

Just checking whether I’m doing things right, because when I predict with real data, every prediction is class 4 🙁
The accuracy is really low, around 50 – 65%.
Below is how I’ve configured it.

num_classes = 5
embedding_dim = 300
epochs = 10
batch_size = 128
max_len = 250

# converting points to classes for targets (y)
def points_to_class(self, points):
        if float(points) < 1.5:
            return 0
        elif float(points) < 2.5:
            return 1
        elif float(points) < 3.5:
            return 2
        elif float(points) < 4.5:
            return 3
        else:
            return 4

print('----classifying ratings to classes----')

classified_ratings = reviews_df['rating'].apply(self.points_to_class)
print(classified_ratings.value_counts())
y = to_categorical(classified_ratings, num_classes)


I'm using "glove.840B.300d.txt" as the pre-trained embeddings

embedding_layer = Embedding(len(word_index) + 1, embedding_dim, weights=[emb_matrix], input_length=max_len, trainable=False)


# define model

model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(num_classes, activation='softmax'))

# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

If you see anything that can be improved, please share 🙂

I’m looking forward to hear from you.

Thanks.

Jason Brownlee March 4, 2019 at 2:18 pm #

I have some suggestions for diagnosing and improving your model performance here:
https://machinelearningmastery.com/start-here/#better

Reply
- Steve March 4, 2019 at 4:22 pm #
  
  Highly appreciated, will get back to you after modification 🙂
  Thanks ! 🙂
  
  Reply
  - Steve March 20, 2019 at 10:32 pm #
    
    Hi Mr. Jason,
    
    I’ve tried your suggestions and it was really helpful.
    Right now my model looks like below.
    
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(5, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    callback = [PrintDot(), tensorboard]
    
    # fit network
    model.fit(final_X_train, final_Y_train, epochs=5, batch_size=128, validation_split=0.1, callbacks=callback)
    
    *I’m building a model to output a rating (1-5) for a given review. I’m training the model on more than 400,000 reviews, but after downsampling that comes to 35,000 reviews per class.
    
    *My problem is that during training, train acc is increasing and train loss is decreasing, but although val_acc increases, val_loss starts increasing after 2-3 epochs. This is one of the training records.
    
    ## 5 epochs – 128 batch size
    # loss: 1.0795 – acc: 0.5098 – val_loss: 0.8838 – val_acc: 0.6141
    # loss: 0.8516 – acc: 0.6331 – val_loss: 0.8025 – val_acc: 0.6596
    # loss: 0.7026 – acc: 0.7108 – val_loss: 0.7508 – val_acc: 0.7037
    # loss: 0.5383 – acc: 0.7922 – val_loss: 0.7744 – val_acc: 0.7265
    # loss: 0.4082 – acc: 0.8505 – val_loss: 0.8536 – val_acc: 0.7405
    # The total accuracy is : 0.6314016726089743
    
    This is surely overfitting behaviour.
    **But when I tried to increase the data with another 200,000 samples, the val_acc started to decrease after the 2nd epoch as well.
    
    Do you have any thoughts on this issue? Help is highly appreciated.
    
    I’m looking forward to hear from you.
    
    Thanks.
    
    Reply
    - Jason Brownlee March 21, 2019 at 8:15 am #
      
      You might have to try regularization methods to attack the overfitting.
      
      I have tons of tutorials on this exact topic, start here:
      https://machinelearningmastery.com/start-here/#better
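
One regularization method that fits a log like the one above is early stopping, which halts training once validation loss stops improving. The sketch below is only an illustration and was not suggested in the thread itself: it replays the five val_loss values from the comment through a hand-written patience rule, and the function names and patience values are assumptions.

```python
def best_epoch(val_losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1

def early_stop_epoch(val_losses, patience=1):
    """Epoch at which training would halt after `patience` epochs without improvement."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0      # new best: reset the counter
        else:
            waited += 1                 # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses)              # never triggered

# val_loss per epoch, copied from the training record in the comment above.
val_loss = [0.8838, 0.8025, 0.7508, 0.7744, 0.8536]

best = best_epoch(val_loss)
stop_at = early_stop_epoch(val_loss, patience=1)
```

On this record the validation loss bottoms out at epoch 3, so the weights from that epoch would be the ones to keep under this rule.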

Liangqun March 6, 2019 at 1:45 pm #

In the example, the learned weights perform better than the pre-trained weights. Is there a way to get the learned weights from model? And how are they different via visualization?

Reply
- Jason Brownlee March 6, 2019 at 2:47 pm #
  
  Yes, you can save the embedding layer and use it to get vectors for words.
  
  They will be specific to the neural net model, e.g. targeted towards minimizing loss.
  
  Reply
sun March 22, 2019 at 11:29 pm #

theano_compilation_error_qi7q8r0_

—————————————————————————

Reply
- Jason Brownlee March 23, 2019 at 9:30 am #
  
  Sorry to hear that, perhaps this will help you setup your environment:
  https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
  
  Reply
sun March 25, 2019 at 3:19 pm #

WARNING (theano.gof.compilelock): Overriding existing lock by dead process ‘45020’ (I am process ‘13088’)

You can find the C code in this temporary file: C:\Users\DES\AppData\Local\Temp\theano_compilation_error_yr3062kk

—————————————————————————
Exception Traceback (most recent call last)
in ()
127 # define model
128 model = Sequential()
–> 129 model.add(embedding_layer)
130 model.add(Conv1D(filters=128, kernel_size=5, activation=’relu’))
131 model.add(MaxPooling1D(pool_size=2))

G:\anaconda\lib\site-packages\keras\engine\sequential.py in add(self, layer)
163 # and create the node connecting the current layer
164 # to the input layer we just created.
–> 165 layer(x)
166 set_inputs = True
167 else:

G:\anaconda\lib\site-packages\keras\engine\base_layer.py in __call__(self, inputs, **kwargs)
429 ‘You can build it manually via: ‘
430 ‘layer.build(batch_input_shape)‘)
–> 431 self.build(unpack_singleton(input_shapes))
432 self.built = True
433

G:\anaconda\lib\site-packages\keras\layers\embeddings.py in build(self, input_shape)
107 regularizer=self.embeddings_regularizer,
108 constraint=self.embeddings_constraint,
–> 109 dtype=self.dtype)
110 self.built = True
111

G:\anaconda\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn(‘Update your ' + object_name + ' call to the ‘ +
90 ‘Keras 2 API: ‘ + signature, stacklevel=2)
—> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper

G:\anaconda\lib\site-packages\keras\engine\base_layer.py in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint)
247 if dtype is None:
248 dtype = K.floatx()
–> 249 weight = K.variable(initializer(shape),
250 dtype=dtype,
251 name=name,

G:\anaconda\lib\site-packages\keras\initializers.py in __call__(self, shape, dtype)
110 def __call__(self, shape, dtype=None):
111 return K.random_uniform(shape, self.minval, self.maxval,
–> 112 dtype=dtype, seed=self.seed)
113
114 def get_config(self):

G:\anaconda\lib\site-packages\keras\backend\theano_backend.py in random_uniform(shape, minval, maxval, dtype, seed)
2598 seed = np.random.randint(1, 10e6)
2599 rng = RandomStreams(seed=seed)
-> 2600 return rng.uniform(shape, low=minval, high=maxval, dtype=dtype)
2601
2602

G:\anaconda\lib\site-packages\theano\sandbox\rng_mrg.py in uniform(self, size, low, high, ndim, dtype, nstreams, **kwargs)
870 if nstreams is None:
871 nstreams = self.n_streams(size)
–> 872 rstates = self.get_substream_rstates(nstreams, dtype)
873
874 d = {}

G:\anaconda\lib\site-packages\theano\configparser.py in res(*args, **kwargs)
115 def res(*args, **kwargs):
116 with self:
–> 117 return f(*args, **kwargs)
118 return res
119

G:\anaconda\lib\site-packages\theano\sandbox\rng_mrg.py in get_substream_rstates(self, n_streams, dtype, inc_rstate)
777 # If multMatVect.dot_modulo isn’t compiled, compile it.
778 if multMatVect.dot_modulo is None:
–> 779 multMatVect(rval[0], A1p72, M1, A2p72, M2)
780
781 # This way of calling the Theano fct is done to bypass Theano overhead.

G:\anaconda\lib\site-packages\theano\sandbox\rng_mrg.py in multMatVect(v, A, m1, B, m2)
60 o = DotModulo()(A_sym, s_sym, m_sym, A2_sym, s2_sym, m2_sym)
61 multMatVect.dot_modulo = function(
—> 62 [A_sym, s_sym, m_sym, A2_sym, s2_sym, m2_sym], o, profile=False)
63
64 # This way of calling the Theano fct is done to bypass Theano overhead.

G:\anaconda\lib\site-packages\theano\compile\function.py in function(inputs, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input)
315 on_unused_input=on_unused_input,
316 profile=profile,
–> 317 output_keys=output_keys)
318 return fn

G:\anaconda\lib\site-packages\theano\compile\pfunc.py in pfunc(params, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input, output_keys)
484 accept_inplace=accept_inplace, name=name,
485 profile=profile, on_unused_input=on_unused_input,
–> 486 output_keys=output_keys)
487
488

G:\anaconda\lib\site-packages\theano\compile\function_module.py in orig_function(inputs, outputs, mode, accept_inplace, name, profile, on_unused_input, output_keys)
1839 name=name)
1840 with theano.change_flags(compute_test_value=”off”):
-> 1841 fn = m.create(defaults)
1842 finally:
1843 t2 = time.time()

G:\anaconda\lib\site-packages\theano\compile\function_module.py in create(self, input_storage, trustme, storage_map)
1713 theano.config.traceback.limit = theano.config.traceback.compile_limit
1714 _fn, _i, _o = self.linker.make_thunk(
-> 1715 input_storage=input_storage_lists, storage_map=storage_map)
1716 finally:
1717 theano.config.traceback.limit = limit_orig

G:\anaconda\lib\site-packages\theano\gof\link.py in make_thunk(self, input_storage, output_storage, storage_map)
697 return self.make_all(input_storage=input_storage,
698 output_storage=output_storage,
–> 699 storage_map=storage_map)[:3]
700
701 def make_all(self, input_storage, output_storage):

G:\anaconda\lib\site-packages\theano\gof\vm.py in make_all(self, profiler, input_storage, output_storage, storage_map)
1089 compute_map,
1090 [],
-> 1091 impl=impl))
1092 linker_make_thunk_time[node] = time.time() – thunk_start
1093 if not hasattr(thunks[-1], ‘lazy’):

G:\anaconda\lib\site-packages\theano\gof\op.py in make_thunk(self, node, storage_map, compute_map, no_recycling, impl)
953 try:
954 return self.make_c_thunk(node, storage_map, compute_map,
–> 955 no_recycling)
956 except (NotImplementedError, utils.MethodNotDefined):
957 # We requested the c code, so don’t catch the error.

G:\anaconda\lib\site-packages\theano\gof\op.py in make_c_thunk(self, node, storage_map, compute_map, no_recycling)
856 _logger.debug(‘Trying CLinker.make_thunk’)
857 outputs = cl.make_thunk(input_storage=node_input_storage,
–> 858 output_storage=node_output_storage)
859 thunk, node_input_filters, node_output_filters = outputs
860

G:\anaconda\lib\site-packages\theano\gof\cc.py in make_thunk(self, input_storage, output_storage, storage_map, keep_lock)
1215 cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
1216 input_storage, output_storage, storage_map,
-> 1217 keep_lock=keep_lock)
1218
1219 res = _CThunk(cthunk, init_tasks, tasks, error_storage, module)

G:\anaconda\lib\site-packages\theano\gof\cc.py in __compile__(self, input_storage, output_storage, storage_map, keep_lock)
1155 output_storage,
1156 storage_map,
-> 1157 keep_lock=keep_lock)
1158 return (thunk,
1159 module,

G:\anaconda\lib\site-packages\theano\gof\cc.py in cthunk_factory(self, error_storage, in_storage, out_storage, storage_map, keep_lock)
1618 node.op.prepare_node(node, storage_map, None, ‘c’)
1619 module = get_module_cache().module_from_key(
-> 1620 key=key, lnk=self, keep_lock=keep_lock)
1621
1622 vars = self.inputs + self.outputs + self.orphans

G:\anaconda\lib\site-packages\theano\gof\cmodule.py in module_from_key(self, key, lnk, keep_lock)
1179 try:
1180 location = dlimport_workdir(self.dirname)
-> 1181 module = lnk.compile_cmodule(location)
1182 name = module.__file__
1183 assert name.startswith(location)

G:\anaconda\lib\site-packages\theano\gof\cc.py in compile_cmodule(self, location)
1521 lib_dirs=self.lib_dirs(),
1522 libs=libs,
-> 1523 preargs=preargs)
1524 except Exception as e:
1525 e.args += (str(self.fgraph),)

G:\anaconda\lib\site-packages\theano\gof\cmodule.py in compile_str(module_name, src_code, location, include_dirs, lib_dirs, libs, preargs, py_module, hide_symbols)
2386 # difficult to read.
2387 raise Exception(‘Compilation failed (return status=%s): %s’ %
-> 2388 (status, compile_stderr.replace(‘\n’, ‘. ‘)))
2389 elif config.cmodule.compilation_warning and compile_stderr:
2390 # Print errors just below the command line.

Reply
- Jason Brownlee March 26, 2019 at 7:58 am #
  
  Sorry, I have not seen this error before, I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
sun March 25, 2019 at 3:20 pm #

I am getting the above error. I copy-pasted your code.

Reply
Drew April 25, 2019 at 3:56 pm #

Hello Jason, great article.

I am getting the below type error when I copy and paste the code for creating an array for defining the class labels. Any way you can help:) I’m using Python 3.

from array import array
from __future__ import unicode_literals

# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

—————————————————————————
TypeError Traceback (most recent call last)
in ()
3
4 # define training labels
—-> 5 ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

TypeError: must be a unicode character, not list

Reply
- Jason Brownlee April 26, 2019 at 8:25 am #
  
  I’m sorry to hear that, I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
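
One likely cause, judging by the traceback: `from array import array` imports Python's standard-library `array` type, whose first argument must be a type code character, rather than NumPy's `array()` function. A minimal sketch of the fix, assuming NumPy is installed (the `__future__` import is not needed on Python 3):

```python
# Use NumPy's array() function, not the standard-library array module
from numpy import array

# define training labels: 900 negative reviews (0) followed by 900 positive (1)
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
print(ytrain.shape)
```

The result is a one-dimensional array with one class label per training review.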
  
  Reply
Avinash June 21, 2019 at 12:06 am #

# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
what is this line doing??

Reply
- Jason Brownlee June 21, 2019 at 6:39 am #
  
  Creating class labels for the samples.
  
  Reply
Jeffrey June 25, 2019 at 12:36 am #

Hello,

I am trying to use this to classify product reviews stored in multiple individual text files as having positive or negative sentiment. How do I use the pre-trained model to implement this?

Reply
- Jason Brownlee June 25, 2019 at 6:26 am #
  
  This post shows how to use a pre-trained word embedding:
  https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
  
  Reply
Sixto Robayo July 24, 2019 at 3:37 pm #

Jason, congratulations, good tutorial. Please help me: I must do a sentiment analysis on global warming, taking the data from Twitter, say 10,000 tweets. How might this approach, that is, a pre-trained word embedding, apply to determining the polarity of sentiment (positive, neutral and negative)? Thanks in advance.

Best regards.

Reply
- Jason Brownlee July 25, 2019 at 7:39 am #
  
  The model described in this post would be a good approach.
  
  Start by collecting and cleaning the dataset:
  https://machinelearningmastery.com/clean-text-machine-learning-python/
  
  Reply
  - Sixto Robayo July 26, 2019 at 2:52 pm #
    
    Hi Jason, Thanks a lot for the link.
    
    Reply
    - Jason Brownlee July 27, 2019 at 6:06 am #
      
      You’re welcome.
      
      Reply
Sixto Robayo July 29, 2019 at 6:48 am #

Hi Sr. Jason,

What tool or procedure do you suggest for obtaining the polarity or sentiment score of the tweets, so I can continue with the sentiment analysis? I have collected 10,000 tweets and cleaned them following the good advice of: https://machinelearningmastery.com/clean-text-machine-learning-python/.

Thanks in advance.

Reply
- Jason Brownlee July 29, 2019 at 2:17 pm #
  
  The above tutorial would be a good start.
  
  Reply
Arjun Bali August 22, 2019 at 5:45 am #

Hey Jason, I wanted to check with you on how to feed the embeddings developed from word2vec into a classifier other than neural nets. I thought maybe we need to represent every document as a single vector, so we would either need to average the word embeddings we have generated or do a weighted average using TF-IDF weights.

Reply
- Jason Brownlee August 22, 2019 at 6:33 am #
  
  The embedding vectors for your inputs are concatenated with other inputs and fed to the model.
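
The averaging idea raised in the question can be sketched as follows. This is only an illustration under stated assumptions: the `embedding` dictionary is a hypothetical toy stand-in for vectors taken from a trained word2vec model, and `document_vector()` is a made-up helper name.

```python
import numpy as np

# Hypothetical toy embedding: word -> 3-dimensional vector.
# In practice these vectors would come from a trained word2vec model.
embedding = {
    "good": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.2, 0.5, 0.3]),
    "bad": np.array([-0.8, 0.2, 0.1]),
}

def document_vector(tokens, embedding, size=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)

docs = [["good", "movie"], ["bad", "movie", "unseen"]]
# one fixed-length row per document, regardless of document length
X = np.vstack([document_vector(d, embedding) for d in docs])
print(X.shape)
```

Each row of `X` is a fixed-length feature vector, so it could be passed to a non-neural classifier. A TF-IDF weighted variant would replace the plain mean with a weighted average.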
  
  Reply
Rango September 12, 2019 at 2:24 am #

First of all, let me thank you for your amazing article. Let me ask a question. If we train a word2vec model on a larger dataset than the labelled training set, can we get generalization power for new words that are not in the labelled training set?
For example:
In the training set, the words “My puppies are small” are present.
The test sample is “I have two little puppies”.
The word “little” is not in the training set but is in the word2vec vocabulary. As the word “little” has a similar meaning to “small”, can we get a similar result for the two sentences? If we can, can you explain why?
Thanks in advance!

Reply
- Jason Brownlee September 12, 2019 at 5:21 am #
  
  No. The model must see the words during training. They must be part of the known vocab.
  
  Reply
HSA November 16, 2019 at 7:36 am #

I have read this article multiple times, but I still do not understand the difference between the section on training an embedding layer and the section on word2vec. I mean, isn’t word2vec also training an embedding layer?
I hope my question is clear.
Thanks, Prof. Jason.

Reply
- Jason Brownlee November 17, 2019 at 7:10 am #
  
  word2vec is the general method and the name of a specific algorithm.
  
  We can use the word2vec algorithm or a neural net to learn the vectors.
  
  Does that help?
  
  Reply
HSA November 17, 2019 at 7:59 pm #

Do you mean that section 3 (Train Embedding Layer) has the same purpose as sections 4 (Train word2vec Embedding) and 5 (Use Pre-trained Embedding) together?
When you said “or a neural net”, did you mean the third section (Train Embedding Layer)?
Thanks.

Reply
- Jason Brownlee November 18, 2019 at 6:45 am #
  
  There are many ways to develop a word embedding, such as standalone algorithm or as part of a neural net.
  
  Perhaps I don’t understand your question?
  
  Reply
HSA November 18, 2019 at 7:28 am #

Which section of the above article represents a standalone algorithm, and which one represents an embedding learned as part of a neural network?
Thanks in advance.

Reply
- Jason Brownlee November 18, 2019 at 1:46 pm #
  
  The section titled “Train Embedding Layer” shows how to train an embedding layer as part of the network.
  
  The section titled “Train word2vec Embedding” shows how to fit a word2vec standalone model, then section “Use Pre-trained Embedding” shows how to use it in a network.
  
  Perhaps re-read the tutorial? It’s all spelled out for you.
  
  Reply
HSA November 18, 2019 at 6:26 pm #

I have already read it many times, but I was confused about which sections can be used independently (because they have the same purpose) so that I can compare the results. Your last answer confirmed my understanding: section 3 is independent of sections 4 and 5, which must be combined to compare the result with section 3.
Thank you so much, Prof. Jason.

Reply
- Jason Brownlee November 19, 2019 at 7:40 am #
  
  You’re welcome.
  
  Reply
HSA November 19, 2019 at 1:36 am #

Why do positive_lines and positive_docs differ in this section of code?
# load training data
positive_lines = process_docs('txt_sentoken/pos', vocab, True)
negative_lines = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs

The same goes for negative_lines and negative_docs. It seems that they should have the same name. Am I right?

I mean like this:
# load training data
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs

Reply
- Jason Brownlee November 19, 2019 at 7:46 am #
  
  We are calling the same code, but loading from different source directories.
  
  Reply
HSA November 19, 2019 at 10:04 pm #

Maybe you misunderstood the question.
# load training data
positive_lines = process_docs('txt_sentoken/pos', vocab, True)
negative_lines = process_docs('txt_sentoken/neg', vocab, True)
sentences = negative_docs + positive_docs

An error appears for negative_docs and positive_docs because they are undefined, even though I copied the entire code without any changes.

Reply
- Jason Brownlee November 20, 2019 at 6:17 am #
  
  Thanks, fixed!
  
  Reply
Fabien November 21, 2019 at 8:43 am #

Hi Jason,

I’m using sequence of numeric values coming from product description that I preprocessed via tokenization.
I’m trying to set up a RNN model based on regression to predict the price of the product
For now, I only added a LSTM layer and a Dense(1) layer for the output.
How would you set up this kind of model regarding the different layers ?
Moreover, which metric would you use in the configuration of the model for training?

Sorry if my questions are a bit confusing.
Thank you

PS : I have other features than product description but I don’t know how I can manage them (concatenate them together to have a complete feature to use as input for the model ? or only some of them ?)

Reply
- Jason Brownlee November 21, 2019 at 1:27 pm #
  
  I would recommend testing many different models and many different configurations and use results to guide the exploration.
  
  For regression, MSE or RMSE are a good start. Also scaling data prior to fitting may help.
  
  If you have descriptions, e.g. text, a bag of words model can encode the text and be fed to the model by a separate input, e.g. see this:
  https://machinelearningmastery.com/keras-functional-api-deep-learning/
  
  Reply
martin November 22, 2019 at 6:12 pm #

In this example, tokenizer.fit_on_texts does the ‘one hot encoding’ internally? Users don’t need to worry about the encoding.

Reply
- Jason Brownlee November 23, 2019 at 6:48 am #
  
  Integer encoding. Yes.
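
To make "integer encoding" concrete, here is a pure-Python sketch of the idea. It is not the Keras implementation: `build_word_index()` and `encode()` are hypothetical helpers that only mimic the behavior of giving each word an integer, most frequent first, starting at 1.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # index 0 is left unused so that it can be reserved, e.g. for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

docs = ["great great great movie", "bad movie"]
word_index = build_word_index(docs)
print(word_index)
print(encode(docs, word_index))
```

The output of this step is a list of integer sequences, not one-hot vectors; the embedding layer then looks up a dense vector for each integer.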
  
  Reply
Nikolay Oskolkov November 27, 2019 at 8:42 am #

Hi Jason, why does the Dense network here https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/ outperform the CNN in this tutorial for the movie review sentiment analysis data set?

Reply
- Jason Brownlee November 27, 2019 at 1:46 pm #
  
  Great question!
  
  The models are for demonstration (e.g. here’s how you could implement the approach), neither are optimized to best solve the problem.
  
  In fact, this applies to all code on the blog.
  
  Reply
HSA November 28, 2019 at 10:09 pm #

1-Google word2vec used CBOW and skip-gram to build their model, what did we use here to build our pre-trained model?

2- I wonder why GloVe has its model as a .txt format file while Google word2vec is in .bin format? And if I used one of these formats, does it affect the code? I mean, if I used Google word2vec instead of GloVe, should I make changes to the code?

3- Does the .bin and .txt model formats have the same purpose? if they have the same purpose but only different formats which one is better?

Thanks, prof.Jason

Reply
- Jason Brownlee November 29, 2019 at 6:49 am #
  
  We learned the embedding as part of the model directly.
  
  Different formats – no idea. Researcher preference. Format does not matter.
  
  Yes, both contain word vectors.
  
  Reply
HSA November 29, 2019 at 7:50 pm #

I understand from my reading on the Gensim website that .txt format model is called KeyedVectors, is my understanding right?
https://radimrehurek.com/gensim/models/keyedvectors.html

Reply
- Jason Brownlee November 30, 2019 at 6:29 am #
  
  I guess so.
  
  Reply
Makhloufi lyes January 2, 2020 at 9:52 pm #

Hi Jason, thank you for this tutorial. I have a problem: when I pass my vocabulary into the Word2Vec model, the number of learned words decreases dramatically. Can you enlighten me on this, please?

Reply
- Lyes Makhloufi January 2, 2020 at 9:53 pm #
  
  I have 28830 as the initial vocabulary size, but I get only 119 learned words after Word2Vec.
  
  Reply
  - Jason Brownlee January 3, 2020 at 7:30 am #
    
    Perhaps your words are not in the pre-trained model. In that case, you may need to change your words or train your own embedding.
    
    Reply
- Jason Brownlee January 3, 2020 at 7:29 am #
  
  I don’t follow. Sorry, what is the problem exactly?
  
  Reply
  - Lyes Makhloufi January 4, 2020 at 3:18 am #
    
    Yes, I’m trying to train my word2vec model as you did.
    
    model = Word2Vec(sentences, size=100, window=5, workers=8, min_count=1)
    
    My sentences variable contains 28830 entries, but my vocab length is 119.
    
    words = list(model.wv.vocab)
    print('Vocabulary size: %d' % len(words))  # this line prints 119
    
    Reply
    - Jason Brownlee January 4, 2020 at 8:37 am #
      
      Perhaps you don’t have many words in your dataset?
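
A vocabulary this small can be a sign that each "sentence" passed to Word2Vec is a plain string rather than a list of tokens, in which case every character is treated as a word. This is an assumption about the commenter's data, not something confirmed in the thread. The sketch below uses a hypothetical `vocabulary()` helper, not gensim itself, to show the difference:

```python
def vocabulary(sentences):
    """Collect the distinct items seen when iterating over each sentence."""
    return {token for sentence in sentences for token in sentence}

raw = ["this movie was great", "this movie was bad"]

# Strings: iterating over a string yields characters, so the "words" are letters
char_vocab = vocabulary(raw)

# Lists of tokens: iterating yields whole words
tokenized = [line.split() for line in raw]
word_vocab = vocabulary(tokenized)

print(sorted(word_vocab))
print(sorted(char_vocab))
```

If the vocabulary printed from a trained model looks like single characters, splitting each document into a list of tokens before training is worth trying.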
      
      Reply
Prashant Gavali January 11, 2020 at 12:44 am #

Thank you for such a great post. I have a question. If I am using an RNN for sentence sentiment classification, is an embedding layer still required, or can we pass a pre-trained word vector for each word at each time step? If we have to use an embedding layer, what should the input_length be? One, or some other value? Thank you in advance.

Reply
- Jason Brownlee January 11, 2020 at 7:26 am #
  
  Perhaps test different lengths and see what works best for your dataset.
  
  Reply
Eric January 25, 2020 at 5:48 pm #

First of all, I praise your hard work. Your content has been of immense help to me every single day since I started this path.

And I have a question as well. Is there a reason not to specify a batch-size? I’ve seen that’s like the standard practice in all books, however in this case I don’t see it and I am curious about it.

Reply
- Jason Brownlee January 26, 2020 at 5:15 am #
  
  Thanks!
  
  Yes. I often use the default batch size of 32.
  
  Reply
HSA January 30, 2020 at 2:58 am #

Traceback (most recent call last):

raw_embedding = load_embedding('glove.6B/glove.6B/glove.6B.100d.txt')

embedding[parts[0]] = asarray(parts[1:], dtype='float32')

return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'ng'

Could anyone help me understand why this error appears, although I copied and pasted the same code?
The code runs only if I remove dtype='float32'.

Reply
- Jason Brownlee January 30, 2020 at 6:55 am #
  
  Sorry to hear that, this might help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
HSA January 31, 2020 at 1:51 am #

The solution to the previously mentioned issue is simply to change this line of code:

file = open(filename, 'r')

to:
file = open(filename, encoding="utf8")
This may help anyone who has the same issue as me.
Thank you.

Reply
- Jason Brownlee January 31, 2020 at 7:58 am #
  
  Nice, thanks for sharing.
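
For reference, a loader with the encoding fix applied might look like the sketch below. It follows the `load_embedding()` lines quoted in the traceback above, but the body is a reconstruction rather than the tutorial's exact code, and the two-line temporary file is a hypothetical stand-in for the real GloVe file.

```python
import os
import tempfile
from numpy import asarray

def load_embedding(filename):
    """Load a GloVe-style text file: one token followed by its vector per line."""
    embedding = dict()
    # an explicit encoding avoids depending on the platform default
    with open(filename, 'r', encoding='utf8') as file:
        for line in file:
            parts = line.rstrip().split(' ')
            # first field is the word, the remaining fields are the vector values
            embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# tiny stand-in file (hypothetical contents) that includes a non-ASCII token
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write('film 0.1 0.2 0.3\ncafé 0.4 0.5 0.6\n')
vectors = load_embedding(tmp.name)
os.remove(tmp.name)
print(vectors['café'])
```

Without an explicit encoding, `open()` falls back to a platform-dependent default, which may explain why the same code runs on some machines and fails on others.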
  
  Reply
Rony Sharma February 15, 2020 at 1:54 am #

Hello Jason Brownlee,
I am using your code and it works great for me. But I have some problems. For example, when I use 700 training examples and 300 test examples, this error is shown:

ValueError: Input arrays should have the same number of samples as target arrays. Found 2700 input samples and 2100 target samples.

How can I solve this problem? Please help me.

Reply
- Jason Brownlee February 15, 2020 at 6:35 am #
  
  If you split the input, you must also split the output/target array in the same way.
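
One way to keep inputs and targets in step is to build the labels from the actual number of loaded documents rather than from hard-coded counts. A minimal sketch, where the document lists are hypothetical stand-ins for the encoded reviews:

```python
from numpy import array

# stand-ins for the loaded review lists (hypothetical sizes and contents)
negative_docs = [[1, 2, 3]] * 700
positive_docs = [[4, 5, 6]] * 700

train_docs = negative_docs + positive_docs
# one label per document, in the same order as the documents
ytrain = array([0] * len(negative_docs) + [1] * len(positive_docs))

# fail early if inputs and targets are out of step
assert len(train_docs) == len(ytrain)
print(len(train_docs), len(ytrain))
```

The same pattern applies to the test set: derive `ytest` from the lengths of the test lists.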
  
  Reply
HSA February 15, 2020 at 11:19 pm #

I wonder why I should use the vocabulary text file in the classification part?
we used it in the feature extraction part (when we built our pre-trained model) to save most frequent words in the corpus, but why we used it in the model construction part for classification when we read any dataset to classify it according to the pre-trained model features?
given that when I removed the vocab file from the classification part, I got better f1-score, then I start wondering why we should use it in the classification part??

thanks

Reply
- Jason Brownlee February 16, 2020 at 6:07 am #
  
  We use it to trim the data down to the pre-defined vocab – I believe.
  
  Reply
Rony Sharma February 21, 2020 at 12:31 am #

Dear Jason Brownlee,

How are you? Hopefully, you are well. How can I implement an RNN model with this embedding layer code? Please give me some suggestions, or a link on how to implement the embedding layer with an RNN. I would be grateful.

Thanks in advance

Reply
- Jason Brownlee February 21, 2020 at 8:24 am #
  
  I have many examples, perhaps start here:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
HSA February 23, 2020 at 11:04 pm #

How can I upload an image to clarify my question?

Reply
- Jason Brownlee February 24, 2020 at 7:42 am #
  
  There are many places on the web where you can upload images. Blog, social, github, imgur, etc.
  
  Reply
  - HSA February 24, 2020 at 9:46 am #
    
    Hmm, I do not know if you understood my question. I want to upload an image in the comments section of your article so that I can ask questions based on the image.
    
    Reply
    - Jason Brownlee February 24, 2020 at 1:25 pm #
      
      Yes. The answer is to upload it elsewhere and link to it.
      
      Reply
Md. Asraful Islam February 26, 2020 at 5:47 pm #

Dear Jason Brownlee,

How are you? hope you are fine.

Sir, I need your help. I need to create a GloVe file for my sentiment analysis, but I don’t know how to create one. I saw on your website that you use the pre-trained GloVe file “glove.6B.100d.txt”. But I need to create a GloVe file for my own dataset. Could you give me a suggestion or code for how I can create my own glove.txt file?

Reply
- Jason Brownlee February 27, 2020 at 5:40 am #
  
  Sorry, I don’t have an example of creating a GloVe embedding.
  
  Reply
Saifur Rahman March 23, 2020 at 12:15 am #

Dear Jason Brownlee,
Which type of plot can be used with this CNN model? Could you give me some suggestions?

Reply
- Jason Brownlee March 23, 2020 at 6:13 am #
  
  Sorry, I don’t understand. What would you plot exactly?
  
  Reply
  - Saifur Rahman March 24, 2020 at 12:48 am #
    
    Scatter plot .
    
    Reply
    - Jason Brownlee March 24, 2020 at 6:04 am #
      
      I don’t know how you might plot a model using a scatter plot.
      
      Reply
Rony Sharma March 27, 2020 at 12:24 am #

Hello,

How to get f1-score in these models or code?

Reply
- Jason Brownlee March 27, 2020 at 6:14 am #
  
  See this tutorial for using f1 with keras:
  https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/
  
  Reply
Saifur Rahman April 14, 2020 at 1:05 am #

Hello sir,
How can I get the confusion matrix values in this code? Please help me.

Reply
- Jason Brownlee April 14, 2020 at 6:20 am #
  
  This tutorial will show you how:
  https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/
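
As a rough illustration of what the confusion matrix and F1 score measure, here is a NumPy-only sketch. The labels and probabilities are made up; in practice `y_prob` would come from `model.predict()` on the test set, and the linked tutorial covers the library route.

```python
import numpy as np

# hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.9, 0.4, 0.8, 0.1])

# turn probabilities into class labels with a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# rows are actual classes, columns are predicted classes
confusion = np.array([[tn, fp], [fn, tp]])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(confusion)
print(f1)
```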
  
  Reply
Michael Szczepaniak April 22, 2020 at 3:27 am #

Assuming each review is 0 if negative and 1 if positive, is there a straightforward way to see the probabilities the model assigns to each review?

I’ll be using pre-trained embeddings and after the model is trained on a set of documents, I need to be able to send the model a document and have it give me a probability of whether it’s relevant wrt my training data (close to 1 if it is, close to 0 if it isn’t).

Reply
- Michael Szczepaniak April 22, 2020 at 3:34 am #
  
  I think your 2018-11-29 reply to chkib arslan is what I was looking for.
  
  Reply
  - Jason Brownlee April 22, 2020 at 6:05 am #
    
    Great!
    
    Also see this:
    https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
    
    Reply
- Jason Brownlee April 22, 2020 at 6:05 am #
  
  Yes, calling model.predict() gives the probability.
  
  Reply
Michael Szczepaniak April 22, 2020 at 3:47 am #

Is there a way to save the CNN model itself after it’s been trained? I need to create lots of these and eventually version control them, but now, I just need to be able to save the weights and the configuration that I can reconstitute later.

Reply
- Jason Brownlee April 22, 2020 at 6:05 am #
  
  Yes, see this:
  https://machinelearningmastery.com/save-load-keras-deep-learning-models/
  
  Reply
Ibrahim May 5, 2020 at 3:38 am #

In your model, how do you handle OOV (out-of-vocabulary) words?
I am looking to implement character-level word embeddings to handle OOV words, and I found the 1D CNN in Kim’s paper. But I don’t understand how to train them or how to find the embedding matrix. Could you help me?

Reply
- Jason Brownlee May 5, 2020 at 6:35 am #
  
  We remove them during pre-processing or map them to index 0 for “unknown”.
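
The two options mentioned in this reply can be sketched as follows. The `word_index` dictionary and both helper functions are hypothetical and serve only to illustrate the idea:

```python
# hypothetical word index learned from the training data; 0 is reserved
word_index = {"good": 1, "movie": 2, "bad": 3}

def encode_drop(tokens, word_index):
    """Option 1: remove out-of-vocabulary words."""
    return [word_index[t] for t in tokens if t in word_index]

def encode_unknown(tokens, word_index, unknown=0):
    """Option 2: map out-of-vocabulary words to a reserved index."""
    return [word_index.get(t, unknown) for t in tokens]

tokens = ["good", "unseen", "movie"]
print(encode_drop(tokens, word_index))     # [1, 2]
print(encode_unknown(tokens, word_index))  # [1, 0, 2]
```

Dropping words shortens the sequence, while mapping to a reserved index keeps word positions intact.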
  
  Reply
Mohammad May 16, 2020 at 4:29 am #

Hi Jason,

First, thanks for your great post!

I have a question. Imagine that I have a dataset, including about 2000 reviews, and only 200 of them are regarded as negative, which means the rest of the reviews demonstrate positive sentiment.

If I train the model on all data, the accuracy is about 97 percent, which I think is a little imaginary! However, if I select 200 negatives and 200 random positive reviews and fit the model on 400 data, the accuracy decreases to 85 percent. Which approach is more academic, scientific, and accurate?

Reply
- Jason Brownlee May 16, 2020 at 6:21 am #
  
  You’re welcome.
  
  Don’t use accuracy:
  https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
  
  Select a better metric:
  https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/
  
  Perhaps explore class weighting to lift model skill:
  https://machinelearningmastery.com/cost-sensitive-neural-network-for-imbalanced-classification/
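
A quick worked example shows why accuracy misleads here. Assuming the 1,800 positive and 200 negative reviews described in the question, a "model" that always predicts positive already reaches 90% accuracy while finding none of the negative reviews:

```python
import numpy as np

# 1800 positive (1) and 200 negative (0) reviews, as in the question
y_true = np.array([1] * 1800 + [0] * 200)

# a "model" that ignores the text and always predicts positive
y_pred = np.ones_like(y_true)

accuracy = np.mean(y_pred == y_true)
# share of the negative reviews that were correctly identified
negative_recall = np.mean(y_pred[y_true == 0] == 0)
print(accuracy, negative_recall)
```

A reported accuracy of 97 percent therefore needs to be compared against this 90 percent baseline, not against 50 percent.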
  
  Reply
Mohammad May 16, 2020 at 8:12 am #

Thanks, Jason! Those were really helpful.

I got another question. I’ve found this post for text encoding needed for the embedding layer of a CNN.

https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

What’s the difference between one-hot encoding and Word2Vec function? And which one is more powerful?

Thanks a lot!

Reply
- Jason Brownlee May 16, 2020 at 10:12 am #
  
  They are both ways of representing words as vectors.
  
  The one-hot is sparse and specific. The embedding is learned and more flexible – capable of adapting to the specifics of your data and model. Yes, it can be more useful/powerful.
  
  Try each method and use whatever gives the best performance.
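
The contrast can be illustrated with a small NumPy sketch. The sizes are arbitrary and the embedding matrix is random here, whereas in a real model it would be learned during training:

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
word_id = 2  # an integer-encoded word

# one-hot: a sparse vector as long as the vocabulary, with a single 1
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# embedding: a dense row from a weight matrix (random here, learned in practice)
rng = np.random.default_rng(1)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
dense = embedding_matrix[word_id]

# looking up a row is equivalent to multiplying the one-hot vector by the matrix
print(np.allclose(one_hot @ embedding_matrix, dense))
```

The one-hot vector grows with the vocabulary and treats every pair of words as equally different, while the dense vector has a fixed, smaller size.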
  
  Reply
  - Mohammad May 18, 2020 at 6:43 pm #
    
    Thanks,
    
    Can I use Word2Vec for non_English texts or captions?
    
    Reply
    - Jason Brownlee May 19, 2020 at 5:58 am #
      
      Yes!
      
      Reply
Dilani June 5, 2020 at 7:18 pm #

Hi,

I’m using LSTM and GloVe pre-trained word embeddings to build a sentiment analysis model, and the code is almost the same as the one you have written here. Currently I’m struggling to find a way to use this model to predict the sentiment when we give raw text as the input. I’m loading the model, which was saved as an .h5 file, and I’m applying the same pre-processing to the text that I applied while training the model. Apart from that, what else should I do before calling the model.predict method?

I’m new to machine learning so desperately looking for an answer to this. Couldn’t find any proper solution even though i googled. I would really appreciate if you can guide me on this.

Reply
- Jason Brownlee June 6, 2020 at 7:48 am #
  
  The data prep objects should be saved with the model, e.g. identical to what was used to prepare the training set.
  
  Beyond that, it sounds like you’re on the right track. Perhaps you need to step through your code/debug.
  
  Reply
Jay June 20, 2020 at 3:29 pm #

Hi, suppose that I train a model using pre-trained GloVe or Word2Vec. When testing on completely new input, wouldn’t I need to get the word vectors for the given input? Would I then have to update the embedding?

Reply
- Jason Brownlee June 21, 2020 at 6:18 am #
  
  If text contains a word not known to the embedding, it is mapped to the value 0 which means “unknown”.
  
  Reply
Ahmad January 10, 2021 at 11:50 am #

Thanks for your great post!

How can I perform emotion recognition on text using ML algorithms in Python?

Reply
- Jason Brownlee January 10, 2021 at 1:09 pm #
  
  Perhaps start with a dataset that you can use to train a model on the emotions you’re interested in identifying.
  
  Reply
SK March 26, 2021 at 9:08 am #

Hi Jason,

Thanks for your great post!

In your post, you used a supervised neural network model to automatically classify movie reviews as positive or negative. I was wondering how we could classify or cluster unlabelled movie reviews as positive or negative, that is, with an unsupervised method.

Also, do you have information about “Positive and Negative Words Databases” which contain English words associated with a sentiment score between -1 (most negative) and 1 (most positive)? If so, which one do you recommend most? Is it free, or does it have to be bought?

Thank you!!

Reply
- Jason Brownlee March 29, 2021 at 5:48 am #
  
  If you have historical examples with class labels, you can train a model on available data, then use that model to classify new examples as one of the two labels.
  
  If you do not have labels, perhaps you can label them manually?
  
  If you don’t need/want labels then it will be an unsupervised learning problem, I guess.
  
  Reply
Sudhakar Vankamamidi March 29, 2021 at 2:29 am #

Code samples depicting sentiment analysis would have been more helpful

Reply
- Jason Brownlee March 29, 2021 at 6:19 am #
  
  The above examples are exactly this.
  
  Reply
SK March 29, 2021 at 7:22 am #

Hi Jason,

Thanks for your reply. No, I do not have historical examples with class labels, and I prefer not to label them because of subjective bias.

Yes, I reckon it will be an unsupervised learning problem using clustering, and that is why I was wondering if you have information about “Positive and Negative Words Databases” which contain English words associated with a sentiment score between -1 (most negative) and 1 (most positive). If so, which one do you recommend most? Is it free, or does it have to be bought? Then I can cluster these documents according to the distribution of the sentiment scores of the words in each document. Thanks!

Reply
- Jason Brownlee March 30, 2021 at 5:52 am #
  
  If you do not have labelled data and do not want to label it, then it does sound like a clustering problem.
  
  If you want to use sentiment of already classified data, then this sounds like a straight programming problem – not machine learning.
  
  Reply
Ali Al Bataineh March 29, 2021 at 11:35 am #

Hey Jason,
Nice article like always.
You have used a tokenizer to map words to integers, which I clearly understand, because the embedding layer expects the input to be in integer form. In the case of tf-idf or one-hot encoding approaches, do they also expect the input to be in integer form before using them? My last question: the embedding layer also requires all inputs to be the same length; do other approaches like tf-idf require the same?

Thank you so much! You’re the best in the west!

Reply
- Jason Brownlee March 30, 2021 at 5:54 am #
  
  Typically bag of words and one hot encoding methods allow you to convert text to integer vectors directly.
  
  Embedding layers assume inputs are integers, e.g. ordinal encoded words/categories.
  
  Reply
Emran April 5, 2021 at 2:52 am #

Hi Sir,

Thanks for this precious article.

However, I have a question: if I want to split the dataset into 80% for training and 20% for test, how can I adapt the code?

I tried to do the following, but it doesn’t work!

def process_docs(directory, vocab, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_trian and filename.startswith('cv8') and filename.startswith('cv9'):
            continue
        if not is_trian and not filename.startswith('cv8') and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# define training labels
ytrain = array([0 for _ in range(800)] + [1 for _ in range(800)])
# define test labels
ytest = array([0 for _ in range(200)] + [1 for _ in range(200)])

I shall be thankful to you for your assistance.

Reply
- Jason Brownlee April 5, 2021 at 6:14 am #
  
  You can change the code so 000 to 799 are used for train and 800 onwards are used for test.
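
One detail in the posted code worth checking: a filename cannot start with both 'cv8' and 'cv9', so the first condition is never true and no test reviews are skipped when loading the training set. A sketch of the file selection for an 80/20 split, where `is_test_file()` and `keep_file()` are hypothetical helpers and the filenames are only illustrative:

```python
def is_test_file(filename):
    """Reviews numbered 800 to 999 (prefix cv8 or cv9) form the test set."""
    # str.startswith accepts a tuple and matches any of the prefixes
    return filename.startswith(('cv8', 'cv9'))

def keep_file(filename, is_train):
    """Keep training files when is_train is True, test files otherwise."""
    return is_test_file(filename) != is_train

# illustrative filenames following the cvNNN_ naming pattern
names = ['cv000_29416.txt', 'cv799_19812.txt', 'cv800_13494.txt', 'cv999_14636.txt']
print([n for n in names if keep_file(n, True)])
print([n for n in names if keep_file(n, False)])
```

Inside `process_docs()`, the two `if ... continue` checks could then be replaced with a single `if not keep_file(filename, is_trian): continue`.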
  
  Sorry, I don’t have the capacity to review your code.
  
  Reply
Eslam Khodair April 7, 2021 at 12:38 am #

Hello Jason,

Thanks for your clear explanation,
can we use a non-square kernel size for this task?

Reply
- Jason Brownlee April 7, 2021 at 5:10 am #
  
  Perhaps try it and see.
  
  Reply
Sona Joseph April 25, 2021 at 8:29 pm #

Sir,
Thanks for your great article! We used this code to classify the text of different authors in our college project. It shows the accuracy for the embedding layer. But in the case of word2vec, it does not print any accuracy. It shows a nan value for the loss function. Can you please tell us the reason for that? It would be a great help for us.

Reply
- Jason Brownlee April 26, 2021 at 5:36 am #
  
  You’re welcome.
  
  That sounds like a great project!
  
  Sorry to hear that you are having an error. Perhaps the tips here will help you debug your code:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
  - Sona Joseph May 5, 2021 at 11:27 pm #
    
    Thank you so much sir for your great help. I will check and reply after that.
    
    Reply
    - Jason Brownlee May 6, 2021 at 5:46 am #
      
      You’re welcome.
      
      Reply
      - Sona Joseph July 13, 2021 at 7:19 pm #
        
        Sir,
        By using the tips that you gave, we got 66 accuracy. We classified the text of nine authors. After increasing the number of documents and making some changes to the hidden layers, it now shows 100 for test accuracy. My question is: is that accuracy an error, or is 100 accuracy possible for a model? To get the train and test accuracy we used the same code that you gave above.
      - Jason Brownlee July 14, 2021 at 5:29 am #
        
        Good question, perhaps this will help:
        https://machinelearningmastery.com/faq/single-faq/what-does-it-mean-if-i-have-0-error-or-100-accuracy
Ali May 8, 2021 at 10:35 am #

Thank you for this post!

I want to ask you about how we can calculate vocabulary coverage for pre-trained transfer learning models.

Reply
- Jason Brownlee May 9, 2021 at 5:52 am #
  
  Not sure what you mean exactly. Perhaps a count of the words in your vocab vs the words in the existing model’s vocab?
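
One way to read that suggestion is as the fraction of your vocabulary that also appears in the pre-trained model's vocabulary. The sketch below is only an illustration under that assumption; `vocab_coverage` is a hypothetical helper, and both vocabularies are taken to be plain collections of word strings.

```python
def vocab_coverage(vocab, embedding_vocab):
    # Fraction of distinct words in `vocab` that the pre-trained model knows,
    # plus the sorted list of words it does not know.
    vocab = set(vocab)
    if not vocab:
        return 0.0, []
    known = set(embedding_vocab)
    missing = sorted(vocab - known)
    coverage = (len(vocab) - len(missing)) / len(vocab)
    return coverage, missing


coverage, missing = vocab_coverage(
    ['good', 'bad', 'plot', 'meh'],
    {'good', 'bad', 'plot', 'film'},
)
```

Here three of the four words are covered, so `coverage` is 0.75 and `missing` is `['meh']`.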
  
  Reply
SURAJ TR May 23, 2021 at 7:08 pm #

I am unable to open the dataset link. Please help.

Reply
- Jason Brownlee May 24, 2021 at 5:43 am #
  
  You can download all of the datasets from here:
  https://github.com/jbrownlee/Datasets
  
  Reply
Okkes June 13, 2021 at 8:02 pm #

Hi Jason,

I get the following error:

ValueError: Input 0 of layer dense is incompatible with the layer: expected axis -1 of input shape to have value 342272 but received input with shape (None, 49888)

Here is the model summary:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 21400, 32)         1912864
_________________________________________________________________
conv1d (Conv1D)              (None, 21393, 32)         8224
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 10696, 32)         0
_________________________________________________________________
flatten (Flatten)            (None, 342272)            0
_________________________________________________________________
dense (Dense)                (None, 10)                3422730
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11
=================================================================
Total params: 5,343,829
Trainable params: 5,343,829
Non-trainable params: 0

What could cause this error? Can you please help me?

Thanks in advance,
Okkes

Reply
- Jason Brownlee June 14, 2021 at 5:42 am #
  
  Sorry to hear that, perhaps some of these tips will help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
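
The shapes in the error message allow a rough diagnosis, offered here as a guess rather than a confirmed cause. In the summary above, the Conv1D layer shortens the sequence by 7 (consistent with a kernel size of 8) and the pooling layer halves it, so the flattened size depends directly on the padded input length. The sketch below uses a hypothetical helper, `flatten_size`, with the kernel size, pool size and filter count inferred from the summary.

```python
def flatten_size(input_length, kernel_size=8, pool_size=2, filters=32):
    # Conv1D with 'valid' padding: length shrinks by kernel_size - 1
    conv_length = input_length - kernel_size + 1
    # MaxPooling1D: length is divided by pool_size, rounded down
    pooled_length = conv_length // pool_size
    # Flatten: remaining time steps multiplied by the number of filters
    return pooled_length * filters


# Input length the model was defined with, taken from the summary
expected = flatten_size(21400)
# An input length that would produce the size reported in the error
received = flatten_size(3125)
```

Under these assumptions `expected` is 342272 and `received` is 49888, which would suggest that the data passed to the model was padded to roughly 3125 tokens rather than the 21400 the model was built for. Padding every dataset to the same maximum length would be the first thing to check.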
  
  Reply
Aklilu June 15, 2021 at 7:28 pm #

Please, I have a project and need help finding sample code for the topic “text clustering using Deep Learning”.
I have been reading and searching your books and different sites related to my topic, but unfortunately I mostly found material on text classification only. I couldn’t find samples that might help me with my project.

Additional question: how do I develop a word embedding with a CNN to cluster sample datasets?

Reply
- Jason Brownlee June 16, 2021 at 6:19 am #
  
  Sorry, I don’t have any examples of text clustering.
  
  Reply
Aryan Chauhan October 13, 2021 at 5:13 pm #

Hi Jason, that’s a great article. But after initializing embedding_weights using word2vec and GloVe I am getting loss “nan” every time while training. Can you help me understand why I am getting loss = nan? I can’t figure out why this is happening.

Thanks

Reply
- Adrian Tam October 14, 2021 at 3:16 am #
  
  If your loss is nan, check your data. Likely there is some missing data, which means the loss cannot be computed.
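
A minimal sketch of such a check is shown below. It counts NaN and infinite entries in any 2D structure of numbers, such as the embedding weight matrix or the padded input data. The helper name `count_bad_values` is hypothetical, and the sketch uses only the standard library so it works on nested lists as well as on NumPy arrays.

```python
import math


def count_bad_values(rows):
    # Count NaN and infinite entries in a 2D structure of numbers.
    n_nan = 0
    n_inf = 0
    for row in rows:
        for value in row:
            number = float(value)
            if math.isnan(number):
                n_nan += 1
            elif math.isinf(number):
                n_inf += 1
    return n_nan, n_inf


# Toy weight matrix with one NaN and one infinite entry
weights = [[0.1, float('nan')], [float('inf'), 2.0]]
n_nan, n_inf = count_bad_values(weights)
```

If either count is non-zero for the embedding weights or the inputs, that is a likely place to start looking for the source of the nan loss.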
  
  Reply
Dan April 5, 2022 at 4:35 am #

Hi Jason, thanks for this incredible article. How would you go about doing the same for a larger dataset? Would there be much to change, such as your preprocessing methods and the dimensions of the model? I would like to run the models again with a larger dataset this time.

Thanks again,
Dan.

Reply
- James Carmichael April 5, 2022 at 7:01 am #
  
  Hi Dan…I generally recommend working with a small sample of your dataset that will fit in memory.
  
  I recommend this because it will accelerate the pace at which you learn about your problem:
  
  It is fast to test different framings of the problem.
  It is fast to summarize and plot the data.
  It is fast to test different data preparation methods.
  It is fast to test different types of models.
  It is fast to test different model configurations.
  The lessons that you learn with a smaller sample often (but not always) translate to modeling with the larger dataset.
  
  You can then scale up your model later to use the entire dataset, perhaps trained on cloud infrastructure such as Amazon EC2.
  
  There are also many other options if you are interested in training with a large dataset. I list 7 ideas in this post:
  
  7 Ways to Handle Large Data Files for Machine Learning
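
As a small illustration of working with a sample first, the sketch below draws a reproducible random subset of documents together with their labels. The helper name `sample_dataset`, the default fraction and the seed are all assumptions made for this example.

```python
import random


def sample_dataset(docs, labels, fraction=0.1, seed=1):
    # Draw a reproducible random subset, keeping docs and labels aligned.
    if len(docs) != len(labels):
        raise ValueError('docs and labels must have the same length')
    if not docs:
        return [], []
    n_keep = max(1, int(len(docs) * fraction))
    rng = random.Random(seed)
    # Sort the chosen indices so the original order is preserved
    indices = sorted(rng.sample(range(len(docs)), n_keep))
    return [docs[i] for i in indices], [labels[i] for i in indices]


# Toy data: 100 documents with alternating labels
all_docs = ['review %d' % i for i in range(100)]
all_labels = [i % 2 for i in range(100)]
small_docs, small_labels = sample_dataset(all_docs, all_labels, fraction=0.1)
```

A plain random sample like this does not guarantee balanced classes, so for a small fraction it may be worth sampling each class separately.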
  
  Reply
Dan April 5, 2022 at 11:49 pm #

Hi James, thanks for the response. I do agree; however, I was able to run other CNN models with the IMDB review set (25,000 reviews) from Keras that were not time-consuming or computationally expensive. I would like to see how that dataset performs under Jason’s model and what I would have to change. I’ve heard it’s good practice to have a larger dataset.
Thanks

Reply
Jacob June 16, 2022 at 2:00 am #

Hey, thanks for the instructive article.

Am I correct in assuming that if I want to train a model that can evaluate unknown texts (e.g. texts that users choose to submit for evaluation through a web app), then I shouldn’t try to have the model learn its own embedding, even though that’s the most accurate option for evaluating texts that you have access to during setup?

Reply
- James Carmichael June 16, 2022 at 10:54 am #
  
  Hi Jacob…Your understanding is correct.
  
  Reply
Luis Mazabuel December 13, 2023 at 5:11 pm #

Hi James, thank you for the content.

I wonder how I could adapt section 5 of your blog (5. Use Pre-trained Embedding) to a NN that classifies Spanish texts in a multiclass task, using some other pre-trained embedding model.

For example, I might want to use the Universal Sentence Encoder – Multilingual from TensorFlow, BERT-base-multilingual from Google, a pre-trained model by OpenAI (like the GPTs), or any other pre-trained LLM that fits the context of my task (Spanish, multiclass classification, legal documents, etc.).

Reply
- James Carmichael December 14, 2023 at 11:37 am #
  
  Hi Luis…You are very welcome! The following resource may be of interest to you:
  
  https://pypi.org/project/sentiment-analysis-spanish/
  
  Reply
Luis Mazabuel December 28, 2023 at 1:21 am #

But in my case I am not interested in sentiment analysis, but in a multiclass classification task. I want to train a NN with a pre-trained embedding for Spanish or multilingual documents.

Reply
