Predict Sentiment From Movie Reviews Using Deep Learning

Sentiment analysis is a natural language processing problem where text is understood and the underlying intent is predicted.

In this post, you will discover how you can predict the sentiment of movie reviews as either positive or negative in Python using the Keras deep learning library.

After reading this post you will know:

  • About the IMDB sentiment analysis problem for natural language processing and how to load it in Keras.
  • How to use word embedding in Keras for natural language problems.
  • How to develop and evaluate a multi-layer perceptron model for the IMDB problem.
  • How to develop a one-dimensional convolutional neural network model for the IMDB problem.

Let’s get started.

  • Update Oct/2016: Updated examples for Keras 1.1.0 and TensorFlow 0.10.0.
  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.

Photo by SparkCBC, some rights reserved.

IMDB Movie Review Sentiment Problem Description

The dataset is the Large Movie Review Dataset often referred to as the IMDB dataset.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

The data was collected by Stanford researchers and used in a 2011 paper [PDF] in which a 50/50 split of the data was used for training and test, achieving an accuracy of 88.89%.

The data was also used as the basis for a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” from late 2014 to early 2015. Accuracies above 97% were achieved, with the winners achieving 99%.

Load the IMDB Dataset With Keras

Keras provides built-in access to the IMDB dataset.

The keras.datasets.imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that indicate each word’s overall frequency rank in the dataset (e.g. the integer 3 encodes the 3rd most frequent word). Each review is therefore represented as a sequence of integers.

Calling imdb.load_data() the first time will download the IMDB dataset to your computer and store it in your home directory under ~/.keras/datasets/imdb.pkl as a 32 megabyte file.

Usefully, the imdb.load_data() function provides additional arguments, including the number of top words to load (words outside this range are marked with a placeholder value in the returned data), the number of most frequent words to skip (to avoid very common words like “the”), and the maximum length of reviews to support.

Let’s load the dataset and calculate some properties of it. We will start off by loading some libraries and loading the entire IMDB dataset as a training dataset.
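
A minimal sketch of this step is shown below. I am assuming the Keras 2 API here (older releases use slightly different argument names) and matplotlib for the plot later on; imdb.load_data() returns the standard train/test split, which we concatenate for exploration.

import numpy
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset and combine the train and test splits into one dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = numpy.concatenate((X_train, X_test), axis=0)
y = numpy.concatenate((y_train, y_test), axis=0)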

Next we can display the shape of the training dataset.
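
One way to do this, continuing from the combined arrays above:

# summarize the size of the dataset
print("Training data: ")
print(X.shape)
print(y.shape)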

Running this snippet, we can see that there are 50,000 records.

We can also print the unique class values.
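
For example:

# summarize the class values (0 for negative, 1 for positive sentiment)
print("Classes: ")
print(numpy.unique(y))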

We can see that it is a binary classification problem for good and bad sentiment in the review.

Next we can get an idea of the total number of unique words in the dataset.
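
One way to count them, continuing from the combined array X above (this flattens all reviews into one long list of word indexes before finding the unique values):

# summarize the number of unique words
print("Number of words: ")
print(len(numpy.unique(numpy.hstack(X))))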

Interestingly, we can see that there are just under 100,000 words across the entire dataset.

Finally, we can get an idea of the average review length.
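
A sketch of this summary and the box and whisker plot, continuing the same script:

# summarize review length
print("Review length: ")
result = [len(x) for x in X]
print("Mean %.2f words (%f)" % (numpy.mean(result), numpy.std(result)))
# plot review lengths as a box and whisker plot
pyplot.boxplot(result)
pyplot.show()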

We can see that the average review has just under 300 words with a standard deviation of just over 200 words.

Looking at a box and whisker plot of the review lengths in words, we can see an exponential-like distribution, and we can probably cover the mass of the distribution with a clipped length of 400 to 500 words.

Review Length in Words for IMDB Dataset

Word Embeddings

A recent breakthrough in the field of natural language processing is called word embedding.

This is a technique where words are encoded as real-valued vectors in a high-dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Discrete words are mapped to vectors of continuous numbers. This is useful when working on natural language problems with neural networks and deep learning models, as we require numbers as input.

Keras provides a convenient way to convert positive integer representations of words into a word embedding via an Embedding layer.

The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. the largest integer value that will appear in the input). The layer also allows you to specify the dimensionality of each word vector, called the output dimension.

We would like to use a word embedding representation for the IMDB dataset.

Let’s say that we are only interested in the first 5,000 most used words in the dataset. Therefore our vocabulary size will be 5,000. We can choose to use a 32-dimension vector to represent each word. Finally, we may choose to cap the maximum review length at 500 words, truncating reviews longer than that and padding reviews shorter than that with 0 values.

We would load the IMDB dataset as follows:
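
A sketch of the call (num_words is the Keras 2 name for this argument; Keras 1 called it nb_words):

from keras.datasets import imdb
# load the dataset, keeping only the top 5,000 most frequent words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)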

We would then use the Keras utility to truncate or pad the dataset to a length of 500 for each observation using the sequence.pad_sequences() function.
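
For example, continuing from the arrays loaded above:

from keras.preprocessing import sequence
# truncate or zero-pad every review to exactly 500 integers
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)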

Finally, later on, the first layer of our model would be a word embedding layer created using the Embedding class as follows:
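
A sketch of that layer, reusing the top_words and max_words values chosen above (the first two arguments are the vocabulary size and the dimensionality of each word vector):

from keras.layers.embeddings import Embedding
# map each of the top 5,000 word indexes to a 32-dimension vector
Embedding(top_words, 32, input_length=max_words)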

The output of this first layer would be a matrix of size 500×32 (one 32-element vector for each of the 500 words) for a given review training or test pattern in integer format.

Now that we know how to load the IMDB dataset in Keras and how to use a word embedding representation for it, let’s develop and evaluate some models.

Simple Multi-Layer Perceptron Model for the IMDB Dataset

We can start off by developing a simple multi-layer perceptron model with a single hidden layer.

The word embedding representation is a true innovation and we will demonstrate what would have been considered world class results in 2011 with a relatively simple neural network.

Let’s start off by importing the classes and functions required for this model and initializing the random number generator to a constant value to ensure we can easily reproduce the results.
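
A sketch of those imports and the seeding is below; I am assuming the Keras 2 module layout here, which may differ slightly in other versions.

import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix the random seed for reproducibility
numpy.random.seed(7)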

Next we will load the IMDB dataset. We will simplify the dataset as discussed during the section on word embeddings. Only the top 5,000 words will be loaded.

We will also use a 50%/50% split of the dataset into training and test. This is a good standard split methodology.

We will bound reviews at 500 words, truncating longer reviews and zero-padding shorter reviews.
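
Loading and preparing the data might then look as follows, continuing from the imports above and using the vocabulary size and review length discussed earlier:

# load the dataset, keeping only the top 5,000 most frequent words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and zero-pad the reviews to a fixed length of 500 words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)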

Now we can create our model. We will use an Embedding layer as the input layer, setting the vocabulary to 5,000, the word vector size to 32 dimensions and the input_length to 500. The output of this first layer will be a 500×32 sized matrix, as discussed in the previous section.

We will flatten the Embedding layer’s output to one dimension, then use one dense hidden layer of 250 units with a rectifier activation function. The output layer has one neuron and will use a sigmoid activation to output values between 0 and 1 as predictions.

The model uses logarithmic loss and is optimized using the efficient ADAM optimization procedure.
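
Putting that together, a minimal model definition might look like this (binary_crossentropy is the Keras name for logarithmic loss on a two-class problem):

# Embedding -> Flatten -> Dense(250, relu) -> Dense(1, sigmoid)
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())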

We can fit the model and use the test set as validation while training. This model overfits very quickly so we will use very few training epochs, in this case just 2.

There is a lot of data so we will use a batch size of 128. After the model is trained, we evaluate its accuracy on the test dataset.
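
A sketch of the fit and evaluation (epochs is the Keras 2 spelling of the argument; Keras 1 used nb_epoch):

# fit the model, using the test set as validation data
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# final evaluation of the model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))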

Running this example fits the model and summarizes the estimated performance. We can see that this very simple model achieves a score of 86.94%, which is in the neighborhood of the original paper, with very little effort.

I’m sure we could do better, perhaps by using a larger embedding and adding more hidden layers. Let’s try a different network type.

One-Dimensional Convolutional Neural Network Model for the IMDB Dataset

Convolutional neural networks were designed to honor the spatial structure in image data whilst being robust to the position and orientation of learned objects in the scene.

This same principle can be used on sequences, such as the one-dimensional sequence of words in a movie review. The same properties that make the CNN model attractive for learning to recognize objects in images can help to learn structure in paragraphs of words, namely the technique’s invariance to the specific position of features.

Keras supports one-dimensional convolutions and pooling by the Conv1D and MaxPooling1D classes respectively.

Again, let’s import the classes and functions needed for this example and initialize our random number generator to a constant value so that we can easily reproduce results.
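
A sketch of the imports and seeding, again assuming the Keras 2 API (Keras 1 named the convolution class Convolution1D):

import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix the random seed for reproducibility
numpy.random.seed(7)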

We can also load and prepare our IMDB dataset as we did before.
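
For example, repeating the preparation from the previous section:

# load the dataset, keeping only the top 5,000 most frequent words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and zero-pad the reviews to a fixed length of 500 words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)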

We can now define our convolutional neural network model. This time, after the Embedding input layer, we insert a Conv1D layer. This convolutional layer has 32 feature maps and a kernel size of 3, meaning it reads the embedded word vectors 3 at a time.

The convolutional layer is followed by a 1D max pooling layer with a length and stride of 2 that halves the size of the feature maps from the convolutional layer. The rest of the network is the same as the neural network above.
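
A sketch of the model definition (padding='same' keeps the convolved sequence the same length as the input, which matches the layer summary discussed below):

# Embedding -> Conv1D -> MaxPooling1D -> Flatten -> Dense(250, relu) -> Dense(1, sigmoid)
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())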

We also fit the network the same as before.
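
For example, the same fit and evaluation as before:

# fit the model, using the test set as validation data
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# final evaluation of the model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))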

Running the example, we are first presented with a summary of the network structure. We can see that the convolutional layer preserves the dimensionality of the Embedding input layer (32-dimensional vectors for up to 500 words). The pooling layer compresses this representation by halving it.

Running the example offers a small but welcome improvement over the multi-layer perceptron model above, with an accuracy of 87.79%.

Again, there is a lot of opportunity for further optimization, such as the use of deeper and/or larger convolutional layers. One interesting idea is to set the max pooling layer to use a pool size of 500 (the full sequence length). This would compress the feature maps down to a single 32-element vector and may boost performance.

Summary

In this post, you discovered the IMDB sentiment analysis dataset for natural language processing.

You learned how to develop deep learning models for sentiment analysis including:

  • How to load and review the IMDB dataset within Keras.
  • How to develop a large neural network model for sentiment analysis.
  • How to develop a one-dimensional convolutional neural network model for sentiment analysis.

Do you have any questions about sentiment analysis or this post? Ask your questions in the comments and I will do my best to answer.


53 Responses to Predict Sentiment From Movie Reviews Using Deep Learning

  1. Vishal September 12, 2016 at 1:29 am #

    imdb.load_data(nb_words=5000, test_split=0.33)

    TypeError: load_data() got an unexpected keyword argument ‘test_split’

    The test_split argument doesn’t appear to exist in Keras 1.08, perhaps I’m doing something wrong?

    • Jason Brownlee September 12, 2016 at 8:33 am #

      The API has changed, sorry. I will update the example. You can remove the “test_split” argument.

      • Jason Brownlee October 7, 2016 at 2:06 pm #

        I have updated the example to match the API changes for Keras 1.1.0 and TensorFlow 0.10.0.

  2. Joe Williams October 3, 2016 at 12:52 am #

    Hi, Jason,

    Thanks for the great tutorial! How could I modify this to perform sentiment analysis based on user input? Or from a Twitter stream?

    Best wishes

    • Jason Brownlee October 3, 2016 at 5:19 am #

      You would have to encode the tweets using the same mapping from characters to integers.

      I do not have code for you to do this at the moment.

  3. Jie November 9, 2016 at 4:15 pm #

    1. embeddings are trainable, right? I mean, embeddings are dynamic, and they are changing during the training?

    2. How to save the embeddings to a file? We have to load the embeddings to predict new data in the future.

    3. I have a file named predict.py. I know how to load the model and graph architecture from here: http://machinelearningmastery.com/save-load-keras-deep-learning-models/
    But how to load embeddings in predict.py in order to predict new data?

    • Jason Brownlee November 10, 2016 at 7:38 am #

      Hi Jie, great questions.

      An embedding is a projection, they can be prepared from data. I would not say they are learned, but perhaps you could say that.

      I believe they can be prepared deterministically so we would not have to save them to file. I could be wrong about that, but that is my intuition.

  4. Maxim December 19, 2016 at 5:23 am #

    Jason, many thanks for your lessons! They’re amazing!!!!

    Maybe I ask you a very stupid question, but I can’t understand one thing. What is the Embedding layer? Could you show an example of it. I mean this is word vector with dimentionality of 500X32. So how it looks like?

    [0,0,1,….0,0,0] X 32
    .
    X500
    .
    [0,1,0,….0,0,0] X 32

    What digits in it? And why if we lift it dimensionality up to 64, the accuracy will rise up?

    Thank you!

  5. Martin December 27, 2016 at 7:27 am #

    test_split = 0.33

    This is defined, but not used anywhere in the code. why is that?

    • Jason Brownlee December 28, 2016 at 7:01 am #

      It is a typo and should be removed, thanks Martin. I will update the example soon.

  6. Anand December 28, 2016 at 7:16 pm #

    What params would you change typically to achieve what you have mentioned towards the end of the article to improve accuracy?

    To quote you: “set the max pooling layer to use an input length of 500. This would compress each feature map to a single 32 length vector”

    Could you please help what params (which lines) would need this change?

    • Jason Brownlee December 29, 2016 at 7:15 am #

      Hi Anand, use some trial and error and tune the method for the problem.

  7. Jim January 13, 2017 at 1:38 am #

    I am testing the predicted probabilities and values using model.predict(X_test) and model.predict_classes(X_test)

    I noticed that the predicted probabilities for the 0 class are all 0.5 w/ median predicted probability of 0.9673.

    Am I correct in assuming the model.predict returns probability of the 1 class always, and predicts a class of 0 when that probability is < 50%?

  8. brent January 22, 2017 at 4:59 am #

    When an element in the data is labeled as an integer, let’s say 4 for example, could that represent any word that has occurred 4 times in the data or is it a representation of a unique word?

    • Jason Brownlee January 22, 2017 at 5:13 am #

      Hi brent, each integer represents a unique word.

      • GT May 24, 2017 at 7:34 pm #

        Hi Jason,

        Thanks for the great tutorial, great as usual!

        You mentioned that “each integer represents a unique word”, why?

        My assumption is that we have mapped each word to its frequency in the whole corpus. If my assumption is true, so two words could come up with the same frequency. For example “Dog” and “Cat” both could repeat 10 times in the corpus.

        Could you please if my assumption is wrong?

        Thanks,

        • Jason Brownlee June 2, 2017 at 11:33 am #

          Words were ordered by frequency then assigned integers based on that frequency.

  9. Agustin February 3, 2017 at 9:15 am #

    Hello Jason, recently i became acquainted with the basics in machine learning and deep learning, in part thanks to the information provided in this site, which I found most insightful.
    However, lately I came upon a problem of generation simple and short questions automatically from a text. Due to my lack of knowledge and expertise i cant asses if it is possible to solve this problem with Deep Learning or any other method. Currently I have a database of several thousand questions based on around a hundred corpora that could be used as training data. Do you think I could get any successful results, and if so what approach will be the best? (Consider that even if it makes gibberish 50% of the time, it will still save a lot of work)

    • Jason Brownlee February 3, 2017 at 10:19 am #

      It does sound like a good deep learning problem Agustin.

      I’d recommend reading some papers on the topic to spark some ideas on methods and ways to represent the problem.

      Try searching on google scholar and on arvix.

  10. Chris Shumaker February 9, 2017 at 7:15 am #

    Hi, thank you for the example! Do you know the NumPy and matplotlib versions you were using in this example? I am having trouble with several methods like mean, std and box plot.

    • Jason Brownlee February 9, 2017 at 7:28 am #

      Hi Chris,

      This might be a Python 2 vs Python 3 issue, I used Python 2.

    • Chris Shumaker February 9, 2017 at 7:34 am #

      Actually, I am thinking that it is the call to map(). What version of Python are you using?

      • Chris Shumaker February 9, 2017 at 7:36 am #

        Sorry for post-spam. This works in Python 3:

        # Summarize review length
        print("Review length: ")
        result = list(map(len, X))
        print(type(result))
        print(result)

        Python 3.5.2 :: Anaconda 4.1.1 (x86_64)
        Keras (1.2.1)
        tensorflow (0.12.1)
        numpy (1.11.2)
        matplotlib (1.5.3)

  11. Akshit Bhatia February 12, 2017 at 6:44 am #

    Hey,
    Can you give me some idea on how to implement other deep learning techniques such as recursive autoencoders (RAE) or the RBM deep learning algorithm for sentiment analysis?
    Any help would be appreciated :).

    Regards

    • Jason Brownlee February 13, 2017 at 9:08 am #

      Hi Akshit,

      I don’t have an example of RAE or RBM.

      This post has an example of sentiment analysis that you can use as a starting point.

  12. Zhang February 12, 2017 at 8:47 am #

    Hello, thanks for the example. I really appreciate you if you suggest me why I got this error.

    File “C:\Anaconda\lib\site-packages\theano-0.9.0.dev4-py2.7.egg\theano\gof\cmodule.py”, line 2236, in compile_str
    raise MissingGXX(“g++ not available! We can’t compile c code.”)

    MissingGXX: (‘The following error happened while compiling the node’, Shape_i{1}(embedding_2_W), ‘\n’, “g++ not available! We can’t compile c code.”, ‘[Shape_i{1}(embedding_2_W)]’)

    • Chri February 12, 2017 at 2:51 pm #

      @Zhang, looks like you have a beta version of Theano. If you’re just looking to get started, maybe you want to try a stable channel instead? Looks like your error is because you’re installing from source and your environment isn’t set up quite right.

    • Jason Brownlee February 13, 2017 at 9:09 am #

      Hi Zhang,

      It looks like g++ is not available. I’m not a windows user, I’m not sure how to interpret this message.

      Consider searching or posting on stack overflow.

  13. Chris February 12, 2017 at 2:55 pm #

    @Jason, thanks for your reply and thanks again for the post!

    I am having trouble improving the results on this model. I have changed the pool_length (500,250,125,375,5,10,15,20), tried adding another dense layer at size 250 and 500, and changed the number of epochs (25,50).

    Do you have any recommendations for tweaking the model? I tried the suggestions (deeper, larger, pool_length, and also number of epochs). Do you have any tips or reading suggestions for improving performance in general? This seems to be my last ‘wall’ to really being able to do ML.

    Thanks!

  14. Kiran March 3, 2017 at 4:34 pm #

    Hi jason, I removed the denser layer with 250 neurons and it reduced the number of parameters to be trained drastically with an increased accuracy of about 1% over 5 epochs. Any idea why you added 2 dense layers after flatten layer?

    • Jason Brownlee March 6, 2017 at 10:42 am #

      Well done Kiran.

      I came up with the configuration after some brief trial and error. It was not optimized.

  15. Dan March 12, 2017 at 12:02 am #

    Does it make sense to specify validation_data as X_test,y_test in the fit function if we evaluate our model in the scores function afterwards? Or could we skip specify validation_data in model.fit(…)?

    • Jason Brownlee March 12, 2017 at 8:28 am #

      No, you will get this for free when fitting your model. The validation data should be different from training data and is completely optional.

  16. Harish March 23, 2017 at 5:54 pm #

    What is the accuracy of this approach? Which approach is better to get accuracy of at least 0.89?

  17. Prakash April 12, 2017 at 8:23 am #

    I tried the sentiment analysis with convolutional NNs and LSTMs and find the CNNs give higher accuracy. Any insight into why?

    • Jason Brownlee April 12, 2017 at 9:35 am #

      Some ideas:

      Perhaps the CNNs are better at capturing the spatial relationships.

      Perhaps the LSTM needs to be larger and trained for longer to achieve the same skill.

  18. Patrick April 17, 2017 at 9:20 pm #

    Please add these to the imports

    from keras.preprocessing import sequence
    from keras.layers.embeddings import Embedding

    • Jason Brownlee April 18, 2017 at 8:32 am #

      These are listed in the imports for the section titled “One-Dimensional Convolutional Neural Network Model for the IMDB Dataset”

  19. Trang April 27, 2017 at 2:21 pm #

    Hi, Jason, i have the same question with Maxim. Can you tell me why is that.Thank you
    Maybe I ask you a very stupid question, but I can’t understand one thing. What is the Embedding layer? Could you show an example of it. I mean this is word vector with dimentionality of 500X32. So how it looks like?

    [0,0,1,….0,0,0] X 32
    .
    X500
    .
    [0,1,0,….0,0,0] X 32

    What digits in it? And why if we lift it dimensionality up to 64, the accuracy will rise up?

    Thank you!

  20. Ahmed May 10, 2017 at 3:19 am #

    Thanks So So Much For This Awesome Tutorial .
    However , I’m Asking About How To Use This To Predict The Opinion
    I Still Don’t know , for instance i need to know the movie is good or bad , or if i used a twitter data set i need to know the public opinion summary about a specific hashtag or topic

    I tried More And More But i Failed as i’m still Beginner

    Thanks In Advance <3

  21. Hamza May 24, 2017 at 5:30 am #

    Well, thank you so much for this great work.

    I have a question here. I didn’t understand why we use ReLU instead of tanh as the activation function. Most people use SGD or backpropagation for training. What did we use here? I do not know about ADAM. Can you please explain why did you use it for training?

    • Jason Brownlee June 2, 2017 at 11:28 am #

      ReLU has better properties than sigmoid or tanh and has become the new defacto standard.

      We did use SGD to fit the model, we just used a more fancy version called Adam.

  22. Adnan June 12, 2017 at 2:49 am #

    Hi, thanks a lot for this HELPFUL tutorial. I have a question, could be an odd one. what if we use pre-trained word2vec model. I mean if we just use pre-trained word2vec model and train our neural network with movie reviews data. Correct me if I am wrong!

    Or best way is to train word2vec with movie reviews data, then train neural network with same movie reviews data then try.

    Kindly guide me. Thanks

    • Jason Brownlee June 12, 2017 at 7:11 am #

      Sounds great, try it.

      • Adnan June 12, 2017 at 7:53 pm #

        but should I use pre-trained word2vec model (trained with wiki data) or train from scratch with by using movie reviews data or amazon product reviews data. thanks

  23. Alok June 22, 2017 at 12:42 am #

    Hi Jason,

    I am trying to use tfidf matrix as input to my cnn for document classification. I don’t want to use embedding layer. Can you help me how this can be achieved. I have not seen any example which shows tfidf as input to Cov1d layer in Keras. Please help
