How to Develop a Multichannel CNN Model for Text Classification

A standard deep learning model for text classification and sentiment analysis uses a word embedding layer and a one-dimensional convolutional neural network.

The model can be expanded by using multiple parallel convolutional neural networks that read the source document using different kernel sizes. This, in effect, creates a multichannel convolutional neural network for text that reads text with different n-gram sizes (groups of words).

In this tutorial, you will discover how to develop a multichannel convolutional neural network for sentiment prediction on text movie review data.

After completing this tutorial, you will know:

  • How to prepare movie review text data for modeling.
  • How to develop a multichannel convolutional neural network for text in Keras.
  • How to evaluate a fit model on unseen movie review data.

Let’s get started.

  • Update Feb/2018: Small code change to reflect changes in Keras 2.1.3 API.
  • Update Aug/2020: Updated link to movie review dataset.
How to Develop an N-gram Multichannel Convolutional Neural Network for Sentiment Analysis
Photo by Ed Dunens, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Movie Review Dataset
  2. Data Preparation
  3. Develop Multichannel Model
  4. Evaluate Model

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as “v2.0”.

The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from a newsgroup archive. The authors refer to this dataset as the “polarity dataset.”

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat; for example:

  • The dataset is comprised of only English reviews.
  • All text has been converted to lowercase.
  • There is white space around punctuation like periods, commas, and brackets.
  • Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of machine learning models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments of modern methods.

… depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

After unzipping the file, you will have a directory called “txt_sentoken” with two sub-directories, “neg” and “pos”, containing the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention of cv000 to cv999 within each of neg and pos.

Next, let’s look at loading and preparing the text data.

Data Preparation

In this section, we will look at 3 things:

  1. Separation of data into training and test sets.
  2. Loading and cleaning the data to remove punctuation and numbers.
  3. Preparing all reviews and saving them to file.

Split into Train and Test Sets

We are pretending that we are developing a system that can predict the sentiment of a textual movie review as either positive or negative.

This means that after the model is developed, we will need to make predictions on new textual reviews. This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the model.

We will ensure that this constraint is built into the evaluation of our models by splitting the training and test datasets prior to any data preparation. This means that any knowledge in the data in the test set that could help us better prepare the data (e.g. the words used) is unavailable in the preparation of data used for training the model.

That being said, we will use the last 100 positive reviews and the last 100 negative reviews as a test set (200 reviews) and the remaining 1,800 reviews as the training dataset.

This is a 90% train, 10% test split of the data.

The split can be imposed easily by using the filenames of the reviews where reviews named 000 to 899 are for training data and reviews named 900 onwards are for test.
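The filename rule can be sketched in a few lines. This is a minimal illustration, assuming each review file name starts with "cv" followed by the three-digit review number; the helper names is_test_file() and select_files() and the example file names are hypothetical.

```python
def is_test_file(filename):
    """Return True for reviews numbered 900-999, which are held out for testing."""
    # Assumption: names look like 'cv' + three-digit review number + anything else.
    return filename.startswith("cv9")


def select_files(filenames, is_train):
    """Keep only the training files (000-899) or only the test files (900-999)."""
    return [name for name in filenames if is_test_file(name) != is_train]


# Made-up file names that follow the assumed convention.
names = ["cv000_a.txt", "cv899_b.txt", "cv900_c.txt", "cv999_d.txt"]
print(select_files(names, is_train=True))   # training files only
print(select_files(names, is_train=False))  # test files only
```

Because the decision depends only on the file name, nothing from the test reviews themselves is inspected while preparing the training data.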

Loading and Cleaning Reviews

The text data is already pretty clean; not much preparation is required.

Without getting bogged down too much by the details, we will prepare the data in the following way:

  • Split tokens on white space.
  • Remove all punctuation from words.
  • Remove all words that are not purely comprised of alphabetical characters.
  • Remove all words that are known stop words.
  • Remove all words that have a length <= 1 character.

We can put all of these steps into a function called clean_doc() that takes as an argument the raw text loaded from a file and returns a list of cleaned tokens. We can also define a function load_doc() that loads a document from file ready for use with the clean_doc() function. An example of cleaning the first positive review is listed below.
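The cleaning steps listed above can be sketched as follows. This is an illustration using only the Python standard library: the small STOP_WORDS set is a made-up stand-in (a real run would use a fuller stop word list), and the sample sentence is invented rather than taken from the dataset.

```python
import string

# Assumption: a tiny illustrative stop word list, not a complete one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "it", "to", "this", "that"}


def load_doc(filename):
    """Read a whole review file into memory as one string."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()


def clean_doc(doc):
    """Turn raw review text into a list of cleaned tokens."""
    tokens = doc.split()                                    # split on white space
    table = str.maketrans("", "", string.punctuation)
    tokens = [word.translate(table) for word in tokens]     # strip punctuation
    tokens = [word for word in tokens if word.isalpha()]    # alphabetic words only
    tokens = [word for word in tokens if word not in STOP_WORDS]
    tokens = [word for word in tokens if len(word) > 1]     # drop 1-character words
    return tokens


sample = "the plot is thin , but a few scenes ( 2 of them ) really work !"
print(clean_doc(sample))
```

With the dataset unzipped, the same call would be made on the text returned by load_doc() for a review file instead of on the sample string.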

Running the example loads and cleans one movie review.

The tokens from the clean review are printed for review.

Clean All Reviews and Save

We can now apply the clean_doc() function to all reviews.

To do this, we will develop a new function named process_docs() below that will walk through all reviews in a directory, clean them, and return them as a list.

We will also add an argument to the function to indicate whether the function is processing train or test reviews, that way the filenames can be filtered (as described above) and only those train or test reviews requested will be cleaned and returned.

The full function is listed below.

We can call this function with negative training reviews as follows:
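A minimal sketch of process_docs() and a call to it is given here. To keep the block runnable on its own it carries a simplified copy of clean_doc() (no stop word removal) and is demonstrated on a temporary directory holding two made-up reviews. Storing each review as one space-joined string is a choice of this sketch, and the "cv9" prefix test assumes the file naming described earlier.

```python
import os
import string
import tempfile


def clean_doc(doc):
    """Simplified stand-in for the cleaning step so this block runs on its own."""
    table = str.maketrans("", "", string.punctuation)
    tokens = [word.translate(table) for word in doc.split()]
    return [word for word in tokens if word.isalpha() and len(word) > 1]


def process_docs(directory, is_train):
    """Clean every review in `directory` that belongs to the requested split."""
    documents = []
    for filename in sorted(os.listdir(directory)):
        is_test_file = filename.startswith("cv9")  # reviews 900-999 are test data
        if is_train == is_test_file:
            continue  # this file belongs to the other split
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
            tokens = clean_doc(f.read())
        # Store each review as one space-joined string of cleaned tokens.
        documents.append(" ".join(tokens))
    return documents


# Demonstration on a throwaway directory holding two tiny made-up reviews.
with tempfile.TemporaryDirectory() as neg_dir:
    for name, text in [("cv000_x.txt", "a dull , slow film ."),
                       ("cv900_y.txt", "truly awful acting !")]:
        with open(os.path.join(neg_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    negative_train = process_docs(neg_dir, is_train=True)
    negative_test = process_docs(neg_dir, is_train=False)

print(negative_train)
print(negative_test)
```

On the real data, the directory argument would be the "neg" or "pos" sub-directory of txt_sentoken rather than a temporary directory.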

Next, we need labels for the train and test documents. We know that we have 900 training documents and 100 test documents for each class. We can use a Python list comprehension to create the labels for the negative (0) and positive (1) reviews for both train and test sets.
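The label lists can be sketched as below, assuming the negative documents are placed before the positive documents in each set; the variable names are illustrative.

```python
# 900 negative then 900 positive training reviews; 100 of each for testing.
train_labels = [0 for _ in range(900)] + [1 for _ in range(900)]
test_labels = [0 for _ in range(100)] + [1 for _ in range(100)]

print(len(train_labels), sum(train_labels))  # total labels, number of positives
print(len(test_labels), sum(test_labels))
```

The order of the labels must match the order in which the cleaned documents are concatenated, otherwise reviews would be paired with the wrong class.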

Finally, we want to save the prepared train and test sets to file so that we can load them later for modeling and model evaluation.

The function below, named save_dataset(), will save a given prepared dataset (X and y elements) to a file using the pickle API.
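A minimal sketch of such a function, assuming the dataset is passed as a two-element list of documents and labels. The example data is made up and is written to a temporary directory so that running the block does not touch the working directory.

```python
import os
import pickle
import tempfile


def save_dataset(dataset, filename):
    """Pickle a prepared dataset, here a [documents, labels] pair, to `filename`."""
    with open(filename, "wb") as f:
        pickle.dump(dataset, f)
    print("Saved: %s" % filename)


# Tiny made-up dataset written to a temporary directory for demonstration.
train_docs = ["dull slow film", "warm funny story"]
train_labels = [0, 1]
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.pkl")
save_dataset([train_docs, train_labels], train_path)
```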

Complete Example

We can tie all of these data preparation steps together.

The complete example is listed below.

Running the example cleans the text movie review documents, creates labels, and saves the prepared data for both train and test datasets in train.pkl and test.pkl respectively.

Now we are ready to develop our model.

Develop Multichannel Model

In this section, we will develop a multichannel convolutional neural network for the sentiment analysis prediction problem.

This section is divided into 3 parts:

  1. Encode Data
  2. Define Model
  3. Complete Example

Encode Data

The first step is to load the cleaned training dataset.

The function below, named load_dataset(), can be called to load the pickled training dataset.
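A minimal sketch of the loading step, assuming the file holds a pickled two-element list of documents and labels. The block first writes a small made-up dataset to a temporary file so it can run without train.pkl being present.

```python
import os
import pickle
import tempfile


def load_dataset(filename):
    """Load a pickled [documents, labels] pair saved earlier."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# Round trip on a made-up dataset so the block runs without train.pkl present.
path = os.path.join(tempfile.mkdtemp(), "train.pkl")
with open(path, "wb") as f:
    pickle.dump([["dull slow film", "warm funny story"], [0, 1]], f)

train_lines, train_labels = load_dataset(path)
print(len(train_lines), train_labels)
```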

Next, we must fit a Keras Tokenizer on the training dataset. We will use this tokenizer to both define the vocabulary for the Embedding layer and encode the review documents as integers.

The function create_tokenizer() below will create a Tokenizer given a list of documents.

We also need to know the maximum length of input sequences as input for the model and to pad all sequences to the fixed length.

The function max_length() below will calculate the maximum length (number of words) for all reviews in the training dataset.
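A minimal sketch, assuming each document is a string of space-separated cleaned tokens; the two example documents are made up.

```python
def max_length(lines):
    """Return the number of words in the longest document."""
    return max(len(line.split()) for line in lines)


train_lines = ["dull slow film", "warm funny story with heart"]
length = max_length(train_lines)
print("Max document length: %d" % length)
```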

We also need to know the size of the vocabulary for the Embedding layer.

This can be calculated from the prepared Tokenizer, as follows:
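To make the "plus one" in the vocabulary size concrete, the sketch below builds a word-to-integer index in plain Python. It is a stand-in written for illustration, not the Keras Tokenizer itself. The helper name build_word_index() and the alphabetical tie-breaking are assumptions of this sketch. With a fitted Keras Tokenizer, the same quantity would typically be the length of its word_index attribute plus one.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # Ties are broken alphabetically here; that is a choice of this sketch.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ranked, start=1)}


train_lines = ["dull slow film", "warm funny film"]
word_index = build_word_index(train_lines)
# Index 0 is never assigned to a word, so the embedding needs one extra row.
vocab_size = len(word_index) + 1
print(word_index)
print("Vocabulary size: %d" % vocab_size)
```

Reserving index 0 matters because zero is later used as the padding value, so it must not collide with a real word.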

Finally, we can integer encode and pad the clean movie review text.

The function below, named encode_text(), will both encode and pad text data to the maximum review length.
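The idea of encoding and padding can be sketched in plain Python. This version takes a word-to-integer dictionary instead of a fitted Keras Tokenizer, so it is a stand-in for illustration rather than the Keras calls. Padding with zeros at the end of each sequence and dropping unknown words are assumptions of this sketch.

```python
def encode_text(word_index, lines, length):
    """Integer-encode each document, then pad with zeros up to `length`."""
    encoded = []
    for line in lines:
        # Words missing from the vocabulary are dropped.
        ids = [word_index[word] for word in line.split() if word in word_index]
        ids = ids[:length]                      # truncate overly long documents
        ids = ids + [0] * (length - len(ids))   # pad at the end (an assumption)
        encoded.append(ids)
    return encoded


word_index = {"film": 1, "dull": 2, "funny": 3, "slow": 4, "warm": 5}
train_x = encode_text(word_index, ["dull slow film", "warm film", "odd film"], 4)
print(train_x)
```

Every row has the same length after padding, which is what allows the reviews to be stacked into one fixed-size input array for the model.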

Define Model

A standard model for document classification is to use an Embedding layer as input, followed by a one-dimensional convolutional neural network, pooling layer, and then a prediction output layer.

The kernel size in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document, providing a grouping parameter.

A multi-channel convolutional neural network for document classification involves using multiple versions of the standard model with different sized kernels. This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the model learns how to best integrate these interpretations.

This approach was first described by Yoon Kim in his 2014 paper titled “Convolutional Neural Networks for Sentence Classification.”

In the paper, Kim experimented with static and dynamic (updated) embedding layers; we can simplify the approach and instead focus only on the use of different kernel sizes.

This approach is best understood with a diagram taken from Kim’s paper:

Depiction of the multiple-channel convolutional neural network for text.
Taken from “Convolutional Neural Networks for Sentence Classification.”

In Keras, a multiple-input model can be defined using the functional API.

We will define a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text.

Each channel is comprised of the following elements:

  • Input layer that defines the length of input sequences.
  • Embedding layer set to the size of the vocabulary and 100-dimensional real-valued representations.
  • One-dimensional convolutional layer with 32 filters and a kernel size set to the number of words to read at once.
  • Max Pooling layer to consolidate the output from the convolutional layer.
  • Flatten layer to reduce the three-dimensional output to two dimensional for concatenation.

The outputs from the three channels are concatenated into a single vector and processed by a Dense layer and an output layer.
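The sizes flowing through each channel can be worked out with simple arithmetic. The sketch below assumes a convolution without padding and with a stride of 1, non-overlapping max pooling with a pool size of 2, and a hypothetical input length of 100 words. None of these values are stated in the list above, so treat them as illustrative.

```python
def channel_output_size(length, kernel_size, filters=32, pool_size=2):
    """Length of the flattened vector produced by one channel."""
    conv_steps = length - kernel_size + 1    # unpadded convolution, stride 1
    pooled_steps = conv_steps // pool_size   # non-overlapping max pooling
    return pooled_steps * filters            # time steps x filters after Flatten


length = 100  # hypothetical padded review length
sizes = {k: channel_output_size(length, k) for k in (4, 6, 8)}
merged = sum(sizes.values())

for kernel_size, size in sizes.items():
    print("kernel size %d -> %d features" % (kernel_size, size))
print("concatenated vector: %d features" % merged)
```

Larger kernels yield slightly shorter feature vectors because fewer window positions fit into the same input length.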

The function below defines and returns the model. As part of defining the model, a summary of the defined model is printed and a plot of the model graph is created and saved to file.
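A sketch of such a function using the Keras functional API through tensorflow.keras is given here. Several details are assumptions that the text does not specify: the ReLU activations, a dropout rate of 0.5 in each channel, a pool size of 2, a 10-unit Dense layer, a single sigmoid output and the Adam optimizer. The helper build_channel() is a hypothetical name. The plot of the model graph is left out because it needs extra plotting dependencies.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     Flatten, Input, MaxPooling1D, concatenate)
from tensorflow.keras.models import Model


def build_channel(length, vocab_size, kernel_size):
    """One channel: input, embedding, convolution, pooling and flatten."""
    inputs = Input(shape=(length,))
    x = Embedding(vocab_size, 100)(inputs)
    x = Conv1D(filters=32, kernel_size=kernel_size, activation="relu")(x)
    x = Dropout(0.5)(x)               # assumption: dropout rate of 0.5
    x = MaxPooling1D(pool_size=2)(x)  # assumption: pool size of 2
    x = Flatten()(x)
    return inputs, x


def define_model(length, vocab_size):
    """Three channels reading 4-grams, 6-grams and 8-grams, merged for prediction."""
    channels = [build_channel(length, vocab_size, k) for k in (4, 6, 8)]
    merged = concatenate([features for _, features in channels])
    dense = Dense(10, activation="relu")(merged)     # assumption: 10 units
    outputs = Dense(1, activation="sigmoid")(dense)  # positive vs. negative
    model = Model(inputs=[inputs for inputs, _ in channels], outputs=outputs)
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.summary()
    return model


# Small hypothetical sizes, just to check that the graph builds.
model = define_model(length=50, vocab_size=1000)
```

Because the model has three Input layers, the same encoded reviews are supplied three times when fitting or predicting, once per channel.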

Complete Example

Pulling all of this together, the complete example is listed below.

Running the example first prints a summary of the prepared training dataset.

Next, a summary of the defined model is printed.

The model is fit relatively quickly and appears to show good skill on the training dataset.

A plot of the defined model is saved to file, clearly showing the three input channels for the model.

Plot of the Multichannel Convolutional Neural Network For Text

The model is fit for a number of epochs and saved to the file model.h5 for later evaluation.

Evaluate Model

In this section, we can evaluate the fit model by predicting the sentiment on all reviews in the unseen test dataset.

Using the data loading functions developed in the previous section, we can load and encode both the training and test datasets.

We can load the saved model and evaluate it on both the training and test datasets.
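For a model with a single sigmoid output, the accuracy being reported amounts to thresholding each predicted probability at 0.5 and counting matches with the true labels. The sketch below shows that calculation in plain Python on made-up probabilities. It is not the Keras evaluation call, and the function name and values are hypothetical.

```python
def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    """Percentage of reviews whose thresholded prediction matches the label."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == y)
    return 100.0 * correct / len(labels)


# Made-up predicted probabilities of the positive class and their true labels.
probabilities = [0.91, 0.12, 0.74, 0.48, 0.35, 0.66, 0.08, 0.59]
labels = [1, 0, 1, 1, 0, 0, 0, 1]
print("Accuracy: %.1f%%" % accuracy_from_probabilities(probabilities, labels))
```

The same thresholding gives a class label for an individual review, which is useful when a per-review prediction is wanted rather than an aggregate score.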

The complete example is listed below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example prints the skill of the model on both the training and test datasets.

We can see that, as expected, the skill on the training dataset is excellent, here at 100% accuracy.

We can also see that the skill of the model on the unseen test dataset is very impressive, achieving 87.5%, which is above the skill of the model reported in the 2014 paper (although not a direct apples-to-apples comparison).


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Different n-grams. Explore the model by changing the kernel size (the number of words in each n-gram) used by the channels in the model to see how it impacts model skill.
  • More or Fewer Channels. Explore using more or fewer channels in the model and see how it impacts model skill.
  • Deeper Network. Convolutional neural networks perform better in computer vision when they are deeper. Explore using deeper models here and see how it impacts model skill.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to develop a multichannel convolutional neural network for sentiment prediction on text movie review data.

Specifically, you learned:

  • How to prepare movie review text data for modeling.
  • How to develop a multichannel convolutional neural network for text in Keras.
  • How to evaluate a fit model on unseen movie review data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

214 Responses to How to Develop a Multichannel CNN Model for Text Classification

  1. Hitkul January 12, 2018 at 5:54 pm

    Great article
    I think there is a small error, According to this code trainLines would be a list of list, where each list holds tokens for one review. But all the functions to which trainLines is passed(i.e texts_to_sequences, fit_on_texts, max_length) takes a list of strings as input. I think all the lists in trainX and testX should be converted to strings before dumping them into a file.

    • Jason Brownlee January 13, 2018 at 5:30 am

      The list of tokens is turned back into a string:

      tokens = ' '.join(tokens)
      • Minel March 5, 2019 at 9:27 pm

        Hello Jason
        yes but this instruction is missing in your code
        I got a traceback with this message list object has no attribute lower

        So I added one line in the clean_doc function
        tokens = [word for word in tokens if len(word) > 1]
        tokens = ' '.join(tokens)
        return tokens
        It works well

  2. Marko Plahuta January 12, 2018 at 6:41 pm

    thanks for the article. Would it be possible to use only one input and one embedding layer, and branch into convolutions after that?

    • Jason Brownlee January 13, 2018 at 5:31 am

      How would that work?

      • Francesco January 25, 2018 at 9:13 pm

        Wouldn’t the model below be exactly the same? (with just one input, used in the three channels?) Of course, if we also had a single embedding (put the embedding before the channels, and let all convs act on that one) the model would be different.

        • Jason Brownlee January 26, 2018 at 5:40 am

          Yes, nice approach.

          How does it compare regarding training/prediction?

          • HSA February 5, 2020 at 9:41 am

            given that I run the code using different dataset and embedding, Mr.Jason code gave me:

            ModelMulitCNN MODEL Accuracy: 0.8785238339313173
            ModelMulitCNN MODEL precision_score: 0.7633378932968536
            ModelMulitCNN MODEL recall_score: 0.6495925494761351
            ModelMulitCNN MODEL f1_score: 0.7018867924528303

            while Mr.Francesco code gave me:

            ModelMulitCNN MODEL Accuracy: 0.8782675550999487
            ModelMulitCNN MODEL precision_score: 0.7424242424242424
            ModelMulitCNN MODEL recall_score: 0.6845168800931315
            ModelMulitCNN MODEL f1_score: 0.7122955784373107

          • Jason Brownlee February 5, 2020 at 1:41 pm

            Nice, thanks for sharing.

          • Tam June 13, 2020 at 4:28 am

            Hi! I followed your tutorial with my own dataset and I changed the last layer for regression task. I got an error like this:

            AssertionError: Could not compute output Tensor(“dense_19/Identity:0”, shape=(None, 1), dtype=float32)

            How can I fix it? Thank you so much.

          • Jason Brownlee June 13, 2020 at 6:10 am

            You will need to debug the error, I cannot tell the cause off the cuff.

          • Tam June 13, 2020 at 2:04 pm

            I have tried to debug the error, but it has no improvement. I followed Francesco’s modification when using 1 input layer for all 3 channels, it works.

            In the define model:

            model = Model(inputs=[inputs], outputs=outputs)


            If I use 1 layer for all 3 channels, it works well. But if I use separated inputs as you used:

            inputs1 = Input(shape=(length,))
            embedding1 = Embedding(vocab_size, 100)(inputs1)

            inputs2 = Input(shape=(length,))
            embedding2 = Embedding(vocab_size, 100)(inputs2)

            inputs3 = Input(shape=(length,))
            embedding3 = Embedding(vocab_size, 100)(inputs3)

            model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)


            it didn’t work and got the above error. I wonder if there is any problem with this line: model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs) ?

            Thank you!

          • Jason Brownlee June 14, 2020 at 6:30 am

            Perhaps try posting your code and error on stackoverflow.

      • Nadeem Anwar October 28, 2021 at 1:25 am

        I’m new to deep learning. Want to develop a CNN model from scratch for text detection in scene images. Can you help me?

  3. Adnan ÖNCEVARLIK January 13, 2018 at 12:31 am

    Thank you for your effort and good clean article.

  4. Sébastien January 14, 2018 at 3:18 am

    FYI, the link to the review polarity dataset is wrong. The correct one is:

  5. Ahmet January 14, 2018 at 3:23 am

    Hi Jason,
    Thanks for your this beautiful work. I want to ask you that we are able to evaluate the accuracy of the model but how can we predict the classes of tested documents to analyze results in detail. In a Sequential model we can perform it like model.predict_classes(x_test). However for the Model(inputs=…) object predict_classes feature is not supported. Do you have a suggestion?

    • Jason Brownlee January 14, 2018 at 6:40 am

      Yes, the document must be provided in an array 3 times.

      yhat = model.predict([doc, doc, doc])

      • ahmet January 14, 2018 at 7:53 am

        Yes I also tried like that but I am not sure how to interpret the probabilities for multiclass classification.

        • Jason Brownlee January 15, 2018 at 6:54 am

          What is the problem exactly?

          • Ahmet January 15, 2018 at 7:29 pm

            Thanks for interest, Jason.
            I am actually trying to apply your post to a multiclass case (0:negative,1:neutral,2:positive). After the training part I want to compare the accuracy rates for each class to measure how accurate the model is in detail. Then I want to calculate the Precision and Recall. However, I cannot use the predict_classes() function on a Model() object; it is only allowed for a Sequential() object. When I use the function model.predict([test_doc, test_doc, test_doc]) it gives me some probabilities like the one below, but I am not sure how to map them to class labels (0,1,2).


            [0.7408592 ]

          • Jason Brownlee January 16, 2018 at 7:32 am

            You can use argmax on the result from predict() to get a class index.

    • rango July 4, 2019 at 2:05 pm

      hi Jason,
      I also try your post for text classification.However when I try to apply argmax() function like you said I get an array of 0.

      array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0])

      Do you Have any suggestion? Thank in advance.

      Before predict class I prepare text sample like this:

      sample = “This is sample text for classification”
      sequences = tokenizer.texts_to_sequences(sample)
      test_data = pad_sequences(sequences, maxlen=maxlen)

      predict = model.predict([test_data,test_data,test_data])

      If I do something wrong plz correct me! Thank You

  6. Rui January 14, 2018 at 4:33 am

    I have tried similar architectures and came to the conclusion that this type of architecture (with parallel paths) is not good, because when the error is back-propagated, one of the good paths, which would be learning in the right way, will be affected by a bad path that increases the global error. So one good path would be on the right track, but it will think it is learning badly because the global error increases. (Is this idea right?)

    When I have “parallel layers” I have to find a way to pretrain before … and then freeze the values and add these layers to the model.

  7. JP Raymond January 14, 2018 at 12:34 pm

    The architecture described in Yoon Kim’s paper:
    – has one embedding layer (not one for each branch),
    – uses global (max-over-time) pooling (not with a pool of size 2),
    – applies dropout once, on the concatenation of the max features (not in each branch before the pooling operation).

    Do the changes proposed here yield a better predictive performance?

  8. Dev January 18, 2018 at 3:10 am

    Hi Jason,

    Thanks for this article. Very helpful.

    I am facing an issue with the output, like getting different results everytime when I run the same model. The loss and accuracy are also changing all the time.

    Any thoughts on this ?

  9. Ramzy January 21, 2018 at 7:40 am

    Hi Jason,

    Thanks for the great tutorial, I performed all the steps and in the last step when I tried to evaluate the model, I got the below results!

    Train Accuracy: 50.000000
    Test Accuracy: 50.000000
    What does this mean! I think this is not logical.

    Also I got some warning like the below while running the evaluation code.

    2018-01-20 22:21:41.333198: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\] The TensorFlow library wasn’t compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
    2018-01-20 22:21:41.333808: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\] The TensorFlow library wasn’t compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

    I’ll appreciate your reply. Thanks in advance

  10. Mahfuza January 24, 2018 at 11:40 am

    Hi, First of all thank you so much for the article.

    I tried to utilize the multichannel concept in a different way. I am working with relation extraction in natural language where relations are marked already but needs to figure out whether or not there is a valid sentence structure in between those entities. I am thinking to get different representations of a same sentence. For an example sentence (Paris is the capital of France), the real words of the sentence (Paris, is, the, capital, of, France) as first channel’s data, the POS tags of those same tokens (propn, verb, det, noun, adp, propn) as second channel’s data and, dependency tree tags (nsubj, root, det, attr, prep, pobj) as third channel’s data.

    I am confused whether I need to encode all the words, pos tags, and dep tags thinking them as a part of the same vocabulary? Or I need to encode those different tokens/representations in their own vocabulary scope?


    • Jason Brownlee January 24, 2018 at 4:39 pm

      Interesting approach.

      I guess you could try a few different representations and test whether there is any impact on model skill?

  11. Mahfuza January 25, 2018 at 5:51 am

    Actually I am confused on the text to integer encoding step. I mean, should I consider all the different representations of the same sentence as a single list of documents? For example lets say I got three different representations of the sentence “Paris is the capital of France” and collected them in a single document list like below –

    all_docs = [[Paris is the capital of France]…, [propn verb det noun adp propn]…, [nsubj root det attr prep pobj]…]

    Now if I fit the tokenizer like tokenizer.fit_on_texts(all_docs) into these documents then all the tokens are having different unique integers, right? Is it then valuable if I feed those three encoded representations separately into different channels like input1 = [ints of tokens], input2 = [more ints of tokens], input3 = [some ints of tokens]?


    I should consider fitting the tokenizer in three different representations separately and then encode them separately based on the fitted tokenizer. I think the tokenizer would provide some common integers then. Because the encoding scopes are isolated. Will that help in the concatenated layer some how?

    Thanks again Jason for your reply and effort for this blog.

  12. Francesco January 25, 2018 at 9:19 pm

    Hi Jason,

    would it be possible, looking at the activation patterns, to identify the n-ngrams that mostly affected the classification? With pictures it is possible, but I suppose that with text it should be very difficult, with the embedding inbetween…


    • Jason Brownlee January 26, 2018 at 5:40 am

      It may be, but some hard thinking and development would be required. It’s not obvious sorry.

  13. Massa February 1, 2018 at 7:32 am


    Thank you for this neat article! 🙂

    I am facing an issue.

    AttributeError: ‘int’ object has no attribute ‘ndim’

    • Jason Brownlee February 2, 2018 at 8:02 am

      Did you try copying all of the code from the final example?

    • Mia February 9, 2018 at 5:31 pm

      I also had the same issue. The reason is that variable is a list object while Keras expects a numpy array. I solved the problem by importing numpy and inserting “trainLabels = numpy.array(trainLabels)”

      • Mia February 9, 2018 at 5:35 pm

        i mean variable trainLabels .

        • Jason Brownlee February 14, 2018 at 11:03 am

          I have updated the example to correct this issue.

          It seems to be a change in Keras 2.1.3.

  14. JanneK February 5, 2018 at 5:44 am

    Thank for this useful post Jason. It’s interesting that even such a minimalistic preprocessing always seem to work fine. When doing text classification (or regression) tasks, do you generally see any value in keeping/including additional information during preprocessing, such as:

    (1) Basic sentence-separating punctuations, such as “.”, “!” and “?”
    (2) Paragraph separators between groups of sentences
    (3) POS tag augmented tokens, e.g., “walking” –> “walking_VERB”, “apple” –> “apple_NOUN”

    Do these typically have any significant impact?

    • Jason Brownlee February 5, 2018 at 7:53 am

      It really depends on the application I’m afraid.

      For simple classification like tasks, often simpler representations result in better model skill.

  15. Jason H February 7, 2018 at 7:13 am

    You should be able to comment out inputs2, embedding2, inputs3 and embedding3, and then feed embedding1 to conv2 and conv3, right? Then you go from [x_in, x_in, x_in] to just x_in for fit(), and predict().

  16. Franco Arda February 14, 2018 at 7:01 pm

    Hi Jason,

    In your NLP book (Chapter 16), you cover n-gram CNN in depth. LSTM’s memory mimic n-grams.

    May I ask you why you not use a LSTM classification instead?


    • Jason Brownlee February 15, 2018 at 8:39 am

      You can use LSTMs, here I was demonstrating how to do it with CNNs.

      • Franco February 15, 2018 at 8:16 pm

        Thanks Jason!

        Indeed, you have a NLP sample in your LSTM book.

        • Boris February 21, 2018 at 2:15 am

          Hi Jason,

          Thanks for yet another amazing post. Apart from purely educational purposes, what do you think would be the differences between using LSTM and CNN in sentiment analysis? Are both approaches completely interchangeable or each of the models might hold advantages/limitations in different setups.


          • Jason Brownlee February 21, 2018 at 6:41 am

            CNN seems to achieve state of the art results. I would start there.

  17. Satheesh March 24, 2018 at 10:25 pm

    Can we apply this approach to multi class problems? The only change is during fitting and using ‘categorical cross entropy” as loss.Am I correct

  18. Marwa May 5, 2018 at 3:24 am

    Hi Jason,
    I am beginner on keras and CNN ,I want to know How to give multinputs to train_test_split?

    my example:(Xtrain_user,Xtrain_item,Xtest_user,Xtest_item),y_train,y_test=train_test_split((user_reviews,
    item_reviews),rating , test_size=0.2, random_state=42)

    which X= (user_reviews,item_reviews)
    and Y=rating

  19. Junaid Akhtar May 11, 2018 at 3:19 pm

    First of all, thank you for what you did here, and generally what you do 🙂

    I tested your code on the IMDB data and got similar results, then changed the output to 2 neurons with softmax/argmax for training/testing and got marginally better results.

    Then I switched to my own 5-class sentiment dataset and got 98%-ish training accuracy but 40%-ish testing accuracy, which suggests it is overfitting. However, I have read in many papers that on the fine-grained (5-class) SST1 dataset everyone reports an accuracy of 40%-ish. With binary +/- sentiment, of course, everyone reaches the late 80s to 90s in accuracy.

    What do you suggest for that? Is there a setting I could try on top of your code?

    • Avatar
      Jason Brownlee May 12, 2018 at 6:26 am #

      The model architecture may require tuning to each different problem.

  20. Avatar
    Ethan Schmit May 31, 2018 at 7:53 pm #

    Hi Jason ,

    Thanks for the great article.

    If I have more than 2 classes, e.g. positive/negative/neutral, what will my output list (trainLabels) look like? Will it be a list containing 0’s, 1’s and 2’s, or will it be an n*3 matrix, with each column containing 0’s and 1’s?

    Thanks a lot
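
Both label layouts are used in practice. In Keras, integer labels (0, 1, 2) go with the `sparse_categorical_crossentropy` loss, while an n*3 one-hot matrix goes with `categorical_crossentropy`; in either case the output layer would typically have 3 units with a softmax activation. The sketch below shows the one-hot layout in plain Python (the helper name `one_hot` is hypothetical; Keras provides `to_categorical` for the same job).

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

# made-up labels: 0 = negative, 1 = neutral, 2 = positive
labels = [0, 2, 1, 2]
encoded = one_hot(labels, 3)
```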

  21. Avatar
    az June 28, 2018 at 4:01 am #

    I have created a network which takes two sequences of integers (2 inputs) for one sentence (one for word embeddings and the other for POS tags), passes each through its own embedding layer, and then merges them before applying the convolution layer. I had called it a multi-input model.

    inp = Input(shape=(data.shape[1],))
    e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
    inp2 = Input(shape=(embedding_pos.shape[1],))
    e1= Embedding(54, 54, input_length=maxSeqLength, weights=[embedding_matrix_pos], trainable=False)(inp2)


    conv1_1 = Conv1D(filters=100, kernel_size=3)(mer)
    conv1_2 = Conv1D(filters=100, kernel_size=4)(mer)

    From this article I understood that it might be multichannel. I am confused: are multiple inputs necessarily multichannel, or are multiple parallel convolution layers, even on the same input, multichannel? If I go by the second definition, would I call my network multi-input multichannel?
    Also, in this case all the convolutions are applied to the original input. What is the difference if I stack the convolutions so that the result of one is the input to another? Thanks.

    • Avatar
      Jason Brownlee June 28, 2018 at 6:25 am #

      If the inputs are different, it might be better to have a multi-headed CNN rather than a multichannel CNN.

  22. Avatar
    az June 28, 2018 at 8:51 am #

    The inputs are sequences for the same sentence and both are padded to the same length. Would they still be called different inputs? And can you please clarify whether multichannel means multi-input regardless of parallel or sequential convolutions?

    • Avatar
      Jason Brownlee June 28, 2018 at 2:07 pm #

      Each head of the model will “read” the data differently.

      Perhaps referring to the multiple inputs to the model as “channels” was a poor choice. Different inputs, as in the above model, are indeed not the same thing as different channels of one input. Importantly, we can use different sized kernels and numbers of filters with different inputs, whereas each channel of one input is fixed to use the same kernel and number of filters.

    • Avatar
      az July 12, 2018 at 4:45 am #

      Thanks a lot. I understand now that multichannel means one input with one or more embeddings for that one input. In your example you use the input multiple times, but it is still the same input. I have taken your advice and applied convolutions with a different filter size to each embedding separately instead of merging them together, which has improved accuracy. Only one thing: it means it is multi-input. I can understand how it can be seen as multi; basically, does having multiple inputs make it multi-headed? Thanks.

      • Avatar
        Jason Brownlee July 12, 2018 at 6:30 am #


        • Avatar
          az July 28, 2018 at 4:11 am #

          Thank you so much! The precision and recall graphs I plotted over the training epochs show sudden spikes in precision and a gradual increase in recall, although recall is higher than precision. My dataset has more positive samples than negative ones, which led me to believe there is a chance of a higher number of FN than FP, i.e. lower recall and higher precision, but my results are the opposite. Can you please explain whether my understanding is wrong?

          • Avatar
            az July 28, 2018 at 4:44 am #

            Or should I see it this way: since it has more positive samples, the classifier is biased towards the positive class, leading to more FP and hence lower precision and higher recall? Thanks.

          • Avatar
            Jason Brownlee July 28, 2018 at 6:40 am #

            You will need to find a trade-off that makes sense for your specific application and problem.
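
The relationship discussed in this thread can be checked with a little arithmetic: precision = TP / (TP + FP) and recall = TP / (TP + FN). A classifier that leans towards the positive class tends to produce more false positives and fewer false negatives, which pushes precision down and recall up. The counts below are made up purely for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# made-up counts for a classifier that leans towards the positive class
biased = precision_recall(tp=90, fp=40, fn=10)
# made-up counts for a more conservative classifier
conservative = precision_recall(tp=60, fp=5, fn=40)

print("biased:       precision=%.2f recall=%.2f" % biased)
print("conservative: precision=%.2f recall=%.2f" % conservative)
```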

  23. Avatar
    Yuval July 11, 2018 at 11:20 pm #

    I plotted the precision-recall curve for both the pos and neg classes and found the results interesting.
    While the curve for the pos class looks very good, with an optimal point at 0.9 and 0.8 for recall and precision respectively, the curve for the neg class is a straight line that gives 0.6 for precision and recall, or 0.8 with 0.6 and vice versa.
    Any idea how this could be?

    • Avatar
      Jason Brownlee July 12, 2018 at 6:25 am #

      It may suggest that one class is easier to predict than the other.

  24. Avatar
    Zenon Uchida July 22, 2018 at 9:39 pm #

    I did exactly what the tutorial does, but I got a lower test accuracy of 83.5%. Training accuracy is 100%.

    • Avatar
      Jason Brownlee July 23, 2018 at 6:09 am #

      You could try running the example a few times to see if you get differing results.

      • Avatar
        Zenon Uchida August 5, 2018 at 2:45 am #

        (forgot to reply) Yes, actually got different results after running it a few times.
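
Since results differ between runs, one common way to report performance is to repeat the experiment several times and summarize the scores. The sketch below is only an outline: `run_experiment` is a hypothetical stand-in that returns a made-up score, and would be replaced by code that builds, fits and evaluates the model.

```python
import random
from statistics import mean, stdev

def run_experiment(seed):
    # hypothetical stand-in: replace with code that builds, fits and
    # evaluates the model, returning the test accuracy
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.03, 0.03)

# repeat the experiment and summarize the spread of scores
scores = [run_experiment(seed) for seed in range(5)]
summary = (mean(scores), stdev(scores))
print("accuracy: %.3f (+/- %.3f)" % summary)
```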

  25. Avatar
    andi wijaya July 28, 2018 at 2:03 am #

    def max_length(lines):
        return max([len(s.split()) for s in lines])
    I just copied and pasted the code and got this error. I believe the code is actually right, so I don’t know why I got an error:
    AttributeError: ‘list’ object has no attribute ‘split’

    Python 3.6
    Tensorflow 1.9
    Keras 2.2
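
The error message suggests that `lines` held lists rather than strings, for example documents that were already split into tokens. That is an assumption about this case, not something the comment confirms. A minimal sketch of a version that accepts either form:

```python
def max_length(lines):
    """Length in words of the longest document."""
    lengths = []
    for s in lines:
        # a string is split into words; a list is assumed to hold tokens already
        tokens = s.split() if isinstance(s, str) else s
        lengths.append(len(tokens))
    return max(lengths)

as_strings = ["a great film", "bad"]
as_tokens = [["a", "great", "film"], ["bad"]]
print(max_length(as_strings), max_length(as_tokens))
```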

  26. Avatar
    Zenon Uchida August 4, 2018 at 11:59 pm #

    I have a question on the “Plot of the Multichannel Convolutional Neural Network For Text”.
    Why is there always a “None” in each box? What does it denote?

    • Avatar
      Zenon Uchida August 5, 2018 at 2:02 am #

      Found the answer: The None dimension in the shape tuple refers to the batch dimension which simply means that the layer can accept input of any size.

    • Avatar
      Jason Brownlee August 5, 2018 at 5:31 am #

      Good question!

      “None” refers to a dimension that is not specified, and in turn is variable.

  27. Avatar
    Zenon Uchida August 5, 2018 at 2:50 am #

    I have a question. Based on this paper, (A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification), they stated “we set the number of feature maps for this region size to 100”. How do i set the number of feature maps? Is the number of feature maps based on the number of filters per region size?

    • Avatar
      Jason Brownlee August 5, 2018 at 5:35 am #

      The number of feature maps is the first argument to the convolutional layer Conv1D().
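
To make the link between filters and feature maps concrete: with the default 'valid' padding, a 1D convolution over a sequence of `length` steps produces `length - kernel_size + 1` steps (for a stride of 1), and one feature map per filter. The helper below is a hypothetical illustration of that arithmetic, not a Keras function, and the input length is made up.

```python
def conv1d_output_shape(length, kernel_size, filters, strides=1):
    """Output (steps, feature_maps) of a 1D convolution with 'valid' padding."""
    steps = (length - kernel_size) // strides + 1
    return steps, filters

# e.g. a region size of 4 with 100 feature maps over a 1380-word document
shape = conv1d_output_shape(length=1380, kernel_size=4, filters=100)
print(shape)
```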

  28. Avatar
    Zenon Uchida August 17, 2018 at 1:39 am #

    I really got lots of questions sorry xD.

    What’s an intuitive way of understanding how this multichannel CNN (or perhaps a basic CNN architecture in general) identifies the following (given that the sentiments expressed are only positive or negative):
    1.) If more than one sentiment is expressed in the tweet but the positive sentiment is more dominant, then it is a positive tweet, or
    2.) If more than one sentiment is expressed in the tweet but the negative sentiment is more dominant, then it is a negative tweet.

    • Avatar
      Jason Brownlee August 17, 2018 at 6:34 am #

      We cannot know what the CNN has learned, only the evaluation of the model performance.

      • Avatar
        Zenon Uchida August 17, 2018 at 3:40 pm #

        What I’m trying to understand here is how a CNN is applied to sentence classification. I still don’t get the idea. What I know is that in computer vision a CNN finds features of pictures, for example a dog with features such as ears, nose and eyes.

        • Avatar
          Zenon Uchida August 17, 2018 at 4:56 pm #

          I think I got it.
          In convolutional neural networks every network layer acts as a detection filter for the presence of specific features present in the original data. The first layer in a CNN detect (large) features that can be recognized and interpreted relatively easy. Subsequent layers detect increasingly (smaller) features that are more abstract. The last layer of the CNN is able to make an ultra-specific classification by combining all the specific features detected by the previous layers in the input data.
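
For text, a filter slides over consecutive word vectors rather than image patches, so a filter with kernel size 2 behaves like a detector for certain two-word patterns. The toy sketch below uses hand-picked (not learned) embeddings and filter weights, all made up for illustration, followed by a global max over the feature map.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (valid padding, ReLU)."""
    k = len(kernel)
    feature_map = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        feature_map.append(max(0.0, total))
    return feature_map

# made-up 3-dimensional "embeddings"
embedding = {
    "not": [1.0, 0.0, 0.0],
    "good": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0],
    "very": [0.0, 0.5, 0.5],
}
# a hand-picked filter that responds to the bigram "not good"
kernel = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

def score(sentence):
    vectors = [embedding[w] for w in sentence.split()]
    # keep the strongest response anywhere in the sentence
    return max(conv1d(vectors, kernel, bias=-1.0))

print(score("movie not good"), score("very good movie"))
```

In a trained model the embeddings and filter weights are learned from the data, so which patterns each filter responds to is not chosen by hand.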

  29. Avatar
    Ishay Telavivi August 19, 2018 at 1:23 am #


    I have a question about the input of the model, based on an error I got.

    I ran this code on different data and had no problems. Then I tried an extension with RandomizedSearchCV, as follows:

    model = KerasClassifier(build_fn=create_model, verbose=1, epochs=3, batch_size=32)
    param_dist = {"n_strides": sp_randint(1, 3)}
    random_grid = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=3)
    random_grid_result = random_grid.fit([X_train, X_train, X_train], y_train)

    I got the following error:
    ValueError: Found input variables with inconsistent numbers of samples: [3, 25000]

    What may the error be? 25000 is the length of my train data

    • Avatar
      Jason Brownlee August 19, 2018 at 6:26 am #

      Sorry, I’m not sure what is going on.

      Perhaps post your code and error to stackoverflow?

      • Avatar
        Ishay Telavivi August 19, 2018 at 7:13 am #

        Sorry for not being understood.

        I have explored this further and found out that the Keras wrapper doesn’t support multi-input networks. That was the problem.

        So I decided to use Francesco’s comment above (January 25), using one Input for all three channels, and it’s working!

  30. Avatar
    Ishay Telavivi August 19, 2018 at 1:33 am #


    I have a question about negation words.
    When you clean your data by excluding stopwords, you lose the negation words (‘not’ etc.). Don’t you want to keep these words for a better interpretation of the sentence?

    For example: “I didn’t like the movie” and “I liked the movie” would look the same after cleaning.

    • Avatar
      Jason Brownlee August 19, 2018 at 6:27 am #

      It really depends on whether they add value on the specific problem you are solving or not.

      Perhaps try modeling with and without them?
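
A minimal sketch of the "with and without" comparison, assuming a tiny hypothetical stop word list (real lists, such as NLTK's, are much longer) and a crude expansion of "n't" so that the negation becomes its own token:

```python
# a tiny hypothetical stop word list, for illustration only
STOP_WORDS = {"i", "the", "a", "did", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def clean(text, keep_negations=False):
    """Lowercase, split on white space and filter stop words."""
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    # crude normalization: expand "n't" so the negation becomes its own token
    tokens = text.lower().replace("n't", " not").split()
    return [t for t in tokens if t not in stop]

a = "I didn't like the movie"
b = "I did like the movie"
print(clean(a), clean(b))
print(clean(a, keep_negations=True), clean(b, keep_negations=True))
```

With negations removed the two reviews collapse to the same tokens; with negations kept they stay distinguishable. Whether that helps the model is still something to test on the data.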

      • Avatar
        Ishay Telavivi August 19, 2018 at 7:05 am #

        Great Thanks

  31. Avatar
    Suleyman Suleymanzade September 6, 2018 at 7:18 am #

    These are two neural networks that I tried to merge using a concatenate operation. The network should classify IMDB movie reviews as 1 (good) or 0 (bad).
    I get an error when fitting (training) the model:
    history = conc_model.fit([X_train, X_train], [y_train, y_train], np.reshape([y_train, y_train],(25000,2)),epochs=3, batch_size=(64,64))
    TypeError: fit() got multiple values for argument ‘batch_size’.
    This is the method that should return the trained model. BTW, the x_train shape is (25000, 5) and the y_train shape is (25000,).

    def cnn_lstm_merged():
        embedding_vecor_length = 32
        cnn_model = Sequential()
        cnn_model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
        cnn_model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

        lstm_model = Sequential()
        lstm_model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
        lstm_model.add(LSTM(64, activation='relu', return_sequences=True))

        merge = Concatenate([lstm_model, cnn_model])
        hidden = Dense(1, activation='sigmoid')

        conc_model = Sequential()

        conc_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        history = conc_model.fit([X_train, X_train], [y_train, y_train], np.reshape([y_train, y_train],(25000,2)),epochs=3, batch_size=(64,64))
        return history

    • Avatar
      Jason Brownlee September 6, 2018 at 2:09 pm #

      Sorry, I don’t have the capacity to debug your new code, perhaps post it to stackoverflow?

  32. Avatar
    Emmanuel September 14, 2018 at 2:56 pm #

    Very Nice Tutorial!

    I adapted your version to fit my data, which is saved in a pandas DataFrame. Since I have a limited data set, the evaluation score has reached 100%, which is a great improvement compared to my previous BOW model at 98%.

    You can find the complete code at my github:

    Thank you so much Jason!

    Kind Regards,

    • Avatar
      Jason Brownlee September 15, 2018 at 6:01 am #

      Well done!

    • Avatar
      sanjie December 3, 2018 at 1:40 am #

      hello Emmanuel,
      thanks for the updated version on your github, when making multi-classes fit, you should add the code

      after your code

      Thank Jason Brownlee again, i am really one of your fans.

  33. Avatar
    Astariul October 16, 2018 at 7:15 pm #

    Hi Jason,

    My first comment on this amazing blog of yours, even though I have read a lot of your articles, all of them useful.

    Anyway, I have a question about MaxPooling layers: why do we need them at all?

    Is it just for reducing the dimension? Because I feel that using it will make us lose some of the information we learned in the convolution layer.


    In my application I want to work with embeddings but with n-grams as well. If I have the sentence ‘I like you’, I want to end up with a tensor of dimension [?, 6, d] (d is the dimension of the embeddings). The tensor would represent:
    ‘I like’
    ‘like you’
    ‘I like you’

    So I want to use the basic embeddings for the first 3 tokens, apply a 2-gram convolution layer to get the next 2 tokens, and finally apply a 3-gram convolution layer for the last token. Then I concatenate everything (choosing the kernel sizes appropriately).

    In this case, why would I want to apply a MaxPooling layer ?
    Do you think my approach could work ?

    Thank you and keep up the good work 🙂

    • Avatar
      Jason Brownlee October 17, 2018 at 6:48 am #

      Yes, we use the pooling layer to distil the large feature maps down to the most essential.

      Try with and without the pooling layer and use the model that gives the best performance.

  34. Avatar
    Prem Kumar October 23, 2018 at 10:59 pm #

    Hi Jason,

    It’s very helpful for me to learn about NLP and its tasks. Thank you very much for your work. Please do the same for computer vision problems too. I can’t find a blog like yours anywhere for computer vision problems. Please consider it.

  35. Avatar
    prisilla December 31, 2018 at 5:00 am #

    Hi Jason,

    For one of my inputs, when I run the CNN code,
    the output I receive is:
    Epoch 1/100
    250/250 [==============================] – 77s 308ms/step – loss: -2.8596 – acc: 0.7843
    Epoch 2/100
    250/250 [==============================] – 76s 303ms/step – loss: -2.9239 – acc: 0.7863
    Epoch 3/100
    250/250 [==============================] – 75s 300ms/step – loss: -2.8824 – acc: 0.7904
    Epoch 4/100
    250/250 [==============================] – 74s 294ms/step – loss: -2.9322 – acc: 0.7851

    Do you think the model is correct, given that the loss has negative values?


    • Avatar
      Jason Brownlee December 31, 2018 at 6:15 am #

      Negative loss is interesting. Something odd might be going on.
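
One cause that is often behind a negative binary cross-entropy is target values outside the 0 to 1 range, for example class labels coded as 1 and 2, or as -1 and 1. This is an assumption about the data here, not something the comment confirms. The sketch below writes out the per-sample loss in plain Python (the function is a simplified illustration, not the Keras implementation) to show the effect.

```python
from math import log

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample binary cross-entropy for one prediction."""
    # clip the prediction away from 0 and 1 to avoid log(0)
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

valid = binary_crossentropy(1, 0.9)    # label in {0, 1}: loss is non-negative
invalid = binary_crossentropy(2, 0.9)  # label outside {0, 1}: loss can go negative
print(valid, invalid)
```

If the labels turn out to be outside 0 and 1, remapping them to 0 and 1 (or switching to a multi-class loss for more than two classes) would be the thing to try.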

  36. Avatar
    Niko January 9, 2019 at 3:46 am #

    Hello Jason,

    I wanted to apply grid search on this model (3-channels) by following your other tutorials but I’m getting this error

    >ValueError: Found input variables with inconsistent numbers of samples: [3, 8000]

    This is the code:
    # grid search
    epochs = [1, 10]
    batch_size = [16, 32]
    param_grid = dict(epochs=epochs, batch_size=batch_size)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
    grid_result = grid.fit([trainX, trainX, trainX], array(trainLabels))

    I have tried googling but I didn’t find any answers for it.

    Thank you in advance!

    • Avatar
      Jason Brownlee January 9, 2019 at 8:48 am #

      Perhaps try running the grid search loops manually.
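
The error is consistent with scikit-learn treating the list of three arrays as an input with 3 samples, which does not match the 8000 labels. A manual grid search sidesteps the wrapper entirely. The sketch below shows only the loop structure: `evaluate` is a hypothetical stand-in that returns a made-up score, and would be replaced by code that builds the model, fits it on `[trainX, trainX, trainX]` and returns a validation score.

```python
from itertools import product

def evaluate(trainX, trainLabels, epochs, batch_size):
    # hypothetical stand-in: build the model, fit it with
    # [trainX, trainX, trainX] and return a validation score
    return 1.0 / (1 + abs(epochs - 10)) + 1.0 / batch_size

epochs_grid = [1, 10]
batch_grid = [16, 32]

# try every combination and remember its score
results = {}
for epochs, batch_size in product(epochs_grid, batch_grid):
    results[(epochs, batch_size)] = evaluate(None, None, epochs, batch_size)

best_config = max(results, key=results.get)
print(best_config, results[best_config])
```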

  37. Avatar
    Roi Ong February 1, 2019 at 12:39 am #

    Hi Jason! Does a prediction value close to either class mean higher confidence? Say the model outputs 0.99 for sentence A and 0.65 for sentence B: does this mean that it predicts sentence A with higher confidence than sentence B? Or does it have something to do with overfitting, where a value like the one for sentence A comes from fitting the data too closely, which may cause erroneous predictions in the future because the model prioritizes sentences like sentence A?

    • Avatar
      Jason Brownlee February 1, 2019 at 5:39 am #

      A larger predicted probability could be interpreted as higher confidence in the prediction.
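
As a small illustration, a single sigmoid output can be turned into a label and a confidence value. The sketch assumes that 1 means positive and 0 means negative and uses a 0.5 threshold; the helper name `interpret` is hypothetical. A high value is not by itself evidence that the probability is well calibrated.

```python
def interpret(probability, threshold=0.5):
    """Map a sigmoid output to a (label, confidence) pair."""
    label = "positive" if probability >= threshold else "negative"
    # confidence is the probability assigned to the predicted class
    confidence = probability if label == "positive" else 1.0 - probability
    return label, confidence

for p in (0.99, 0.65, 0.2):
    print(p, interpret(p))
```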

      • Avatar
        Roi Ong February 3, 2019 at 4:44 am #

        Thank you so much Jason!

  38. Avatar
    Asmaa M. Elmohamady February 1, 2019 at 11:15 pm #

    Can I use a shared input instead of a separate input for every CNN channel?

    • Avatar
      Jason Brownlee February 2, 2019 at 6:17 am #

      What do you mean exactly?

      You have the data once in memory and provide multiple references to it.

  39. Avatar
    Asmaa M. Elmohamady February 5, 2019 at 12:17 am #

    sorry, I didn’t receive notification about your reply.
    I mean, if I use one input with one embedding, can I feed it once into parallel convolutions with different kernel sizes, and is this also called multichannel or not?

    def define_model(length, vocab_size):
        # single input and embedding shared by all three channels
        inputs1 = Input(shape=(length,))
        embedding1 = Embedding(vocab_size, 100)(inputs1)
        # channel 1
        conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
        drop1 = Dropout(0.5)(conv1)
        pool1 = MaxPooling1D(pool_size=2)(drop1)
        flat1 = Flatten()(pool1)
        # channel 2
        conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding1)
        drop2 = Dropout(0.5)(conv2)
        pool2 = MaxPooling1D(pool_size=2)(drop2)
        flat2 = Flatten()(pool2)
        # channel 3
        conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding1)
        drop3 = Dropout(0.5)(conv3)
        pool3 = MaxPooling1D(pool_size=2)(drop3)
        flat3 = Flatten()(pool3)
        # merge
        merged = concatenate([flat1, flat2, flat3])
        # interpretation
        dense1 = Dense(10, activation='relu')(merged)
        outputs = Dense(1, activation='sigmoid')(dense1)
        # only inputs1 is defined above, so it is the model's single input
        model = Model(inputs=inputs1, outputs=outputs)
        # compile
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # summarize
        plot_model(model, show_shapes=True, to_file='multichannel.png')
        return model

    • Avatar
      Jason Brownlee February 5, 2019 at 8:24 am #

      Yes, I would call this a multi-headed model and save “channel” to refer to the depth of a given input.

      • Avatar
        Asmaa M. Elmohamady February 5, 2019 at 10:37 pm #

        What do you mean by save “channel” to refer to the depth of a given input?

        • Avatar
          Jason Brownlee February 6, 2019 at 7:45 am #

          As in an input image has 3 channels, one for red/blue/green.

          And, the depth of a stack of feature maps is referred to as the channels.

  40. Avatar
    sanker February 13, 2019 at 7:50 pm #

    I am doing a project on “paraphrase detection using deep learning”.
    I have two inputs, two sentences. Both sentences need to be trained on separately. How do I fit my model?

  41. Avatar
    Minel March 5, 2019 at 4:55 am #

    Hello jason
    There is a small typo in the beginning of the example
    table = str.maketrans('', '', string.punctuation)
    but it is fixed in the whole example
    table = str.maketrans('', '', punctuation)

    • Avatar
      Jason Brownlee March 5, 2019 at 6:41 am #

      I don’t believe so, are you sure?

      • Avatar
        Minel March 5, 2019 at 7:30 pm #

        Yes only in the first example

        # turn a doc into clean tokens
        def clean_doc(doc):
            # split into tokens by white space
            tokens = doc.split()
            # remove punctuation from each token
            table = str.maketrans('', '', string.punctuation)
            tokens = [w.translate(table) for w in tokens]

        • Avatar
          Jason Brownlee March 6, 2019 at 7:48 am #

          The first example in the section “Loading and Cleaning Reviews” imports the string module (“import string”) and then uses “string.punctuation”.

          It is correct Python code.

          Perhaps I misunderstand your comment?

          • Avatar
            Minel March 6, 2019 at 11:05 pm #

            Hello Jason

            Yes you are right. I merged the code of two examples that is why I got a traceback
            Sorry about that

          • Avatar
            Jason Brownlee March 7, 2019 at 6:50 am #

            No problem.

  42. Avatar
    Steve March 24, 2019 at 3:00 pm #

    Hi Jason,

    I’ve followed your article and it was really helpful.
    Right now I’m stuck: the model’s val_loss keeps increasing while the val_acc keeps increasing as well.

    I followed your article on reducing overfitting, but adding dropout layers didn’t work at all. I tried increasing the amount of training data, which in fact made the results even worse.

    I’ve posted a question explaining top to bottom about my problem in stackoverflow.
    It’ll be really helpful if you can take a look at the question.

    Here’s the link

    I’m looking forward for your answer.


    • Avatar
      Jason Brownlee March 25, 2019 at 6:41 am #

      Perhaps try using SGD, reducing the learning rate, and increasing the number of training epochs.

  43. Avatar
    saurabhk April 18, 2019 at 3:14 pm #

    CNNs would try learning the padded sentences directly which would result in noisy learned representations, how do we ignore padded value so it has no impact on CNN filter learning?

    • Avatar
      Jason Brownlee April 19, 2019 at 6:04 am #

      Typically CNNs ignore zero padded values, e.g. they use padding all the time as part of performing convolutions.

  44. Avatar
    SK April 19, 2019 at 10:12 pm #

    Hey Jason,

    Thanks for this easy-to-follow tutorial. I do not get any improvement with a multi-channel convnet compared to a single convnet with a 3-gram kernel and more filters (128 instead of 32) with GlobalMaxPooling. Do you have any ideas why that would be ? Can you suggest any effective tweaks to improve a multi-label text classifier ?

    Best regards

  45. Avatar
    SK April 25, 2019 at 9:37 pm #

    Hey Jason,

    Thanks for the very useful link.


  46. Avatar
    maunish June 10, 2019 at 7:04 pm #

    Hi Jason, great article.

    Can you write an article on the most common ways to approach a text-related machine learning problem?

  47. Avatar
    maunish June 14, 2019 at 8:03 pm #

    I wrote a fake review as

    “this was wonderfull movie
    this movie is amazing
    actors have acted very well and performance was outstanding
    i will watch this movie again”

    Then I used the same model to predict whether the review was good or bad.

    The result was 0.50316983, but I was expecting more, as the review is straightforward:
    I used mostly positive words and avoided any confusing words.

    So what is the problem here?

    • Avatar
      Jason Brownlee June 15, 2019 at 6:32 am #

      Perhaps the model was overfit, you could try fitting it again?

  48. Avatar
    SP June 26, 2019 at 5:06 pm #

    Hi Jason,

    Thanks for your wonderful tutorial. One query: can it support multiple languages? I mean, if the dataset is in a language other than English, does it require any change to the word embedding, or does it still work?


    • Avatar
      Jason Brownlee June 27, 2019 at 7:45 am #

      You can train your own word embedding for any language as far as I know.

  49. Avatar
    SP June 27, 2019 at 8:26 pm #

    Thank you, Jason, for your response. Have you used any pre-trained word embedding (e.g. w2v/GloVe) in your code? I could only see the Keras tokenizer function (correct me if I am wrong). When I adapted your code to another language and a different dataset, it ran smoothly, so I asked whether it specifically requires any word embedding.


  50. Avatar
    Zeinab July 7, 2019 at 12:39 am #

    What is the difference between CNN and the multichannel CNN?

    Why do I need to use the multichannel CNN?

    • Avatar
      Jason Brownlee July 7, 2019 at 7:52 am #

      It can use a separate kernel size on the same input data and effectively “see” the data at different resolutions at once.

  51. Avatar
    zeinab July 7, 2019 at 11:08 pm #

    Below is a function for calculating the correlation coefficient. I use it to measure accuracy on a regression problem (text similarity). I use both the LSTM and the multichannel CNN; however, the correlation coefficient takes negative values starting from the second epoch.
    Can you help me and check the correctness of this function?

  52. Avatar
    zeinab July 7, 2019 at 11:10 pm #

  53. Avatar
    zeinab July 8, 2019 at 12:18 am #

    When I use the correlation coefficient as a metric,
    the resulting values range from -1 to -85.

    Do these values mean that there is a problem in the model?
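
A correctly computed Pearson correlation coefficient always lies between -1 and 1, so values such as -85 point to a problem in how the metric is calculated rather than in the model alone. The plain-Python reference below (a sketch, not the function from the comment, which is not shown) can be used to sanity-check a custom metric on a few predictions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly correlated
print(pearson([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated
```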

  54. Avatar
    zeinab July 8, 2019 at 3:28 am #

    Can the Pearson correlation coefficient and the loss decrease at the same time? What does this mean?

    I think the loss should decrease while the correlation coefficient increases during training.

  55. Avatar
    zeinab July 8, 2019 at 3:50 am #

    I ran the multichannel CNN model on a text similarity problem.

    Every time I run the model, I get different results (loss, accuracy).

    Is it normal to have different loss and accuracy results every time I run the code?

  56. Avatar
    Bibin Antony October 3, 2019 at 11:30 pm #

    How can I do sentiment analysis for Glassdoor data with maximum accuracy, also taking the ratings into account?

  57. Avatar
    zeinab October 15, 2019 at 5:54 pm #

    What is the advantage of using multichannel CNN over just one CNN?

    • Avatar
      Jason Brownlee October 16, 2019 at 7:58 am #

      It can read the input in different ways in parallel, e.g. extract different features from the same input within one model.

  58. Avatar
    moSaber December 9, 2019 at 12:49 am #

    Hi Jason,

    When I run the complete model as is, I get a 50% accuracy rate (see the following output). I’ve tried adding more layers to the CNN, using different kernels, tweaking the hyperparameters in the embedding layers, etc., and I still get the same accuracy rate. Would you be able to advise how I can boost the accuracy? Also, I’ve noticed that when I run the code for the first time, I get a 99% accuracy rate on train data and 50% on test data.

    Max document length: 1380
    Vocabulary size: 44277
    (1800, 1380) (200, 1380)
    Train Accuracy: 50.000000
    Test Accuracy: 50.000000

  59. Avatar
    James December 10, 2019 at 5:54 am #

    Hi Jason, thanks for the tutorial. Would you be able to write a tutorial on aspect-based sentiment analysis with Keras? That would be awesome!

    • Avatar
      Jason Brownlee December 10, 2019 at 7:36 am #

      What is “aspect-based sentiment analysis”?

  60. Avatar
    Matip December 15, 2019 at 9:57 am #

    How do I make a prediction for a sentence using your tutorial?
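
In outline, a new sentence has to go through the same preparation as the training data: clean it, encode it with the fitted tokenizer (`tokenizer.texts_to_sequences`), pad it to the training length (`pad_sequences`), and pass one copy per input to `model.predict`. The plain-Python sketch below only imitates the encode-and-pad step with a hypothetical three-word vocabulary; the truncation rule is a simplification and zeros are added at the end as an assumption about how the training data was padded.

```python
def encode_and_pad(text, word_index, length):
    """Encode words as integers and pad with zeros to a fixed length."""
    # unknown words are dropped, as the Keras Tokenizer does without an oov_token
    encoded = [word_index[w] for w in text.lower().split() if w in word_index]
    encoded = encoded[:length]
    return encoded + [0] * (length - len(encoded))

word_index = {"movie": 1, "great": 2, "bad": 3}  # hypothetical vocabulary
x = encode_and_pad("This movie was great", word_index, length=6)
print(x)
# with a fitted model, the padded row would be passed once per input,
# e.g. model.predict([X, X, X]) where X is an array with this one row
```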

  61. Avatar
    Ryan Lambert January 21, 2020 at 10:12 am #

    Hi Jason – I tweaked this to run predictions on other text documents, using the above movie review data to train on. I tried scoring predictions on text strings such as “this movie sucks do not recommend” and got quite poor results (the model predicted a 65% probability that it was a positive review).

    Just wondering if there are additional training resources to build into the model to get a truly rounded approach to positive/negative sentiment? It seems to work well with large paragraph/eloquent reviews but not so well with short 5-10 word reviews like “this movie was fantastic”.

    • Avatar
      Jason Brownlee January 21, 2020 at 2:52 pm #

      Nice work, and interesting finding.

      Yes, train on more diverse data or data more appropriate for the way the model is intended to be used.

  62. Avatar
    HSA February 13, 2020 at 8:58 pm #

    What if I use a different embedding in each channel?
    embedding1 = random embedding (channel 1)
    embedding2 = word2vec embedding (channel 2)
    Does this seem like the right idea or not?

  63. Avatar
    MBD February 29, 2020 at 2:27 am #

    Hi Jason,
    Thanks for this beautiful work. I want to ask whether we can implement a multichannel CNN comprising two CNN channels: a word embedding channel and a POS embedding channel. If yes, how can we do that?
    Thank you for your time

    • Avatar
      Jason Brownlee February 29, 2020 at 7:16 am #

      Yes, have a separate input model for each data source.

      • Avatar
        MBD March 2, 2020 at 6:50 am #

        In this tutorial, if we want to implement this CNN with two inputs, first a word embedding channel and second a POS embedding channel, how can we implement that?

        • Avatar
          Jason Brownlee March 2, 2020 at 10:06 am #

          Define the two embeddings and connect them to the CNN layers, then merge the output of those layers. All with the functional API.

          The above example will provide a starting point you can adapt.

  64. Avatar
    MBD March 27, 2020 at 4:15 am #

    Hi Jason,
    Thanks for this beautiful work. I want to ask whether we can use the output of the multichannel CNN to apply LDA?

  65. Avatar
    Manish April 6, 2020 at 9:49 am #

    Hi Jason,
    Thanks for sharing this work. I was following your work to build a text classifier and noticed that you are using maxpooling of size 2. However, I believe that the original paper uses maxpooling over time (implemented as GlobalMaxPooling1D in Keras). Is there a particular reason for that?

    • Avatar
      Jason Brownlee April 6, 2020 at 10:35 am #


      Nice. No reason. Perhaps try other pooling methods and compare results?
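
The difference between the two pooling styles is easy to see on a single feature map. Pooling with a size of 2 keeps the maximum of each pair of neighbouring values and so roughly halves the length, while max-over-time (global) pooling keeps a single value per feature map. The functions below are plain-Python illustrations with made-up values, not the Keras layers.

```python
def max_pool_1d(values, pool_size=2):
    """Max over non-overlapping windows; an incomplete final window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

def global_max_pool(values):
    """Max-over-time pooling: one value for the whole feature map."""
    return max(values)

feature_map = [0.1, 0.7, 0.3, 0.2, 0.9, 0.4, 0.5]
local = max_pool_1d(feature_map)
overall = global_max_pool(feature_map)
print(local, overall)
```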

  66. Avatar
    catalina May 5, 2020 at 5:22 am #

    Hi Jason,

    I don’t understand why you added this layer “dense1 = Dense(10, activation='relu')(merged)”. I can’t seem to find it in the original paper architecture.

    Can you please explain?

    • Avatar
      Jason Brownlee May 5, 2020 at 6:36 am #

      This is just a layer to interpret the merged input. You can experiment with and without it.

      • Avatar
        catalina May 5, 2020 at 7:10 am #

        Thanks for the quick reply! Is that the place in the architecture where Kim (2014) and Zhang & Wallace (2015) place dropout and l2 regularization?

  67. Avatar
    BhaviK kanekar May 24, 2020 at 9:47 pm #

    How can we apply k fold cross validation to this multi-channel CNN ? I need help for that.
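
One way to approach this is to generate the fold indices yourself and, for each fold, build a fresh model, fit it on the training rows (passed once per input) and evaluate it on the held-out rows. The sketch below covers only the index generation, using the standard library; the helper name `kfold_indices` is hypothetical, and scikit-learn's `KFold` offers the same idea.

```python
import random

def kfold_indices(n_samples, k=5, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into k folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
for train_idx, test_idx in splits:
    # here: build a new model, fit on the train rows, evaluate on the test rows
    print(len(train_idx), len(test_idx))
```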

  68. Avatar
    Josen Conder August 9, 2020 at 6:58 am #

    Hi Jason. Thank you for your helpful tutorial. I just wonder: can the model you used for classification be used for a regression task with a small change to the last dense layer, i.e. activation = linear instead of sigmoid? So for any classification model, do we only need to make a small modification to the last layer to use it for a regression task? Is it that simple? Thanks in advance!

    • Avatar
      Jason Brownlee August 10, 2020 at 5:42 am #

      You’re welcome.

      Yes. Change the activation function in the output layer, the loss function and the metrics.

  69. Avatar
    Himanshu Dwivedi September 10, 2020 at 7:18 pm #

    Hi Jason

    I am working on a 1.6 million row dataset from Kaggle for sentiment analysis.

    I kept only two fields: 1: Sentiment 2: Text

    Now I want to use an ANN to train my model.

    Regarding this, I have a few questions.

    1: How do I decide how many inputs there should be in the input layer and hidden layers?

    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

    I have one input layer, one hidden layer and one output layer. Since I have only two fields, one dependent and one independent, I set units to 1 in the input and hidden layers.

    I tried to find the shape of this; when I checked the shape of my X_train it came out as 100.

    Is what I set for the input correct, or do I need to give the shape value in the input field?

    Please help

    • Avatar
      Jason Brownlee September 11, 2020 at 5:53 am #

      Perhaps test different numbers of inputs and discover what works best for your dataset.

  70. Avatar
    Himanshu December 28, 2020 at 1:27 am #

    Hi Jason,

    While executing the code, I am getting this error: AttributeError: 'str' object has no attribute 'decode'. How can I address this?

  71. Avatar
    Slava Kostin January 2, 2021 at 10:31 am #

    Thanks for the article. I tried to improve it by using 1 input and 1 embedding layer, which reduced the number of parameters:
    Total params: 5,144,937

    After training for 12 epochs with batch=6, I got the same results:
    Epoch 12/12
    300/300 – 7s – loss: 8.2556e-06 – accuracy: 1.0000
    Test Accuracy: 87.500000

    As I understand it, the Embedding layer is not trainable, so I don’t need 3 of them?

  72. Avatar
    Slava Kostin January 2, 2021 at 10:41 am #

    Update: I got 90% in one run (other runs gave 84~88%).

    Epoch 15/15
    300/300 – 7s – loss: 2.6125e-05 – accuracy: 1.0000
    Test Accuracy: 90.499997

  73. Avatar
    Toby February 2, 2021 at 9:57 am #

    Thanks for the great breakdown. If you wanted to modify this to feed in different texts rather than just trainX, say trainX, trainY and trainZ, what would you need to change in the above? I imagine they would need the same tokenizer? Thanks, Toby

    • Avatar
      Jason Brownlee February 2, 2021 at 1:20 pm #

      I think you’re referring to a multi-output model. If so, you can adapt the model to have multiple outputs directly.

      This tutorial will help to get you started:

      • Avatar
        Toby February 3, 2021 at 10:30 pm #

        Thanks for responding. I still have one output I’m trying to predict (i.e. a classification example). Basically, I’m trying to see if changes in my text from one document to the next can help predict my target variable, so I wanted to feed in both documents. I adapted the code above so that each document has its own tokenizer, sequence length and vocabulary, feeding its own embedding matrix.

        My problem is a little similar to the Quora duplicate questions problem, but I just wanted to feed a concatenated CNN instead of an LSTM to start with. So far the validation accuracy seems OK, but I will test against the test set soon.

        Does this seem OK? Is this technically called a multichannel CNN if you are feeding in two different documents?


        • Avatar
          Jason Brownlee February 4, 2021 at 6:19 am #

          Hard for me to tell, sorry.

          Perhaps try it and compare results to other methods.

  74. Avatar
    Musarat Hussain October 30, 2021 at 3:08 pm #

    Hey, Jason Brownlee
    First of all, thank you very much for your wonderful article; your articles are helping me more than my supervisor. Whenever I face a problem, I first search for your article related to that problem.

    I have one question related to this post. Can you please explain the computational cost associated with this architecture? I am trying a similar idea, but I am concerned about the time it takes. Is it feasible for a problem that needs a quick, reactive response?

    • Avatar
      Adrian Tam November 1, 2021 at 1:39 pm #

      It depends on how quick you need it to be, but neural networks are not fast by nature. They have many more parameters than other machine learning models.

  75. Avatar
    Musarat Hussain November 16, 2021 at 2:09 am #

    Got your point. Thank you very much. Stay Blessed.

  76. Avatar
    HS December 14, 2021 at 6:38 pm #

    Hello, Jason Brownlee
    A multichannel convolutional neural network for text classification uses multiple versions of the standard model with different sized kernels. Since it uses multiple versions of a standard CNN, can we use a different number of filters and a different embedding dimension in each channel? For example, I want 128 filters with kernel size 6 and embedding dimension 64 in channel one, 64 filters with kernel size 4 and embedding dimension 32 in channel two, and similarly different settings (number of filters, kernel size and embedding dimension) in the third channel. I have only one input for all three channels. I want every channel to convolve over the same input differently and try to find different patterns.
    Can you please shed light on this architecture?
    Thank you

    • Avatar
      Adrian Tam December 15, 2021 at 7:19 am #

      I don’t think the Sequential model allows you to do this, but you can construct one using the functional API in Keras.
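
      A minimal functional API sketch of such a model follows. The first two channels use the settings from the question; the third channel's settings, the sizes length and vocab_size, the dropout rate, the use of GlobalMaxPooling1D, and the helper name channel are all assumptions made for illustration.

```python
from tensorflow.keras.layers import (Input, Embedding, Conv1D, Dropout,
                                     GlobalMaxPooling1D, Dense, concatenate)
from tensorflow.keras.models import Model

# Assumed sizes, for illustration only.
length, vocab_size = 100, 5000

def channel(inputs, embed_dim, filters, kernel_size):
    """One channel with its own embedding, filter count and kernel size."""
    x = Embedding(vocab_size, embed_dim)(inputs)
    x = Conv1D(filters=filters, kernel_size=kernel_size, activation='relu')(x)
    x = Dropout(0.5)(x)
    return GlobalMaxPooling1D()(x)

# A single input is shared by all three channels.
inputs = Input(shape=(length,))
c1 = channel(inputs, embed_dim=64, filters=128, kernel_size=6)
c2 = channel(inputs, embed_dim=32, filters=64, kernel_size=4)
c3 = channel(inputs, embed_dim=16, filters=32, kernel_size=2)  # assumed values

# Channel outputs have different widths but can still be concatenated.
merged = concatenate([c1, c2, c3])
dense1 = Dense(10, activation='relu')(merged)
outputs = Dense(1, activation='sigmoid')(dense1)

model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

      Because each channel is pooled down to one value per filter before merging, the channels do not need matching sequence lengths or embedding sizes; the merged vector here has 128 + 64 + 32 = 224 features.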

  77. Avatar
    HS December 15, 2021 at 5:35 pm #

    Thank you so much for your quick response.
    Yes, I am talking about the same idea discussed in this article, using the functional API in Keras. You used the same number of filters and the same embedding dimension in each channel, so I want to explore this idea further by using a different number of filters, different kernel sizes and different embedding dimensions in each channel.
    One more thing: I am using the same input for all three channels, and all the strings in my dataset that I pass as input have a fixed length.
    So what would be a suitable name for this architecture? Multichannel or multilayer?
    Your response will be highly appreciated.

    • Avatar
      Adrian Tam December 17, 2021 at 6:52 am #

      If you have three channels, definitely multichannel. But it can also be multilayer at the same time.

      • Avatar
        HS December 17, 2021 at 5:46 pm #

        Thank you for being so cooperative

        • Avatar
          Adrian Tam December 19, 2021 at 1:33 pm #

          You’re welcome.
