How to Develop a Multichannel CNN Model for Text Classification

A standard deep learning model for text classification and sentiment analysis uses a word embedding layer and a one-dimensional convolutional neural network.

The model can be expanded by using multiple parallel convolutional neural networks that read the source document using different kernel sizes. This, in effect, creates a multichannel convolutional neural network for text that reads text with different n-gram sizes (groups of words).

In this tutorial, you will discover how to develop a multichannel convolutional neural network for sentiment prediction on text movie review data.

After completing this tutorial, you will know:

  • How to prepare movie review text data for modeling.
  • How to develop a multichannel convolutional neural network for text in Keras.
  • How to evaluate a fit model on unseen movie review data.

Let’s get started.

  • Update Feb/2018: Small code change to reflect changes in Keras 2.1.3 API.
How to Develop an N-gram Multichannel Convolutional Neural Network for Sentiment Analysis
Photo by Ed Dunens, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Movie Review Dataset
  2. Data Preparation
  3. Develop Multichannel Model
  4. Evaluate Model

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as “v2.0”.

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from a newsgroup archive. The authors refer to this dataset as the “polarity dataset.”

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat; for example:

  • The dataset is comprised of only English reviews.
  • All text has been converted to lowercase.
  • There is white space around punctuation like periods, commas, and brackets.
  • Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of machine learning models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments of modern methods.

… depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

You can download the dataset from here:

After unzipping the file, you will have a directory called “txt_sentoken” with two sub-directories, “neg” and “pos”, containing the text of the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention of cv000 to cv999 in each of neg and pos.

Next, let’s look at loading and preparing the text data.

Data Preparation

In this section, we will look at 3 things:

  1. Separating the data into training and test sets.
  2. Loading and cleaning the data to remove punctuation and numbers.
  3. Preparing all reviews and saving them to file.

Split into Train and Test Sets

We are pretending that we are developing a system that can predict the sentiment of a textual movie review as either positive or negative.

This means that after the model is developed, we will need to make predictions on new textual reviews. This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the model.

We will ensure that this constraint is built into the evaluation of our models by splitting the training and test datasets prior to any data preparation. This means that any knowledge in the data in the test set that could help us better prepare the data (e.g. the words used) is unavailable in the preparation of data used for training the model.

That being said, we will use the last 100 positive reviews and the last 100 negative reviews as a test set (200 reviews) and the remaining 1,800 reviews as the training dataset.

This is a 90% train, 10% test split of the data.

The split can be imposed easily by using the filenames of the reviews where reviews named 000 to 899 are for training data and reviews named 900 onwards are for test.

Loading and Cleaning Reviews

The text data is already pretty clean; not much preparation is required.

Without getting bogged down too much by the details, we will prepare the data in the following way:

  • Split tokens on white space.
  • Remove all punctuation from words.
  • Remove all words that are not purely comprised of alphabetical characters.
  • Remove all words that are known stop words.
  • Remove all words that have a length <= 1 character.

We can put all of these steps into a function called clean_doc() that takes as an argument the raw text loaded from a file and returns a list of cleaned tokens. We can also define a function load_doc() that loads a document from file ready for use with the clean_doc() function. An example of cleaning the first positive review is listed below.
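The cleaning steps listed above can be sketched as follows. This is a minimal illustration rather than the original listing: the small STOP_WORDS set is a stand-in assumed here so the example runs without extra downloads, and a fuller English stop word list (for example the one shipped with NLTK) could be used in its place. The sample sentence is made up for the demonstration.

```python
import string

# Stand-in stop word list for illustration only (an assumption of this sketch).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "this", "to", "in"}

def load_doc(filename):
    """Read a whole review file into one string."""
    with open(filename, "r") as handle:
        return handle.read()

def clean_doc(doc):
    """Turn raw review text into a list of cleaned tokens."""
    # split tokens on white space
    tokens = doc.split()
    # remove all punctuation from each token
    table = str.maketrans("", "", string.punctuation)
    tokens = [word.translate(table) for word in tokens]
    # keep only tokens made up purely of alphabetical characters
    tokens = [word for word in tokens if word.isalpha()]
    # drop known stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # drop tokens with a length <= 1 character
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

print(clean_doc("this film is a gem , and the cast ( all 12 of them ) shine !"))
# prints ['film', 'gem', 'cast', 'all', 'them', 'shine']
```

Note that the tokenizing and encoding steps later in the tutorial operate on strings, so each token list is joined back into one string with ' '.join(tokens) before it is saved, as discussed in the comments.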

Running the example loads and cleans one movie review.

The tokens from the clean review are printed for review.

Clean All Reviews and Save

We can now use the function to clean reviews and apply it to all reviews.

To do this, we will develop a new function named process_docs() below that will walk through all reviews in a directory, clean them, and return them as a list.

We will also add an argument to the function to indicate whether the function is processing train or test reviews, that way the filenames can be filtered (as described above) and only those train or test reviews requested will be cleaned and returned.

The full function is listed below.

We can call this function with negative training reviews as follows:
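A sketch of such a function and call is shown here, under a few assumptions: the helper is_test_file() and the clean parameter are hypothetical names introduced for this illustration, the clean argument stands in for clean_doc() and defaults to a plain white space split so the block runs on its own, and each cleaned review is stored as one space-separated string.

```python
import os

def is_test_file(filename):
    """Reviews named cv900 onwards form the test set; cv000 to cv899 are for training."""
    return filename.startswith("cv9")

def process_docs(directory, is_train, clean=str.split):
    """Clean every train (or test) review in a directory and return them as a list."""
    documents = []
    for filename in sorted(os.listdir(directory)):
        # skip any file that belongs to the other split
        if is_train == is_test_file(filename):
            continue
        path = os.path.join(directory, filename)
        with open(path, "r") as handle:
            tokens = clean(handle.read())
        # store each review as one space-separated string
        documents.append(" ".join(tokens))
    return documents

# Example call (the path is assumed from the unzipped archive layout):
# negative_train = process_docs("txt_sentoken/neg", True)
```

Passing is_train=False returns only the held-out reviews, so the same function prepares both splits.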

Next, we need labels for the train and test documents. We know that we have 900 training documents and 100 test documents for each class. We can use a Python list comprehension to create the labels for the negative (0) and positive (1) reviews for both train and test sets.
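As a small sketch, the labels might be built like this. It assumes the negative documents are placed before the positive documents in each dataset; the variable names are illustrative.

```python
# 900 negative (0) labels followed by 900 positive (1) labels for training
train_labels = [0 for _ in range(900)] + [1 for _ in range(900)]
# 100 negative (0) labels followed by 100 positive (1) labels for testing
test_labels = [0 for _ in range(100)] + [1 for _ in range(100)]

print(len(train_labels), len(test_labels))
# prints 1800 200
```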

Finally, we want to save the prepared train and test sets to file so that we can load them later for modeling and model evaluation.

The function below, named save_dataset(), will save a given prepared dataset (X and y elements) to a file using the pickle API.
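A minimal version of such a function could look like this. The [documents, labels] layout of the dataset is an assumption of the sketch.

```python
import pickle

def save_dataset(dataset, filename):
    """Pickle a prepared dataset, e.g. a [documents, labels] pair, to a file."""
    with open(filename, "wb") as handle:
        pickle.dump(dataset, handle)
    print("Saved: %s" % filename)

# Usage, with hypothetical variable names:
# save_dataset([train_docs, train_labels], "train.pkl")
```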

Complete Example

We can tie all of these data preparation steps together.

The complete example is listed below.

Running the example cleans the text movie review documents, creates labels, and saves the prepared data for both train and test datasets in train.pkl and test.pkl respectively.

Now we are ready to develop our model.

Develop Multichannel Model

In this section, we will develop a multichannel convolutional neural network for the sentiment analysis prediction problem.

This section is divided into 3 parts:

  1. Encode Data
  2. Define Model
  3. Complete Example

Encode Data

The first step is to load the cleaned training dataset.

The function below, named load_dataset(), can be called to load the pickled training dataset.
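A minimal sketch of the loader, assuming the file holds a [documents, labels] pair written with the pickle API:

```python
import pickle

def load_dataset(filename):
    """Load a previously pickled dataset, e.g. train.pkl."""
    with open(filename, "rb") as handle:
        return pickle.load(handle)

# Usage, with hypothetical variable names:
# train_lines, train_labels = load_dataset("train.pkl")
```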

Next, we must fit a Keras Tokenizer on the training dataset. We will use this tokenizer to both define the vocabulary for the Embedding layer and encode the review documents as integers.

The function create_tokenizer() below will create a Tokenizer given a list of documents.

We also need to know the maximum length of input sequences as input for the model and to pad all sequences to the fixed length.

The function max_length() below will calculate the maximum length (number of words) for all reviews in the training dataset.
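One way to write it is sketched here. It expects each review to be a single space-separated string; passing token lists instead raises the "'list' object has no attribute 'split'" error reported in the comments.

```python
def max_length(lines):
    """Return the length, in words, of the longest document."""
    return max(len(line.split()) for line in lines)

print(max_length(["good film", "a truly awful waste of time"]))
# prints 6
```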

We also need to know the size of the vocabulary for the Embedding layer.

This can be calculated from the prepared Tokenizer, as follows:

Finally, we can integer encode and pad the clean movie review text.

The function below named encode_text() will both encode and pad text data to the maximum review length.
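To make the encoding steps concrete, here is a dependency-free sketch of what they compute. It is not the Keras implementation: build_word_index() is a hypothetical stand-in for fitting a Tokenizer, its alphabetical tie-breaking is a choice made here for determinism, and padding with zeros at the end of each sequence is an assumption of the sketch. In Keras, the same roles are played by Tokenizer.fit_on_texts(), Tokenizer.texts_to_sequences(), and pad_sequences().

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode_text(word_index, lines, length):
    """Integer encode each document, then pad with 0 (or cut) to `length`."""
    encoded = []
    for line in lines:
        # words missing from the vocabulary are skipped
        ids = [word_index[w] for w in line.split() if w in word_index]
        # cut long sequences at the end and pad short ones with zeros
        ids = ids[:length] + [0] * max(0, length - len(ids))
        encoded.append(ids)
    return encoded

docs = ["great film great cast", "dull film"]
word_index = build_word_index(docs)
vocab_size = len(word_index) + 1  # +1 because id 0 is reserved for padding
length = max(len(d.split()) for d in docs)
print(vocab_size, encode_text(word_index, docs, length))
# prints 5 [[2, 1, 2, 3], [4, 1, 0, 0]]
```

The vocabulary size is one larger than the number of distinct words because index 0 never maps to a word, which is why the Embedding layer is sized with len(tokenizer.word_index) + 1.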

Define Model

A standard model for document classification is to use an Embedding layer as input, followed by a one-dimensional convolutional neural network, pooling layer, and then a prediction output layer.

The kernel size in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document, providing a grouping parameter.

A multi-channel convolutional neural network for document classification involves using multiple versions of the standard model with different sized kernels. This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the model learns how to best integrate these interpretations.

This approach was first described by Yoon Kim in his 2014 paper titled “Convolutional Neural Networks for Sentence Classification.”

In the paper, Kim experimented with static and dynamic (updated) embedding layers; we can simplify the approach and instead focus only on the use of different kernel sizes.

This approach is best understood with a diagram taken from Kim’s paper:

Depiction of the multiple-channel convolutional neural network for text.
Taken from “Convolutional Neural Networks for Sentence Classification.”

In Keras, a multiple-input model can be defined using the functional API.

We will define a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text.

Each channel is comprised of the following elements:

  • Input layer that defines the length of input sequences.
  • Embedding layer set to the size of the vocabulary and 100-dimensional real-valued representations.
  • One-dimensional convolutional layer with 32 filters and a kernel size set to the number of words to read at once.
  • Max Pooling layer to consolidate the output from the convolutional layer.
  • Flatten layer to reduce the three-dimensional output to two-dimensional for concatenation.

The outputs from the three channels are concatenated into a single vector and processed by a Dense layer and an output layer.
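A short worked calculation shows how large that concatenated vector becomes. It assumes an unpadded ('valid') convolution with stride 1 and a max pooling size of 2; the padded review length of 1,000 words is a hypothetical figure chosen for illustration, and channel_features() is a helper name introduced here.

```python
def channel_features(length, kernel_size, filters=32, pool_size=2):
    """Number of values one channel contributes after Conv1D, pooling and Flatten."""
    conv_steps = length - kernel_size + 1   # 'valid' convolution, stride 1
    pooled_steps = conv_steps // pool_size  # non-overlapping max pooling
    return pooled_steps * filters

length = 1000  # hypothetical padded review length
sizes = [channel_features(length, k) for k in (4, 6, 8)]
print(sizes, sum(sizes))
# prints [15936, 15904, 15872] 47712
```

Larger kernels produce slightly shorter outputs because fewer window positions fit inside the document.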

The function below defines and returns the model. As part of defining the model, a summary of the defined model is printed and a plot of the model graph is created and saved to file.
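A sketch of such a function using the Keras functional API is given here. It imports Keras through the tensorflow.keras namespace, which is an assumption about the installed environment. The dropout rate of 0.5 before pooling, the pool size of 2, the 10-unit Dense layer, the ReLU and sigmoid activations, and the Adam optimizer are also assumptions of this sketch rather than confirmed details of the original listing, and build_channel() is a helper name introduced here.

```python
from tensorflow.keras.layers import (Concatenate, Conv1D, Dense, Dropout,
                                     Embedding, Flatten, Input, MaxPooling1D)
from tensorflow.keras.models import Model

def build_channel(length, vocab_size, kernel_size):
    """One channel: input, embedding, convolution, dropout, pooling, flatten."""
    inputs = Input(shape=(length,))
    x = Embedding(vocab_size, 100)(inputs)
    x = Conv1D(filters=32, kernel_size=kernel_size, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = MaxPooling1D(pool_size=2)(x)
    x = Flatten()(x)
    return inputs, x

def define_model(length, vocab_size, kernel_sizes=(4, 6, 8)):
    """Build and compile a model with one channel per kernel size."""
    channels = [build_channel(length, vocab_size, k) for k in kernel_sizes]
    inputs = [channel_input for channel_input, _ in channels]
    # merge the flattened channel outputs into a single vector
    merged = Concatenate()([flat for _, flat in channels])
    hidden = Dense(10, activation="relu")(merged)
    outputs = Dense(1, activation="sigmoid")(hidden)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.summary()
    # A graph plot can be saved with tensorflow.keras.utils.plot_model,
    # which needs pydot and graphviz installed:
    # plot_model(model, show_shapes=True, to_file="multichannel.png")
    return model
```

Because the model has three inputs, the same encoded documents are passed three times when fitting or predicting, for example model.fit([trainX, trainX, trainX], labels), where labels is a NumPy array.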

Complete Example

Pulling all of this together, the complete example is listed below.

Running the example first prints a summary of the prepared training dataset.

Next, a summary of the defined model is printed.

The model is fit relatively quickly and appears to show good skill on the training dataset.

A plot of the defined model is saved to file, clearly showing the three input channels for the model.

Plot of the Multichannel Convolutional Neural Network For Text

The model is fit for a number of epochs and saved to the file model.h5 for later evaluation.

Evaluate Model

In this section, we can evaluate the fit model by predicting the sentiment on all reviews in the unseen test dataset.

Using the data loading functions developed in the previous section, we can load and encode both the training and test datasets.

We can load the saved model and evaluate it on both the training and test datasets.
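The evaluation step might be wrapped in a small helper like the one sketched here. The name evaluate_multichannel() is hypothetical, and the sketch assumes the model was compiled with accuracy as its only metric, so that evaluate() returns a loss followed by an accuracy.

```python
import numpy as np

def evaluate_multichannel(model, encoded_docs, labels, n_channels=3):
    """Return the accuracy, in percent, of a multichannel model on one dataset."""
    # the same encoded documents are fed to every input channel
    inputs = [np.asarray(encoded_docs)] * n_channels
    _, accuracy = model.evaluate(inputs, np.asarray(labels), verbose=0)
    return accuracy * 100

# Usage, assuming a model saved earlier as model.h5 and encoded test data:
# from tensorflow.keras.models import load_model
# model = load_model("model.h5")
# print("Test Accuracy: %f" % evaluate_multichannel(model, testX, test_labels))
```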

The complete example is listed below.

Running the example prints the skill of the model on both the training and test datasets.

We can see that, as expected, the skill on the training dataset is excellent, here at 100% accuracy.

We can also see that the skill of the model on the unseen test dataset is very impressive, achieving 87.5%, which is above the skill of the model reported in the 2014 paper (although not a direct apples-to-apples comparison).


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Different n-grams. Explore the model by changing the kernel size (the n-gram size, i.e. the number of words read at once) used by the channels in the model to see how it impacts model skill.
  • More or Fewer Channels. Explore using more or fewer channels in the model and see how it impacts model skill.
  • Deeper Network. Convolutional neural networks perform better in computer vision when they are deeper. Explore using deeper models here and see how it impacts model skill.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to develop a multichannel convolutional neural network for sentiment prediction on text movie review data.

Specifically, you learned:

  • How to prepare movie review text data for modeling.
  • How to develop a multichannel convolutional neural network for text in Keras.
  • How to evaluate a fit model on unseen movie review data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

153 Responses to How to Develop a Multichannel CNN Model for Text Classification

  1. Hitkul January 12, 2018 at 5:54 pm #

    Great article
    I think there is a small error, According to this code trainLines would be a list of list, where each list holds tokens for one review. But all the functions to which trainLines is passed(i.e texts_to_sequences, fit_on_texts, max_length) takes a list of strings as input. I think all the lists in trainX and testX should be converted to strings before dumping them into a file.

    • Jason Brownlee January 13, 2018 at 5:30 am #

      The list of tokens is turned back into a string:

      tokens = ' '.join(tokens)
      • Minel March 5, 2019 at 9:27 pm #

        Hello Jason
        yes but this instruction is missing in your code
        I got a traceback with this message list object has no attribute lower

        Si I add one line in clean_doc function
        tokens = [word for word in tokens if len(word) > 1]
        tokens = ‘ ‘.join(tokens)
        return tokens
        It works well

  2. Marko Plahuta January 12, 2018 at 6:41 pm #

    thanks for the article. Would it be possible to use only one input and one embedding layer, and branch into convolutions after that?

    • Jason Brownlee January 13, 2018 at 5:31 am #

      How would that work?

      • Francesco January 25, 2018 at 9:13 pm #

        Wouldn’t the model below be exactly the same? (with just one input, used in the three channels?) Of course, if we also had a single embedding (put the embedding before the channels, and let all convs act on that one) the model would be different.

        • Jason Brownlee January 26, 2018 at 5:40 am #

          Yes, nice approach.

          How does it compare regarding training/prediction?

  3. Adnan ÖNCEVARLIK January 13, 2018 at 12:31 am #

    Thank you for your effort and good clean article.

  4. Sébastien January 14, 2018 at 3:18 am #

    FYI, the link to the review polarity dataset is wrong. The correct one is:

  5. Ahmet January 14, 2018 at 3:23 am #

    Hi Jason,
    Thanks for your this beautiful work. I want to ask you that we are able to evaluate the accuracy of the model but how can we predict the classes of tested documents to analyze results in detail. In a Sequential model we can perform it like model.predict_classes(x_test). However for the Model(inputs=…) object predict_classes feature is not supported. Do you have a suggestion?

    • Jason Brownlee January 14, 2018 at 6:40 am #

      Yes, the document must be provided in an array 3 times.

      yhat = model.predict([doc, doc, doc])

      • ahmet January 14, 2018 at 7:53 am #

        Yes I also tried like that but I am not sure how to interpret the probabilities for multiclass classification.

        • Jason Brownlee January 15, 2018 at 6:54 am #

          What is the problem exactly?

          • Ahmet January 15, 2018 at 7:29 pm #

            Thanks for interest, Jason.
            I am actually trying to apply your post for a multiclass case (0:negative,1:neutral,2:positive). After training part I want to compare the accuracy rates for each class to measure the how model is accurate in detail. Then I want to calculate the Precision and Recall. However, I can not use predict_classes() function in Model() object it is just allowed for Sequential() object. When I prefer the fuction model.predict([test_doc, test_doc, test_doc]) it gives me some probabilities below but I am not sure how to map them to class labels (0,1,2).


            [0.7408592 ]

          • Jason Brownlee January 16, 2018 at 7:32 am #

            You can use argmax on the result from predict() to get a class index.

    • rango July 4, 2019 at 2:05 pm #

      hi Jason,
      I also try your post for text classification.However when I try to apply argmax() function like you said I get an array of 0.

      array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0])

      Do you Have any suggestion? Thank in advance.

      Before predict class I prepare text sample like this:

      sample = “This is sample text for classification”
      sequences = tokenizer.texts_to_sequences(sample)
      test_data = pad_sequences(sequences, maxlen=maxlen)

      predict = model.predict([test_data,test_data,test_data])

      If I do something wrong plz correct me! Thank You

  6. Rui January 14, 2018 at 4:33 am #

    I have tried similar architectures and came to the conclusion that this type of architecture (with parallel paths) are not good because when the error is back propagated one of the good paths ,that would be learning in the right way, will be affected by a bad path that will increase tha global error. So one good path would be in the right way but it will think it is a bad learning because the global error will increase . (is this ideia right ? )

    whe I have “parallel layers” I have to find a way to pretrain before … and the freeze the values and add this layers to the model.

  7. JP Raymond January 14, 2018 at 12:34 pm #

    The architecture described in Yoon Kim’s paper:
    – has one embedding layer (not one for each branch),
    – uses global (max-over-time) pooling (not with a pool of size 2),
    – applies dropout once, on the concatenation of the max features (not in each branch before the pooling operation).

    Do the changes proposed here yield a better predictive performance?

  8. Dev January 18, 2018 at 3:10 am #

    Hi Jason,

    Thanks for this article. Very helpful.

    I am facing an issue with the output, like getting different results everytime when I run the same model. The loss and accuracy are also changing all the time.

    Any thoughts on this ?

  9. Ramzy January 21, 2018 at 7:40 am #

    Hi Jason,

    Thanks for the great tutorial, I performed all the steps and in the last step when I tried to evaluate the model, I got the below results!

    Train Accuracy: 50.000000
    Test Accuracy: 50.000000
    What does this mean! I think this is not logical.

    Also I got some warning like the below while running the evaluation code.

    2018-01-20 22:21:41.333198: W c:\tf_jenkins\home\workspace\release-win\m\windows
    \py\36\tensorflow\core\platform\] The TensorFlow library
    wasn’t compiled to use AVX2 instructions, but these are available on your machin
    e and could speed up CPU computations.
    2018-01-20 22:21:41.333808: W c:\tf_jenkins\home\workspace\release-win\m\windows
    \py\36\tensorflow\core\platform\] The TensorFlow library
    wasn’t compiled to use FMA instructions, but these are available on your machine
    and could speed up CPU computations.

    I’ll appreciate your reply. Thanks in advance

  10. Mahfuza January 24, 2018 at 11:40 am #

    Hi, First of all thank you so much for the article.

    I tried to utilize the multichannel concept in a different way. I am working with relation extraction in natural language where relations are marked already but needs to figure out whether or not there is a valid sentence structure in between those entities. I am thinking to get different representations of a same sentence. For an example sentence (Paris is the capital of France), the real words of the sentence (Paris, is, the, capital, of, France) as first channel’s data, the POS tags of those same tokens (propn, verb, det, noun, adp, propn) as second channel’s data and, dependency tree tags (nsubj, root, det, attr, prep, pobj) as third channel’s data.

    I am confused whether I need to encode all the words, pos tags, and dep tags thinking them as a part of the same vocabulary? Or I need to encode those different tokens/representations in their own vocabulary scope?


    • Jason Brownlee January 24, 2018 at 4:39 pm #

      Interesting approach.

      I guess you could try a few different representations and test whether there is any impact on model skill?

  11. Mahfuza January 25, 2018 at 5:51 am #

    Actually I am confused on the text to integer encoding step. I mean, should I consider all the different representations of the same sentence as a single list of documents? For example lets say I got three different representations of the sentence “Paris is the capital of France” and collected them in a single document list like below –

    all_docs = [[Paris is the capital of France]…, [propn verb det noun adp propn]…, [nsubj root det attr prep pobj]…]

    Now if I fit the tokenizer like tokenizer.fit_on_texts(all_docs) into these documents then all the tokens are having different unique integers, right? Is it then valuable if I feed those three encoded representations separately into different channels like input1 = [ints of tokens], input2 = [more ints of tokens], input3 = [some ints of tokens]?


    I should consider fitting the tokenizer in three different representations separately and then encode them separately based on the fitted tokenizer. I think the tokenizer would provide some common integers then. Because the encoding scopes are isolated. Will that help in the concatenated layer some how?

    Thanks again Jason for your reply and effort for this blog.

  12. Francesco January 25, 2018 at 9:19 pm #

    Hi Jason,

    would it be possible, looking at the activation patterns, to identify the n-ngrams that mostly affected the classification? With pictures it is possible, but I suppose that with text it should be very difficult, with the embedding inbetween…


    • Jason Brownlee January 26, 2018 at 5:40 am #

      It may be, but some hard thinking and development would be required. It’s not obvious sorry.

  13. Massa February 1, 2018 at 7:32 am #


    Thank you for this neat article! 🙂

    I am facing an issue.

    AttributeError: ‘int’ object has no attribute ‘ndim’

    • Jason Brownlee February 2, 2018 at 8:02 am #

      Did you try copying all of the code from the final example?

    • Mia February 9, 2018 at 5:31 pm #

      I also had the same issue. The reason is that variable is a list object while Keras expects a numpy array. I solved the problem by importing numpy and inserting “trainLabels = numpy.array(trainLabels)”

      • Mia February 9, 2018 at 5:35 pm #

        i mean variable trainLabels .

        • Jason Brownlee February 14, 2018 at 11:03 am #

          I have updated the example to correct this issue.

          It seems to be a change in Keras 2.1.3.

  14. JanneK February 5, 2018 at 5:44 am #

    Thank for this useful post Jason. It’s interesting that even such a minimalistic preprocessing always seem to work fine. When doing text classification (or regression) tasks, do you generally see any value in keeping/including additional information during preprocessing, such as:

    (1) Basic sentence-separating punctuations, such as “.”, “!” and “?”
    (2) Paragraph separators between groups of sentences
    (3) POS tag augmented tokens, e.g., “walking” –> “walking_VERB”, “apple” –> “apple_NOUN”

    Do these typically have any significant impact?

    • Jason Brownlee February 5, 2018 at 7:53 am #

      It really depends on the application I’m afraid.

      For simple classification like tasks, often simpler representations result in better model skill.

  15. Jason H February 7, 2018 at 7:13 am #

    You should be able to comment out inputs2, embedding2, inputs3 and embedding3, and then feed embedding1 to conv2 and conv3, right? Then you go from [x_in, x_in, x_in] to just x_in for fit(), and predict().

  16. Franco Arda February 14, 2018 at 7:01 pm #

    Hi Jason,

    In your NLP book (Chapter 16), you cover n-gram CNN in depth. LSTM’s memory mimic n-grams.

    May I ask you why you not use a LSTM classification instead?


    • Jason Brownlee February 15, 2018 at 8:39 am #

      You can use LSTMs, here I was demonstrating how to do it with CNNs.

      • Franco February 15, 2018 at 8:16 pm #

        Thanks Jason!

        Indeed, you have a NLP sample in your LSTM book.

        • Boris February 21, 2018 at 2:15 am #

          Hi Jason,

          Thanks for yet another amazing post. Apart from purely educational purposes, what do you think would be the differences between using LSTM and CNN in sentiment analysis? Are both approaches completely interchangeable or each of the models might hold advantages/limitations in different setups.


          • Jason Brownlee February 21, 2018 at 6:41 am #

            CNN seems to achieve state of the art results. I would start there.

  17. Satheesh March 24, 2018 at 10:25 pm #

    Can we apply this approach to multi class problems? The only change is during fitting and using ‘categorical cross entropy” as loss.Am I correct

  18. Marwa May 5, 2018 at 3:24 am #

    Hi Jason,
    I am beginner on keras and CNN ,I want to know How to give multinputs to train_test_split?

    my example:(Xtrain_user,Xtrain_item,Xtest_user,Xtest_item),y_train,y_test=train_test_split((user_reviews,
    item_reviews),rating , test_size=0.2, random_state=42)

    which X= (user_reviews,item_reviews)
    and Y=rating

  19. Junaid Akhtar May 11, 2018 at 3:19 pm #

    First of all, thank you for what you did here, and generally what you do 🙂

    I tested your code on the imdb data, got the similar results, then changed the output to 2 neurons with softmax/argmax training/testing, got marginally better results.

    Then I switched to my own 5-class sentiment dataset, but got 98%-ish traning accuracy but 40%-ish testing accuracy, which means its over-fitting, but then read in many papers that generally in the 5-grained SST1 dataset everyone reports an accuracy of 40%-ish. With binary +/- sentiments of course everyone reaches late 80s-90s in accuracy.

    What do you suggest to that? Or a setting I could try on top of your code.

    • Jason Brownlee May 12, 2018 at 6:26 am #

      The model architecture may require tuning to each different problem.

  20. Ethan Schmit May 31, 2018 at 7:53 pm #

    Hi Jason ,

    Thanks for the great article.

    If I have more than 2 classes, e.g. positive/negative/neutral, what will my output list (trainLabels) look like? Will it be a a list containing 0’s, 1’s and 2’s or will it be a n*3 matrix, with each column containing 0’s and 1’s?

    Thanks a lot

  21. az June 28, 2018 at 4:01 am #

    i have created a network which takes two sequences of integers (2 inputs) for one sentence (one related to word embeddings and other related to POS tags) and corresponding embedding layers and then merges them both together before applying the convolution layer. I had called it multi input model.

    inp = Input(shape=(data.shape[1],))
    e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
    inp2 = Input(shape=(embedding_pos.shape[1],))
    e1= Embedding(54, 54, input_length=maxSeqLength, weights=[embedding_matrix_pos], trainable=False)(inp2)


    conv1_1 = Conv1D(filters=100, kernel_size=3)(mer)
    conv1_2 = Conv1D(filters=100, kernel_size=4)(mer)

    From this article i understood that it might be multichannel. I am confused that is multiple inputs is necessarily multichannel or multiple parallel convolution layers even on same input is multichannel? If i consider the second definition i would call my network multi-input multichannel?
    Also in this case all the convolutions are applied to original input.What is the difference if i stack up convolutions so the result of one is input to another?Thanks.

    • Jason Brownlee June 28, 2018 at 6:25 am #

      If the inputs are different, it might be better to have a multi-headed CNN rather than a multichannel CNN.

  22. az June 28, 2018 at 8:51 am #

    The inputs are sequence of the same sentence and both are padded to same would it be called different inputs still? and can you please clarify if multichannel is multi-input regardless of parallel or sequential convolutions?

    • Jason Brownlee June 28, 2018 at 2:07 pm #

      Each head of the model will “read” the data differently.

      Perhaps referring to the multiple inputs to the model as “channels” was a poor choice. Different inputs as in the above model are indeed different to different channels for one input. Importantly, we can use different sized kernels and number of filters with different inputs, whereas each channel on one input is fixed to use the same kernel and number of filters.

    • az July 12, 2018 at 4:45 am #

      Thanks alot. I understand that now that multichannel means one input with one or more embeddings for that one input.In your example you are using input multiple times but still its the same input. I have taken your advice and applied convolution of different filter size to each embedding separately instead of merging them together which has improved accuracy. Only one thing,it means it is multi input, i can understand how it can be seen as multi basically having multiple inputs makes it multiheaded?Thanks

      • Jason Brownlee July 12, 2018 at 6:30 am #


        • az July 28, 2018 at 4:11 am #

          Thank you so much! The precision and recall graphs i plotted through training epochs show sudden spikes in precision while gradual increase in recall.Though recall is higher than precision. Now my dataset has more positive samples than negative ones which lead me to believe that there is chance higher number of FN than FP i.e. lower recall and higher precision but my results are opposite. Can you please elaborate if i have the wrong understanding?

          • az July 28, 2018 at 4:44 am #

            Or should I see it this way: since it has more positive samples, the classifier is biased towards the positive class, leading to more FP and hence lower precision and higher recall? Thanks

          • Jason Brownlee July 28, 2018 at 6:40 am #

            You will need to find a trade-off that makes sense for your specific application and problem.

  23. Yuval July 11, 2018 at 11:20 pm #

    I plotted the precision-recall curve for both the pos and neg classes and found the results interesting.
    While the curve for the pos class looks very good, with an optimal point at 0.9 and 0.8 for recall and precision respectively, the curve for the neg class is a straight line that gives 0.6 for precision and recall, or 0.8 with 0.6 and vice versa.
    Any idea how this could be?

    • Jason Brownlee July 12, 2018 at 6:25 am #

      It may suggest that one class is easier to predict than the other.

  24. Zenon Uchida July 22, 2018 at 9:39 pm #

    I did exactly what the tutorial does but I got a lower test accuracy, 83.5%. Training accuracy is 100%.

    • Jason Brownlee July 23, 2018 at 6:09 am #

      You could try running the example a few times to see if you get differing results.

      • Zenon Uchida August 5, 2018 at 2:45 am #

        (forgot to reply) Yes, actually got different results after running it a few times.
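
Because neural networks are trained stochastically, a single run can be misleading. A minimal sketch of summarizing several runs follows. Here `run_once` is a hypothetical callable that would fit and evaluate the model once and return its test accuracy; the stand-in below only simulates scores so that the snippet runs on its own.

```python
import random
from statistics import mean, stdev


def summarize_runs(run_once, n_repeats=5):
    """Call run_once() n_repeats times and summarize the returned scores."""
    scores = [run_once() for _ in range(n_repeats)]
    return mean(scores), stdev(scores), scores


# Stand-in for "fit the model and return test accuracy".
# These are simulated numbers, not real results from the tutorial model.
rng = random.Random(1)


def fake_run():
    return 0.85 + rng.uniform(-0.02, 0.02)


avg, spread, scores = summarize_runs(fake_run, n_repeats=5)
print("mean=%.3f std=%.3f over %d runs" % (avg, spread, len(scores)))
```

Reporting the mean and standard deviation over repeated runs gives a fairer picture than any one score.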

  25. andi wijaya July 28, 2018 at 2:03 am #

    def max_length(lines):
        return max([len(s.split()) for s in lines])
    I just copied and pasted the code and got this error. I believe the code is actually right; I don’t know why I got an error:
    AttributeError: 'list' object has no attribute 'split'

    Python 3.6
    Tensorflow 1.9
    Keras 2.2
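
One way this error can arise (an assumption, since the calling code is not shown) is passing documents that are already tokenized into lists rather than documents stored as strings. The sketch below reproduces the error and shows a hypothetical variant, `max_length_any`, that accepts either form.

```python
def max_length(lines):
    """Length in whitespace-separated tokens of the longest document (strings only)."""
    return max(len(s.split()) for s in lines)


def max_length_any(docs):
    """Hypothetical variant: accepts documents as strings or as lists of tokens."""
    return max(len(d.split()) if isinstance(d, str) else len(d) for d in docs)


docs_as_strings = ["a great movie", "bad"]
docs_as_tokens = [["a", "great", "movie"], ["bad"]]

print(max_length(docs_as_strings))

try:
    max_length(docs_as_tokens)  # each item is a list, which has no .split()
except AttributeError as err:
    print("AttributeError:", err)

print(max_length_any(docs_as_tokens))
```

If the documents are token lists, joining each one with `' '.join(tokens)` before calling `max_length` is another option.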

  26. Zenon Uchida August 4, 2018 at 11:59 pm #

    I have a question on the “Plot of the Multichannel Convolutional Neural Network For Text”.
    Why is there always a “None” per box? What does it denote?

    • Zenon Uchida August 5, 2018 at 2:02 am #

      Found the answer: The None dimension in the shape tuple refers to the batch dimension, which simply means that the layer can accept a batch of any size.

    • Jason Brownlee August 5, 2018 at 5:31 am #

      Good question!

      “None” refers to a dimension that is not specified, and in turn is variable.

  27. Zenon Uchida August 5, 2018 at 2:50 am #

    I have a question. Based on the paper “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”, where they state “we set the number of feature maps for this region size to 100”: how do I set the number of feature maps? Is the number of feature maps the number of filters per region size?

    • Jason Brownlee August 5, 2018 at 5:35 am #

      The number of feature maps is the first argument to the convolutional layer Conv1D().
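
To make the relationship concrete, here is a small sketch of the shape arithmetic for one convolutional channel, assuming a stride of 1, no padding, and non-overlapping pooling. The helper names and the document length of 50 are hypothetical; the batch dimension (shown as None in the model plot) is left out.

```python
def conv1d_output_shape(length, filters, kernel_size):
    """(steps, feature_maps) after a stride-1 convolution with no padding."""
    return (length - kernel_size + 1, filters)


def maxpool1d_output_shape(shape, pool_size=2):
    """(steps, feature_maps) after non-overlapping max pooling."""
    steps, feature_maps = shape
    return (steps // pool_size, feature_maps)


length = 50  # hypothetical padded document length
for kernel_size in (3, 4, 5):  # example region sizes
    conv = conv1d_output_shape(length, filters=100, kernel_size=kernel_size)
    pooled = maxpool1d_output_shape(conv)
    print(kernel_size, conv, pooled)
```

The last number in each shape is the number of feature maps, and it always equals the `filters` value regardless of the kernel (region) size.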

  28. Zenon Uchida August 17, 2018 at 1:39 am #

    I really have lots of questions, sorry xD.

    What’s an intuitive way of understanding how this multi-channeled CNN (or perhaps a basic CNN architecture in general) identifies the following (given that the sentiments expressed are only positive or negative):
    1.) If more than one sentiment is expressed in the tweet but the positive sentiment is more dominant, then it is a positive tweet, or
    2.) If more than one sentiment is expressed in the tweet but the negative sentiment is more dominant, then it is a negative tweet.

    • Jason Brownlee August 17, 2018 at 6:34 am #

      We cannot know what the CNN has learned, only the evaluation of the model performance.

      • Zenon Uchida August 17, 2018 at 3:40 pm #

        What I’m trying to understand here is how a CNN is applied to sentence classification. I still don’t get the idea. What I know about CNNs comes from computer vision, where they find features of pictures, for example a dog with features such as ears, nose and eyes.

        • Zenon Uchida August 17, 2018 at 4:56 pm #

          I think I got it.
          In convolutional neural networks every network layer acts as a detection filter for the presence of specific features in the original data. The first layer in a CNN detects (large) features that can be recognized and interpreted relatively easily. Subsequent layers detect increasingly (smaller) features that are more abstract. The last layer of the CNN is able to make an ultra-specific classification by combining all the specific features detected by the previous layers in the input data.

  29. Ishay Telavivi August 19, 2018 at 1:23 am #


    I have a question about the input of the model, based on an error I got.

    I ran this code on different data and had no problems. Then I tried an extension with RandomizedSearchCV, as follows:

    model = KerasClassifier(build_fn=create_model, verbose=1, epochs=3, batch_size=32)
    param_dist = {"n_strides": sp_randint(1, 3)}
    random_grid = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=3)
    random_grid_result = random_grid.fit([X_train, X_train, X_train], y_train)

    I got the following error:
    ValueError: Found input variables with inconsistent numbers of samples: [3, 25000]

    What may the error be? 25000 is the length of my train data

    • Jason Brownlee August 19, 2018 at 6:26 am #

      Sorry, I’m not sure what is going on.

      Perhaps post your code and error to stackoverflow?

      • Ishay Telavivi August 19, 2018 at 7:13 am #

        Sorry for not being clear.

        I have explored this further and found out that the Keras wrapper doesn’t support multi-input networks. That was the problem :(

        So I decided to use Francesco’s comment above (January 25), using one Input for all three channels, and it’s working!

  30. Ishay Telavivi August 19, 2018 at 1:33 am #


    I have a question about negation words.
    When you clean your data by excluding stopwords, you are losing the negation words (‘not’ etc.). Don’t you want to keep these words to get a better interpretation of the sentence?

    For example: “I didn’t like the movie” and “I liked the movie” would look the same after cleaning.

    • Jason Brownlee August 19, 2018 at 6:27 am #

      It really depends on whether they add value on the specific problem you are solving or not.

      Perhaps try modeling with and without them?

      • Ishay Telavivi August 19, 2018 at 7:05 am #

        Great Thanks
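
A minimal sketch of the “with and without” comparison suggested above. The stop word list here is a tiny hypothetical one (the tutorial uses a full English list), and `clean_tokens` is an illustrative helper, not the tutorial’s `clean_doc`.

```python
# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"i", "the", "a", "did", "not", "no", "was", "it"}
NEGATIONS = {"not", "no", "never", "nor"}


def clean_tokens(text, keep_negations=True):
    """Lowercase, expand n't to ' not', and drop stop words."""
    # Handles the straight apostrophe only; curly apostrophes would need normalizing first.
    tokens = text.lower().replace("n't", " not").split()
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [w for w in tokens if w not in stop]


print(clean_tokens("I didn't like the movie", keep_negations=False))
print(clean_tokens("I didn't like the movie", keep_negations=True))
print(clean_tokens("I liked the movie"))
```

With negations kept, the token “not” survives cleaning, so a 2-gram kernel at least has the chance to see “not like” as a unit. Whether that helps is an empirical question for the dataset at hand.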

  31. Suleyman Suleymanzade September 6, 2018 at 7:18 am #

    These are two neural networks that I tried to merge using a concatenate operation. The network should classify IMDB movie reviews as 1 (good) or 0 (bad).
    I have an error in the model’s fit (training):
    history = conc_model.fit([X_train, X_train], [y_train, y_train], np.reshape([y_train, y_train], (25000, 2)), epochs=3, batch_size=(64, 64))
    TypeError: fit() got multiple values for argument 'batch_size'.
    This is the method that should return the trained model. BTW the X_train shape is (25000, 5) and the y_train shape is (25000,).

    def cnn_lstm_merged():
        embedding_vecor_length = 32
        cnn_model = Sequential()
        cnn_model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
        cnn_model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

        lstm_model = Sequential()
        lstm_model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
        lstm_model.add(LSTM(64, activation='relu', return_sequences=True))

        merge = Concatenate([lstm_model, cnn_model])
        hidden = Dense(1, activation='sigmoid')

        conc_model = Sequential()

        conc_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        history = conc_model.fit([X_train, X_train], [y_train, y_train], np.reshape([y_train, y_train], (25000, 2)), epochs=3, batch_size=(64, 64))
        return history

    • Jason Brownlee September 6, 2018 at 2:09 pm #

      Sorry, I don’t have the capacity to debug your new code, perhaps post it to stackoverflow?

  32. Emmanuel September 14, 2018 at 2:56 pm #

    Very Nice Tutorial!

    I adapted your version to fit my data saved in a pandas DataFrame. Since I have a limited data set, the evaluation score reached 100%, which is a great improvement compared to the 98% of my previous BOW model.

    You can find the complete code at my github:

    Thank you so much Jason!

    Kind Regards,

    • Jason Brownlee September 15, 2018 at 6:01 am #

      Well done!

    • sanjie December 3, 2018 at 1:40 am #

      hello Emmanuel,
      thanks for the updated version on your github, when making multi-classes fit, you should add the code

      after your code

      Thank Jason Brownlee again, i am really one of your fans.

  33. Astariul October 16, 2018 at 7:15 pm #

    Hi Jason,

    This is my first comment on this amazing blog of yours, even though I have read a lot of your articles, all of them useful.

    Anyway, I have a question about MaxPooling layers: why do we need them at all?

    Is it just for reducing the dimension? Because I feel that using it will make us lose some information we learned in the convolution layer.


    In my application I want to work with embeddings but with n-grams as well. If I have the sentence ‘I like you’, I want to end up with a tensor of dimension [?, 6, d] (d is the dimension of the embeddings). The tensor would represent:
    ‘I’
    ‘like’
    ‘you’
    ‘I like’
    ‘like you’
    ‘I like you’

    So I want to use the basic embeddings for the first 3 tokens, apply a 2-gram convolution layer to get the next 2 tokens, and finally a 3-gram convolution layer for the last token. Then I concatenate everything (choosing the kernel size adequately).

    In this case, why would I want to apply a MaxPooling layer?
    Do you think my approach could work?

    Thank you and keep up the good work 🙂

    • Jason Brownlee October 17, 2018 at 6:48 am #

      Yes, we use the pooling layer to distil the large feature maps down to the most essential.

      Try with and without the pooling layer and use the model that gives the best performance.

  34. Prem Kumar October 23, 2018 at 10:59 pm #

    Hi Jason,

    It’s very helpful for me to learn about NLP and its tasks. Thank you very much for your work. Please do the same for computer vision problems too. I can’t find blogs like yours anywhere for computer vision problems. Please consider it.

  35. prisilla December 31, 2018 at 5:00 am #

    Hi Jason,

    For one of my input values, when I run the CNN code
    the output I receive is:
    Epoch 1/100
    250/250 [==============================] - 77s 308ms/step - loss: -2.8596 - acc: 0.7843
    Epoch 2/100
    250/250 [==============================] - 76s 303ms/step - loss: -2.9239 - acc: 0.7863
    Epoch 3/100
    250/250 [==============================] - 75s 300ms/step - loss: -2.8824 - acc: 0.7904
    Epoch 4/100
    250/250 [==============================] - 74s 294ms/step - loss: -2.9322 - acc: 0.7851

    Do you think the model is correct, given that the loss has negative values?


    • Jason Brownlee December 31, 2018 at 6:15 am #

      Negative loss is interesting. Something odd might be going on.
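
One possible cause worth checking (an assumption, since the loss function and labels are not shown) is using binary cross-entropy with target values outside the range 0 to 1. The sketch below computes the loss for a single example by hand to show that it can only go negative in that case.

```python
from math import log


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one example, with y_pred clipped away from 0 and 1."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * log(p) + (1.0 - y_true) * log(1.0 - p))


print(binary_crossentropy(1, 0.9))  # label in {0, 1}: loss is non-negative
print(binary_crossentropy(0, 0.9))  # label in {0, 1}: loss is non-negative
print(binary_crossentropy(2, 0.9))  # label outside [0, 1]: loss can go negative
```

If the labels take values such as 1 and 2, or there are more than two classes, re-encoding them or switching to a categorical loss is the usual remedy.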

  36. Niko January 9, 2019 at 3:46 am #

    Hello Jason,

    I wanted to apply grid search on this model (3-channels) by following your other tutorials but I’m getting this error

    >ValueError: Found input variables with inconsistent numbers of samples: [3, 8000]

    This is the code:
    # grid search
    epochs = [1, 10]
    batch_size = [16, 32]
    param_grid = dict(epochs=epochs, batch_size=batch_size)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
    grid_result = grid.fit([trainX, trainX, trainX], array(trainLabels))

    I have tried googling but I didn’t find any answers for it.

    Thank you in advance!

    • Jason Brownlee January 9, 2019 at 8:48 am #

      Perhaps try running the grid search loops manually.
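
A minimal sketch of what running the grid search loops manually could look like. The `evaluate` argument is a hypothetical callable: in practice it would build the three-input model, fit it on `[trainX, trainX, trainX]` with the given parameters, and return a validation score. The toy function below is only a stand-in so the snippet runs on its own.

```python
from itertools import product


def manual_grid_search(param_grid, evaluate):
    """Call evaluate(**params) for every parameter combination and return the best."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_score, best_params, results


def toy_evaluate(epochs, batch_size):
    # Stand-in score, not a real model evaluation.
    return epochs - batch_size / 100.0


param_grid = {"epochs": [1, 10], "batch_size": [16, 32]}
best_score, best_params, results = manual_grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

Because the loop calls the model directly, it sidesteps the sample-count check that raises the ValueError when a list of input arrays is passed to the scikit-learn search classes.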

  37. Roi Ong February 1, 2019 at 12:39 am #

    Hi Jason! Does a prediction value close to either class mean that the model has higher confidence? Say it predicts 0.99 for sentence A and 0.65 for sentence B: does that mean the model predicts with higher confidence on sentence A than on sentence B? Or does it have anything to do with overfitting, i.e. a value like that for sentence A arises because the model is fitted too closely to the data, which may cause erroneous predictions in the future because it prioritizes sentences like sentence A?

    • Jason Brownlee February 1, 2019 at 5:39 am #

      A larger predicted probability could be interpreted as higher confidence in the prediction.

      • Roi Ong February 3, 2019 at 4:44 am #

        Thank you so much Jason!
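
As a small illustration of that interpretation, the sketch below turns a sigmoid output into a class label and a confidence value. The helper name and the 0.5 threshold are assumptions for illustration, and the raw output of a network is not necessarily a well-calibrated probability.

```python
def label_and_confidence(prob_positive, threshold=0.5):
    """Map a predicted probability of the positive class to (label, confidence)."""
    label = "positive" if prob_positive >= threshold else "negative"
    # Confidence is the probability assigned to whichever class was chosen.
    confidence = prob_positive if label == "positive" else 1.0 - prob_positive
    return label, confidence


for prob in (0.99, 0.65, 0.01):
    print(prob, label_and_confidence(prob))
```

Note that an output near 0 is just as confident as an output near 1; it is confidence in the negative class.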

  38. Asmaa M. Elmohamady February 1, 2019 at 11:15 pm #

    Can I use a shared input instead of using an input on every CNN channel?

    • Jason Brownlee February 2, 2019 at 6:17 am #

      What do you mean exactly?

      You have the data once in memory and provide multiple references to it.

  39. Asmaa M. Elmohamady February 5, 2019 at 12:17 am #

    sorry, I didn’t receive notification about your reply.
    I mean, if I use one input with one embedding, can I use it once with parallel convolutions of different kernel sizes, and is this also called multichannel or not?

    def define_model(length, vocab_size):
        # one shared input and one shared embedding
        inputs1 = Input(shape=(length,))
        embedding1 = Embedding(vocab_size, 100)(inputs1)

        # channel 1
        conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
        drop1 = Dropout(0.5)(conv1)
        pool1 = MaxPooling1D(pool_size=2)(drop1)
        flat1 = Flatten()(pool1)
        # channel 2
        conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding1)
        drop2 = Dropout(0.5)(conv2)
        pool2 = MaxPooling1D(pool_size=2)(drop2)
        flat2 = Flatten()(pool2)
        # channel 3
        conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding1)
        drop3 = Dropout(0.5)(conv3)
        pool3 = MaxPooling1D(pool_size=2)(drop3)
        flat3 = Flatten()(pool3)
        # merge
        merged = concatenate([flat1, flat2, flat3])
        # interpretation
        dense1 = Dense(10, activation='relu')(merged)
        outputs = Dense(1, activation='sigmoid')(dense1)
        # only inputs1 is defined, so it is the model's single input
        model = Model(inputs=inputs1, outputs=outputs)
        # compile
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # summarize
        plot_model(model, show_shapes=True, to_file='multichannel.png')
        return model

    • Jason Brownlee February 5, 2019 at 8:24 am #

      Yes, I should call this a multi-headed model and save “channel” to refer to the depth of a given input.

      • Asmaa M. Elmohamady February 5, 2019 at 10:37 pm #

        What do you mean by save “channel” to refer to the depth of a given input?

        • Jason Brownlee February 6, 2019 at 7:45 am #

          As in an input image has 3 channels, one each for red, green, and blue.

          And, the depth of a stack of feature maps is referred to as the channels.

  40. sanker February 13, 2019 at 7:50 pm #

    I am doing a project on “paraphrase detection using deep learning”.
    I have two inputs, which are two sentences. Each sentence needs to be trained on separately. How do I fit my model?

  41. Minel March 5, 2019 at 4:55 am #

    Hello Jason,
    There is a small typo at the beginning of the example:
    table = str.maketrans('', '', string.punctuation)
    but it is fixed in the complete example:
    table = str.maketrans('', '', punctuation)

    • Jason Brownlee March 5, 2019 at 6:41 am #

      I don’t believe so, are you sure?

      • Minel March 5, 2019 at 7:30 pm #

        Yes only in the first example

        # turn a doc into clean tokens
        def clean_doc(doc):
            # split into tokens by white space
            tokens = doc.split()
            # remove punctuation from each token
            table = str.maketrans('', '', string.punctuation)
            tokens = [w.translate(table) for w in tokens]

        • Jason Brownlee March 6, 2019 at 7:48 am #

          The first example in the section “Loading and Cleaning Reviews” imports the string module (“import string”) and then uses “string.punctuation”.

          It is correct Python code.

          Perhaps I misunderstand your comment?

          • Minel March 6, 2019 at 11:05 pm #

            Hello Jason

            Yes, you are right. I merged the code of two examples, which is why I got a traceback.
            Sorry about that.

          • Jason Brownlee March 7, 2019 at 6:50 am #

            No problem.

  42. Steve March 24, 2019 at 3:00 pm #

    Hi Jason,

    I’ve followed your article and it was really helpful.
    Right now I’m stuck: the model’s val_loss keeps increasing while the val_acc keeps increasing as well.

    I followed your article on reducing overfitting, but adding dropout layers didn’t work at all. I tried increasing the amount of training data, which in fact made the results even worse.

    I’ve posted a question explaining top to bottom about my problem in stackoverflow.
    It’ll be really helpful if you can take a look at the question.

    Here’s the link

    I’m looking forward to your answer.


    • Jason Brownlee March 25, 2019 at 6:41 am #

      Perhaps try using SGD, reducing the learning rate, and increasing the number of training epochs.

  43. saurabhk April 18, 2019 at 3:14 pm #

    CNNs would try learning the padded sentences directly, which would result in noisy learned representations. How do we ignore the padded values so they have no impact on CNN filter learning?

    • Jason Brownlee April 19, 2019 at 6:04 am #

      Typically CNNs ignore zero padded values, e.g. they use padding all the time as part of performing convolutions.

  44. SK April 19, 2019 at 10:12 pm #

    Hey Jason,

    Thanks for this easy-to-follow tutorial. I do not get any improvement with a multi-channel convnet compared to a single convnet with a 3-gram kernel and more filters (128 instead of 32) with GlobalMaxPooling. Do you have any ideas why that would be? Can you suggest any effective tweaks to improve a multi-label text classifier?

    Best regards

  45. SK April 25, 2019 at 9:37 pm #

    Hey Jason,

    Thanks for the very useful link.


  46. maunish June 10, 2019 at 7:04 pm #

    Hi Jason, great article.

    Can you write an article on the most common ways to approach a text-related machine learning problem?

  47. maunish June 14, 2019 at 8:03 pm #

    I wrote a fake review as

    “this was wonderfull movie
    this movie is amazing
    actors have acted very well and performance was outstanding
    i will watch this movie again”

    Then I used the same model to predict whether the review was good or bad.

    The result was 0.50316983, but I was expecting more as the review is straightforward.
    I used more positive words and avoided any confusing words.

    So what is the problem here?

    • Jason Brownlee June 15, 2019 at 6:32 am #

      Perhaps the model was overfit, you could try fitting it again?

  48. SP June 26, 2019 at 5:06 pm #

    Hi Jason,

    Thanks for your wonderful tutorial. One query: can it support multiple languages? I mean, if the dataset is in a language other than English, does it require any change to the word embedding? Or does it still work?


    • Jason Brownlee June 27, 2019 at 7:45 am #

      You can train your own word embedding for any language as far as I know.

  49. SP June 27, 2019 at 8:26 pm #

    Thank you, Jason, for your response. Have you used any word embedding (e.g. w2v/GloVe) in your code? I was only able to see the Keras tokenizer function (correct me if I am wrong). When I adapted your code to another language and a different dataset, it ran smoothly, so I asked whether it required any specific word embedding.


  50. Zeinab July 7, 2019 at 12:39 am #

    What is the difference between CNN and the multichannel CNN?

    Why do I need to use the multichannel CNN?

    • Jason Brownlee July 7, 2019 at 7:52 am #

      It can use a separate kernel size on the same input data and effectively “see” the data at different resolutions at once.

  51. zeinab July 7, 2019 at 11:08 pm #

    Below is a function for calculating the correlation coefficient. I use it to measure accuracy for a regression problem (text similarity). I use the LSTM and the multichannel CNN; however, the correlation comes out with negative values starting from the second epoch.
    Can you help me and check the correctness of this function?

  52. zeinab July 7, 2019 at 11:10 pm #

  53. zeinab July 8, 2019 at 12:18 am #

    When I use the correlation coefficient as a metric,
    the resulting correlation coefficient values range from -1 to -85.

    Do the resulting values mean that there is a problem in the model?
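
As a reference point, a correctly computed Pearson correlation coefficient always lies between -1 and 1, so a value such as -85 suggests the metric function is computing something else. The commenter’s function is not shown, so the following is only a plain-Python sketch of the standard formula that can be used to cross-check a custom metric on a small sample.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly anti-correlated
print(pearson_r([1, 2, 3, 4], [1, 3, 2, 5]))  # somewhere in between
```

Comparing this against the custom metric on the same predictions and targets, outside of training, is a quick way to check the metric.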

  54. zeinab July 8, 2019 at 3:28 am #

    Can the Pearson correlation coefficient and the loss decrease at the same time? What does this mean?

    I think the loss should decrease while the correlation coefficient should increase during training?

  55. zeinab July 8, 2019 at 3:50 am #

    I ran the multichannel CNN model on a text similarity problem.

    Every time I run the model, I get different results (loss, accuracy).

    Is it normal to get different loss and accuracy results every time I run the code?

Leave a Reply