How to Develop a Deep Learning Photo Caption Generator from Scratch

Develop a Deep Learning Model to Automatically
Describe Photographs in Python with Keras, Step-by-Step.

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

After completing this tutorial, you will know:

  • How to prepare photo and text data for training a deep learning model.
  • How to design and train a deep learning caption generation model.
  • How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Let’s get started.

  • Update Nov/2017: Added note about a bug introduced in Keras 2.1.0 and 2.1.1 that impacts the code in this tutorial.
  • Update Dec/2017: Updated a typo in the function name when explaining how to save descriptions to file, thanks Minel.
  • Update Apr/2018: Added a new section that shows how to train the model using progressive loading for workstations with minimum RAM.

How to Develop a Deep Learning Caption Generation Model in Python from Scratch
Photo by Living in Monrovia, some rights reserved.

Tutorial Overview

This tutorial is divided into 7 parts; they are:

  1. Photo and Caption Dataset
  2. Prepare Photo Data
  3. Prepare Text Data
  4. Develop Deep Learning Model
  5. Train With Progressive Loading (NEW)
  6. Evaluate Model
  7. Generate New Captions

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.1.5 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this tutorial:

I recommend running the code on a system with a GPU. You can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

Let’s dive in.

Photo and Caption Dataset

A good dataset to use when getting started with image captioning is the Flickr8K dataset.

It is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The definitive description of the dataset is in the paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics” from 2013.

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: “Please do not redistribute the dataset”.

You can use the link below to request the dataset:

Within a short time, you will receive an email that contains links to two files:

  • (1 Gigabyte) An archive of all photographs.
  • (2.2 Megabytes) An archive of all text descriptions for photographs.

Download the datasets and unzip them into your current working directory. You will have two directories:

  • Flicker8k_Dataset: Contains 8092 photographs in JPEG format.
  • Flickr8k_text: Contains a number of files containing different sources of descriptions for the photographs.

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ballpark BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper “Where to put the Image in an Image Caption Generator”):

  • BLEU-1: 0.401 to 0.578.
  • BLEU-2: 0.176 to 0.390.
  • BLEU-3: 0.099 to 0.260.
  • BLEU-4: 0.059 to 0.170.

We describe the BLEU metric in more detail later, when we evaluate our model.

Next, let’s look at how to load the images.

Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos.

There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that won the ImageNet competition in 2014. Learn more about the model here:

Keras provides this pre-trained model directly. Note, the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the “photo features” using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. The result is no different from running the photo through the full VGG model each time; we will simply have done it once in advance.

This is an optimization that will make training our models faster and consume less memory.

We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo into the preferred size for the model (e.g. 3 channel 224 x 224 pixel image).

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named ‘features.pkl‘.

The complete example is listed below.

Running this data preparation step may take a while depending on your hardware, perhaps one hour on the CPU with a modern workstation.

At the end of the run, you will have the extracted features stored in ‘features.pkl‘ for later use. This file will be about 127 Megabytes in size.

Prepare Text Data

The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.

First, we will load the file containing all of the descriptions.

Each photo has a unique identifier. This identifier is used on the photo filename and in the text file of descriptions.

Next, we will step through the list of photo descriptions. The function load_descriptions() below, given the loaded document text, returns a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.
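A minimal sketch of such a function is shown here. It assumes each line of the descriptions file starts with the photo filename (possibly followed by a suffix such as #0) and then the description text, separated by whitespace. That layout is an assumption, so check it against your copy of the file. The two sample lines are made up.

```python
def load_descriptions(doc):
    """Map each photo identifier to its list of descriptions."""
    mapping = {}
    for line in doc.split('\n'):
        tokens = line.split()
        # Skip blank lines and lines with an identifier but no text.
        if len(tokens) < 2:
            continue
        # The first token names the image, the rest is the description.
        image_name, words = tokens[0], tokens[1:]
        # Keep only the part before the first '.', e.g. drop '.jpg#0'.
        image_id = image_name.split('.')[0]
        mapping.setdefault(image_id, []).append(' '.join(words))
    return mapping


# Two made-up lines in the assumed format.
sample = ('1000_abc.jpg#0\tA dog runs on the beach .\n'
          '1000_abc.jpg#1\tTwo dogs play in the sand .')
descriptions = load_descriptions(sample)
print(descriptions)
```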

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

  • Convert all words to lowercase.
  • Remove all punctuation.
  • Remove all words that are one character or less in length (e.g. ‘a’).
  • Remove all words with numbers in them.

The clean_descriptions() function below, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.
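The four cleaning steps can be sketched as follows. This is one possible implementation, not necessarily the exact one used for the reported results; the sample descriptions are made up.

```python
import string


def clean_descriptions(descriptions):
    """Clean every description in place."""
    # Translation table that deletes all ASCII punctuation characters.
    table = str.maketrans('', '', string.punctuation)
    for desc_list in descriptions.values():
        for i, desc in enumerate(desc_list):
            words = desc.split()
            # Convert to lowercase.
            words = [w.lower() for w in words]
            # Remove punctuation from each word.
            words = [w.translate(table) for w in words]
            # Drop words of one character or less, such as 'a'.
            words = [w for w in words if len(w) > 1]
            # Keep purely alphabetic words, which drops words with digits.
            words = [w for w in words if w.isalpha()]
            desc_list[i] = ' '.join(words)


descriptions = {'1000_abc': ['A dog runs on the beach .',
                             'The 2nd dog chases a ball !']}
clean_descriptions(descriptions)
print(descriptions)
```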

Once cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that will train faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.
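A short sketch of that summary, using a hypothetical helper named to_vocabulary() and made-up descriptions:

```python
def to_vocabulary(descriptions):
    """Collect the set of distinct words across all descriptions."""
    vocabulary = set()
    for desc_list in descriptions.values():
        for desc in desc_list:
            vocabulary.update(desc.split())
    return vocabulary


descriptions = {'1000_abc': ['dog runs on the beach', 'the dog chases ball'],
                '2000_def': ['girl runs']}
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
```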

Finally, we can save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.

The save_descriptions() function below, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves the mapping to file.
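A minimal sketch, assuming a single space separates the identifier from the description on each line. The demonstration writes to a temporary directory rather than your working directory.

```python
import os
import tempfile


def save_descriptions(descriptions, filename):
    """Write one 'identifier description' pair per line."""
    lines = []
    for image_id, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(image_id + ' ' + desc)
    with open(filename, 'w') as handle:
        handle.write('\n'.join(lines))


descriptions = {'1000_abc': ['dog runs on the beach', 'the dog chases ball']}
filename = os.path.join(tempfile.mkdtemp(), 'descriptions.txt')
save_descriptions(descriptions, filename)
with open(filename) as handle:
    saved = handle.read()
print(saved)
```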

Putting this all together, the complete listing is provided below.

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).

Finally, the clean descriptions are written to ‘descriptions.txt‘.

Taking a look at the file, we can see that the descriptions are ready for modeling. The order of descriptions in your file may vary.

Develop Deep Learning Model

In this section, we will define the deep learning model and fit it on the training dataset.

This section is divided into the following parts:

  1. Loading Data.
  2. Defining the Model.
  3. Fitting the Model.
  4. Complete Example.

Loading Data

First, we must load the prepared photo and text data so that we can use it to fit the model.

We are going to train the model on all of the photos and captions in the training dataset. While training, we are going to monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.

The train and development datasets have been predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, both of which contain lists of photo file names. From these file names, we can extract the photo identifiers and use these identifiers to filter photos and descriptions for each set.

The function load_set() below will load a pre-defined set of identifiers given the filename of the train or development set.
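A minimal sketch, assuming the file lists one photo filename per line. The demonstration uses a small made-up file in a temporary directory instead of the real Flickr_8k.trainImages.txt.

```python
import os
import tempfile


def load_set(filename):
    """Return the set of photo identifiers listed in a file."""
    dataset = set()
    with open(filename, 'r') as handle:
        for line in handle:
            line = line.strip()
            # Skip empty lines.
            if not line:
                continue
            # The identifier is the filename without its extension.
            dataset.add(line.split('.')[0])
    return dataset


filename = os.path.join(tempfile.mkdtemp(), 'trainImages.txt')
with open(filename, 'w') as handle:
    handle.write('1000_abc.jpg\n2000_def.jpg\n\n')
train = load_set(filename)
print('Dataset: %d' % len(train))
```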

Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.

Below is the function load_clean_descriptions() that loads the cleaned text descriptions from ‘descriptions.txt‘ for a given set of identifiers and returns a dictionary of identifiers to lists of text descriptions.

The model we will develop will generate a caption given a photo, and the caption will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we will need a ‘first word’ to kick-off the generation process and a ‘last word‘ to signal the end of the caption.

We will use the strings ‘startseq‘ and ‘endseq‘ for this purpose. These tokens are added to the loaded descriptions as they are loaded. It is important to do this now before we encode the text so that the tokens are also encoded correctly.
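Loading and wrapping the descriptions can be sketched as follows, assuming the one-pair-per-line layout of ‘descriptions.txt‘ described earlier. The demonstration file is made up.

```python
import os
import tempfile


def load_clean_descriptions(filename, dataset):
    """Load descriptions for the given identifiers, wrapped in start/end tokens."""
    descriptions = {}
    with open(filename, 'r') as handle:
        for line in handle:
            tokens = line.split()
            if len(tokens) < 2:
                continue
            image_id, words = tokens[0], tokens[1:]
            # Skip photos that are not in the requested set.
            if image_id not in dataset:
                continue
            # Add the tokens that mark the start and end of a caption.
            desc = 'startseq ' + ' '.join(words) + ' endseq'
            descriptions.setdefault(image_id, []).append(desc)
    return descriptions


filename = os.path.join(tempfile.mkdtemp(), 'descriptions.txt')
with open(filename, 'w') as handle:
    handle.write('1000_abc dog runs on the beach\n'
                 '1000_abc the dog chases ball\n'
                 '2000_def girl runs\n')
train_descriptions = load_clean_descriptions(filename, {'1000_abc'})
print(train_descriptions)
```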

Next, we can load the photo features for a given dataset.

The function load_photo_features() below loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers.

This is not very efficient; nevertheless, this will get us up and running quickly.
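A minimal sketch, assuming the pickle file holds a single dictionary of identifier to features, as saved during photo preparation. The demonstration pickles a tiny made-up dictionary in place of the real ‘features.pkl‘.

```python
import os
import tempfile
from pickle import dump, load


def load_photo_features(filename, dataset):
    """Load all photo features, then keep only the requested identifiers."""
    with open(filename, 'rb') as handle:
        all_features = load(handle)
    return {image_id: all_features[image_id] for image_id in dataset}


filename = os.path.join(tempfile.mkdtemp(), 'features.pkl')
with open(filename, 'wb') as handle:
    dump({'1000_abc': [0.1, 0.2], '2000_def': [0.3, 0.4]}, handle)
train_features = load_photo_features(filename, {'2000_def'})
print('Photos: %d' % len(train_features))
```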

We can pause here and test everything developed so far.

The complete code example is listed below.

Running this example first loads the 6,000 photo identifiers in the training dataset. These identifiers are then used to filter and load the cleaned description text and the pre-computed photo features.

We are nearly there.

The description text will need to be encoded to numbers before it can be presented to the model as input or compared to the model’s predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data.

The to_lines() function below converts the dictionary of descriptions into a list of strings, and the create_tokenizer() function fits a Tokenizer given the loaded photo description text.

We can now encode the text.

Each description will be split into words. The model will be provided with one word and the photo, and will generate the next word. Then the first two words of the description will be provided to the model as input with the image to generate the next word. This is how the model will be trained.

For example, the input sequence “little girl running in field” would be split into 6 input-output pairs to train the model:
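The six pairs can be enumerated with a few lines of Python. The helper name to_pairs() is hypothetical, and the photo is represented only by the word "photo" in the printout.

```python
def to_pairs(description):
    """Split a description into (words so far, next word) pairs."""
    words = description.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = to_pairs('startseq little girl running in field endseq')
for context, target in pairs:
    # Each training sample is: photo + words so far -> next word.
    print('photo + [%s] -> %s' % (', '.join(context), target))
```

The first pair maps the photo plus startseq to little, and the last maps the photo plus the whole description to endseq.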

Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.

The function below named create_sequences(), given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, will transform the data into input-output pairs of data for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model which is the encoded next word in the text sequence.

The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which will be a probability distribution over all words in the vocabulary.

The output data will therefore be a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1.
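The sketch below shows the shape of this transform using plain NumPy. It takes a word-to-integer dictionary (here called word_index) as a stand-in for the fitted Tokenizer, and it pads on the left with zeros; both are assumptions made to keep the example small. Keras helpers could do the padding and one-hot steps instead.

```python
import numpy as np


def create_sequences(word_index, max_length, descriptions, photos):
    """Build photo, padded text and one-hot next-word arrays."""
    # Index 0 is reserved for padding, so the vocabulary has one extra slot.
    vocab_size = len(word_index) + 1
    X1, X2, y = [], [], []
    for image_id, desc_list in descriptions.items():
        for desc in desc_list:
            # Encode the description as a list of integers.
            seq = [word_index[w] for w in desc.split() if w in word_index]
            for i in range(1, len(seq)):
                in_seq, out_word = seq[:i][-max_length:], seq[i]
                # Left-pad the input sequence with zeros to a fixed length.
                padded = np.zeros(max_length, dtype='int32')
                padded[max_length - len(in_seq):] = in_seq
                # One-hot encode the next word.
                one_hot = np.zeros(vocab_size, dtype='float32')
                one_hot[out_word] = 1.0
                X1.append(photos[image_id])
                X2.append(padded)
                y.append(one_hot)
    return np.array(X1), np.array(X2), np.array(y)


# Toy data: a 3-element vector stands in for the 4,096 element photo features.
word_index = {'startseq': 1, 'dog': 2, 'runs': 3, 'endseq': 4}
descriptions = {'p1': ['startseq dog runs endseq']}
photos = {'p1': np.ones(3)}
X1, X2, y = create_sequences(word_index, 4, descriptions, photos)
print(X1.shape, X2.shape, y.shape)
```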

We will need to calculate the maximum number of words in the longest description. A short helper function named max_length() is defined below.
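A sketch of the helper, together with a to_lines() function that flattens the dictionary into a list of strings:

```python
def to_lines(descriptions):
    """Flatten the dictionary of descriptions into a list of strings."""
    return [desc for desc_list in descriptions.values() for desc in desc_list]


def max_length(descriptions):
    """Return the number of words in the longest description."""
    return max(len(desc.split()) for desc in to_lines(descriptions))


descriptions = {'p1': ['startseq dog runs endseq', 'startseq dog endseq']}
longest = max_length(descriptions)
print('Description Length: %d' % longest)
```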

We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.

Defining the Model

We will define a deep learning model based on the “merge-model” described by Marc Tanti, et al. in their 2017 papers:

The authors provide a nice schematic of the model, reproduced below.

Schematic of the Merge Model For Image Captioning

We will describe the model in three parts:

  • Photo Feature Extractor. This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
  • Sequence Processor. This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
  • Decoder (for lack of a better name). Both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256 element representation of the photo.

The Sequence Processor model expects input sequences with a pre-defined length (34 words) which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.

Both the input models produce a 256 element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting the training dataset, as this model configuration learns very fast.

The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense 256 neuron layer and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.

The function below named define_model() defines and returns the model ready to be fit.
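One way to express the architecture described above with the Keras functional API is sketched here. The layer sizes, dropout rate, masking and addition merge come from the description; the relu activations, the categorical cross-entropy loss and the adam optimizer are assumptions, as are the exact import paths for your Keras release.

```python
from keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from keras.models import Model


def define_model(vocab_size, max_length):
    """Build the merge model: photo features + text sequence -> next word."""
    # Photo feature extractor: 4,096 element vector -> 256 element vector.
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence processor: padded integer sequence -> 256 element vector.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: add the two vectors, then predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # Loss and optimizer are assumptions, chosen to suit one-hot targets.
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model


model = define_model(vocab_size=7579, max_length=34)
model.summary()
```

A vocabulary size of 7,579 and a maximum length of 34 are used here as illustrative values; your own values depend on the prepared text.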

To get a sense for the structure of the model, specifically the shapes of the layers, see the summary listed below.

We also create a plot to visualize the structure of the network, which makes the two streams of input easier to understand.

Plot of the Caption Generation Deep Learning Model

Fitting the Model

Now that we know how to define the model, we can fit it on the training dataset.

The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.

At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.

We can do this by defining a ModelCheckpoint in Keras and specifying it to monitor the minimum loss on the validation dataset and save the model to a file that has both the training and validation loss in the filename.

We can then specify the checkpoint in the call to fit() via the callbacks argument. We must also specify the development dataset in fit() via the validation_data argument.
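A sketch of such a checkpoint is shown below. The filename template is an assumption, chosen so that it embeds the epoch, the training loss and the validation loss. The call to fit() is left as a comment because it needs the prepared arrays, whose names here (X1train, X2train and so on) are illustrative.

```python
from keras.callbacks import ModelCheckpoint

# Template filled in by Keras at the end of each epoch.
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
# Save the whole model only when the validation loss reaches a new minimum.
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min')

# Hypothetical training call, assuming the arrays were prepared beforehand:
# model.fit([X1train, X2train], ytrain, epochs=20, verbose=2,
#           callbacks=[checkpoint],
#           validation_data=([X1test, X2test], ytest))

# What a saved filename would look like for example values.
example_name = filepath.format(epoch=2, loss=3.245, val_loss=3.612)
print(example_name)
```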

We will only fit the model for 20 epochs, but given the amount of training data, each epoch may take 30 minutes on modern hardware.

Complete Example

The complete example for fitting the model on the training data is listed below.

Running the example first prints a summary of the loaded training and development datasets.

After the summary of the model, we can get an idea of the total number of training and validation (development) input-output pairs.

The model then runs, saving the best model to .h5 files along the way.

On my run, the best validation results were saved to the file:

  • model-ep002-loss3.245-val_loss3.612.h5

This model was saved at the end of epoch 2 with a loss of 3.245 on the training dataset and a loss of 3.612 on the development dataset.

Your specific results will vary.

Let me know what you get in the comments below.

If you ran the example on AWS, copy the model file back to your current working directory. If you need help with commands on AWS, see the post:

Did you get an error like:

If so, see the next section.

Train With Progressive Loading

Note: If you had no problems in the previous section, please skip this section. This section is for those who do not have enough memory to train the model as described in the previous section (e.g. cannot use AWS EC2 for whatever reason).

The training of the caption model does assume you have a lot of RAM.

The code in the previous section is not memory efficient and assumes you are running on a large EC2 instance with 32GB or 64GB of RAM. If you are running the code on a workstation with 8GB of RAM, you will likely not be able to train the model.

A workaround is to use progressive loading. This was discussed in detail in the second-last section titled “Progressive Loading” in the post:

I recommend reading that section before continuing.

If you want to use progressive loading to train this model, this section will show you how.

The first step is we must define a function that we can use as the data generator.

We will keep things very simple and have the data generator yield one photo’s worth of data per batch. This will be all of the sequences generated for a photo and its set of descriptions.

The function below, data_generator(), will be the data generator and will take the loaded textual descriptions, photo features, tokenizer and max length. Here, I assume that you can fit this training data in memory, which I believe 8GB of RAM should be more than capable of.

How does this work? Read the post I just mentioned above that introduces data generators.

You can see that we are calling the create_sequences() function to create a batch worth of data for a single photo rather than an entire dataset. This means that we must update the create_sequences() function to delete the “iterate over all descriptions” for-loop.

The updated function is as follows:
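A sketch of the per-photo create_sequences() and a generator that uses it follows. As a simplification it takes a word-to-integer dictionary (word_index) in place of the fitted Tokenizer and pads with plain NumPy; those choices and the toy data are assumptions. The last lines call the generator directly as a quick sanity check.

```python
import numpy as np


def create_sequences(word_index, max_length, desc_list, photo):
    """Build the training arrays for a single photo and its descriptions."""
    vocab_size = len(word_index) + 1  # index 0 is reserved for padding
    X1, X2, y = [], [], []
    for desc in desc_list:
        seq = [word_index[w] for w in desc.split() if w in word_index]
        for i in range(1, len(seq)):
            in_seq = seq[:i][-max_length:]
            # Left-pad the words so far to a fixed length.
            padded = np.zeros(max_length, dtype='int32')
            padded[max_length - len(in_seq):] = in_seq
            # One-hot encode the next word.
            one_hot = np.zeros(vocab_size, dtype='float32')
            one_hot[seq[i]] = 1.0
            X1.append(photo)
            X2.append(padded)
            y.append(one_hot)
    return np.array(X1), np.array(X2), np.array(y)


def data_generator(descriptions, photos, word_index, max_length):
    """Yield one photo's worth of samples per batch, looping forever."""
    while True:
        for image_id, desc_list in descriptions.items():
            in_img, in_seq, out_word = create_sequences(
                word_index, max_length, desc_list, photos[image_id])
            yield [in_img, in_seq], out_word


# Sanity check on toy data: one photo with two descriptions.
word_index = {'startseq': 1, 'dog': 2, 'runs': 3, 'endseq': 4}
descriptions = {'p1': ['startseq dog runs endseq', 'startseq dog endseq']}
photos = {'p1': np.ones(3)}
generator = data_generator(descriptions, photos, word_index, 4)
inputs, outputs = next(generator)
print(inputs[0].shape, inputs[1].shape, outputs.shape)
```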

We now have pretty much everything we need.

Note, this is a very basic data generator. The big memory saving it offers is that the unrolled sequences of train and test data are not held in memory prior to fitting the model; instead, these samples (e.g. results from create_sequences()) are created as needed per photo.

Some off-the-cuff ideas for further improving this data generator include:

  • Randomize the order of photos each epoch.
  • Work with a list of photo ids and load text and photo data as needed to cut even further back on memory.
  • Yield more than one photo’s worth of samples per batch.

I have experimented with these variations myself in the past. Let me know if you do and how you go in the comments.

You can sanity check a data generator by calling it directly, as follows:

Running this sanity check will show what one batch worth of sequences looks like, in this case 47 samples to train on for the first photo.

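Finally, we can use the fit_generator() function on the model to train the model with this data generator. In more recent Keras releases, fit() accepts a generator directly and fit_generator() is deprecated.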

In this simple example we will discard the loading of the development dataset and model checkpointing and simply save the model after each training epoch. You can then go back and load/evaluate each saved model after training to find the one with the lowest loss that you can then use in the next section.

The code to train the model with the data generator is as follows:

That’s it. You can now train the model using progressive loading and save a ton of RAM. This may also be a lot slower.

The complete updated example with progressive loading (use of the data generator) for training the caption generation model is listed below.

Did you use this new addition to the tutorial?

How did you go?

Evaluate Model

Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.

We will evaluate a model by generating descriptions for all photos in the test dataset and evaluating those predictions with a standard cost function.

First, we need to be able to generate a description for a photo using a trained model.

This involves passing in the start description token ‘startseq‘, generating one word, then calling the model recursively with generated words as input until the end of sequence token ‘endseq‘ is reached or the maximum description length is reached.

The function below named generate_desc() implements this behavior and generates a textual description given a trained model, and a given prepared photo as input. It calls the function word_for_id() in order to map an integer prediction back to a word.
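The generation loop can be sketched as follows. It picks the most probable word at each step (a greedy choice, which is an assumption) and uses a word-to-integer dictionary in place of the fitted Tokenizer. The EchoModel class is a made-up stand-in for a trained model so the loop can be run without one.

```python
import numpy as np


def word_for_id(integer, word_index):
    """Map an integer prediction back to its word, or None if unknown."""
    for word, index in word_index.items():
        if index == integer:
            return word
    return None


def generate_desc(model, word_index, photo, max_length):
    """Generate a description one word at a time."""
    in_text = 'startseq'
    for _ in range(max_length):
        # Encode and left-pad the words generated so far.
        seq = [word_index[w] for w in in_text.split() if w in word_index]
        seq = seq[-max_length:]
        padded = np.zeros((1, max_length), dtype='int32')
        padded[0, max_length - len(seq):] = seq
        # Predict a distribution over the vocabulary and take the best word.
        yhat = model.predict([photo, padded], verbose=0)
        word = word_for_id(int(np.argmax(yhat)), word_index)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text


class EchoModel:
    """Stand-in model: always predicts the index after the last input word."""

    def predict(self, inputs, verbose=0):
        _, padded = inputs
        last = int(padded[0, -1])
        yhat = np.zeros((1, 5))
        yhat[0, min(last + 1, 4)] = 1.0
        return yhat


word_index = {'startseq': 1, 'dog': 2, 'runs': 3, 'endseq': 4}
caption = generate_desc(EchoModel(), word_index, np.ones((1, 3)), max_length=6)
print(caption)
```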

We will generate predictions for all photos in the test dataset and in the train dataset.

The function below named evaluate_model() will evaluate a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score that summarizes how close the generated text is to the expected text.

BLEU scores are used in text translation for evaluating translated text against one or more reference translations.

Here, we compare each generated description against all of the reference descriptions for the photograph. We then calculate BLEU scores for 1, 2, 3 and 4 cumulative n-grams.

You can learn more about the BLEU score here:

The NLTK Python library implements the BLEU score calculation in the corpus_bleu() function. A higher score close to 1.0 is better, a score closer to zero is worse.
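A small sketch of the calculation on made-up data. Each photo contributes a list of tokenized reference descriptions and one tokenized prediction. The weight tuples are one common way to express cumulative 1-gram to 4-gram scores and are an assumption about the exact configuration.

```python
from nltk.translate.bleu_score import corpus_bleu

# One entry per photo: all reference descriptions, split into words.
actual = [[['dog', 'runs', 'on', 'the', 'beach'],
           ['a', 'dog', 'is', 'running', 'on', 'sand']]]
# One entry per photo: the generated description, split into words.
predicted = [['dog', 'runs', 'on', 'the', 'beach']]

# Cumulative n-gram weights for BLEU-1 to BLEU-4.
weights = {
    'BLEU-1': (1.0, 0, 0, 0),
    'BLEU-2': (0.5, 0.5, 0, 0),
    'BLEU-3': (1 / 3, 1 / 3, 1 / 3, 0),
    'BLEU-4': (0.25, 0.25, 0.25, 0.25),
}
scores = {name: corpus_bleu(actual, predicted, weights=w)
          for name, w in weights.items()}
for name, score in scores.items():
    print('%s: %f' % (name, score))
```

Because the prediction matches one reference exactly, every score in this toy case is 1.0; real model output will score much lower.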

We can put all of this together with the functions from the previous section for loading the data. We first need to load the training dataset in order to prepare a Tokenizer so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words using exactly the same encoding scheme as was used when training the model.

We then use these functions for loading the test dataset.

The complete example is listed below.

Running the example prints the BLEU scores.

We can see that the scores fit within and close to the top of the expected range of a skillful model on the problem. The chosen model configuration is by no means optimized.

Generate New Captions

Now that we know how to develop and evaluate a caption generation model, how can we use it?

Almost everything we need to generate captions for entirely new photographs is in the model file.

We also need the Tokenizer for encoding generated words for the model while generating a sequence, and the maximum length of input sequences, used when we defined the model (e.g. 34).

We can hard code the maximum sequence length. With the encoding of text, we can create the tokenizer and save it to a file so that we can load it quickly whenever we need it without needing the entire Flickr8K dataset. An alternative would be to use our own vocabulary file and mapping to integers function during training.

We can create the Tokenizer as before and save it as a pickle file tokenizer.pkl. The complete example is listed below.

We can now load the tokenizer whenever we need it without having to load the entire training dataset of annotations.
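Saving and restoring with pickle can be sketched as follows. A small dictionary stands in for the fitted Tokenizer here, and the file is written to a temporary directory; both are assumptions for the sake of a self-contained example.

```python
import os
import tempfile
from pickle import dump, load

# Stand-in for the fitted Tokenizer object.
tokenizer = {'startseq': 1, 'dog': 2, 'endseq': 3}

path = os.path.join(tempfile.mkdtemp(), 'tokenizer.pkl')
# Save once, after preparing the training text.
with open(path, 'wb') as handle:
    dump(tokenizer, handle)
# Load later, without touching the training dataset.
with open(path, 'rb') as handle:
    restored = load(handle)
print(restored)
```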

Now, let’s generate a description for a new photograph.

Below is a new photograph that I chose randomly on Flickr (available under a permissive license).

Photo of a dog at the beach.
Photo by bambe1964, some rights reserved.

We will generate a description for it using our model.

Download the photograph and save it to your local directory with the filename “example.jpg“.

First, we must load the Tokenizer from tokenizer.pkl and define the maximum length of the sequence to generate, needed for padding inputs.

Then we must load the model, as before.

Next, we must load the photo we wish to describe and extract the features.

We could do this by re-defining the model and adding the VGG-16 model to it, or we can use the VGG model to predict the features and use them as inputs to our existing model. We will do the latter and use a modified version of the extract_features() function used during data preparation, but adapted to work on a single photo.

We can then generate a description using the generate_desc() function defined when evaluating the model.

The complete example for generating a description for an entirely new standalone photograph is listed below.

In this case, the description generated was as follows:

You could remove the start and end tokens and you would have the basis for a nice automatic photo captioning model.

It’s like living in the future guys!

It still completely blows my mind that we can do this. Wow.


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Alternate Pre-Trained Photo Models. A small 16-layer VGG model was used for feature extraction. Consider exploring larger models that offer better performance on the ImageNet dataset, such as Inception.
  • Smaller Vocabulary. A larger vocabulary of nearly eight thousand words was used in the development of the model. Many of the words supported may be misspellings or only used once in the entire dataset. Refine the vocabulary and reduce the size, perhaps by half.
  • Pre-trained Word Vectors. The model learned the word vectors as part of fitting the model. Better performance may be achieved by using word vectors either pre-trained on the training dataset or trained on a much larger corpus of text, such as news articles or Wikipedia.
  • Tune Model. The configuration of the model was not tuned on the problem. Explore alternate configurations and see if you can achieve better performance.

Did you try any of these extensions? Share your results in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Caption Generation Papers

Flickr8K Dataset



Summary

In this tutorial, you discovered how to develop a photo captioning deep learning model from scratch.

Specifically, you learned:

  • How to prepare photo and text data ready for training a deep learning model.
  • How to design and train a deep learning caption generation model.
  • How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

203 Responses to How to Develop a Deep Learning Photo Caption Generator from Scratch

  1. Christian Beckmann November 28, 2017 at 3:21 am #

    Hi Jason,

    thanks for this great article about image caption!

    My results after training were a bit worse (loss 3.566 – val_loss 3.859, then started to overfit) so i decided to try keras.applications.inception_v3.InceptionV3 for the base model. Currently it is still running and i am curious to see if it will do better.

  2. Akash November 30, 2017 at 4:56 am #

    Hi Jason,
    Once again great Article.
    I ran into some error while executing the code under “Complete example ” section.
    The error I got was
    ValueError: Error when checking target: expected dense_3 to have shape (None, 7579) but got array with shape (306404, 1)
    Any idea how to fix this?

    • Jason Brownlee November 30, 2017 at 8:26 am #

      Hi Akash, nice catch.

      The fault appears to have been introduced in a recent version of Keras in the to_categorical() function. I can confirm the fault occurs with Keras 2.1.1.

      You can learn more about the fault here:

      There are two options:

      1. Downgrade Keras to 2.0.8


      2. Modify the code, change line 104 in the training code example from:


      I hope that helps.

      • Akash November 30, 2017 at 5:38 pm #

        Thanks Jason. It’s working now.
        Can you suggest the changes to be made to use Inception model and word embedding like word2vec.

  3. Zoltan November 30, 2017 at 11:47 pm #

    Hi Jason,

    Big thumbs up, nicely written, really informative article. I especially like the step by step approach.

    But when I tried to go through it, I got an error in load_photo_features saying that “name ‘load’ not defined”. Which is kinda odd.

    Otherwise everything seems fine.

    • Jason Brownlee December 1, 2017 at 7:35 am #


      Perhaps double check you have the load function imported from pickle?

  4. Bikram Kachari December 1, 2017 at 4:59 pm #

    Hi Jason

    I am a regular follower of your tutorials. They are great. I got to learn a lot. Thank you so much. Please keep up the good work

  5. maibam December 1, 2017 at 7:05 pm #

    Layer (type) Output Shape Param # Connected to
    input_2 (InputLayer) (None, 34) 0
    input_1 (InputLayer) (None, 4096) 0
    embedding_1 (Embedding) (None, 34, 256) 1940224 input_2[0][0]
    dropout_1 (Dropout) (None, 4096) 0 input_1[0][0]
    dropout_2 (Dropout) (None, 34, 256) 0 embedding_1[0][0]
    dense_1 (Dense) (None, 256) 1048832 dropout_1[0][0]
    lstm_1 (LSTM) (None, 256) 525312 dropout_2[0][0]
    add_1 (Add) (None, 256) 0 dense_1[0][0]
    dense_2 (Dense) (None, 256) 65792 add_1[0][0]
    dense_3 (Dense) (None, 7579) 1947803 dense_2[0][0]
    Total params: 5,527,963
    Trainable params: 5,527,963
    Non-trainable params: 0

    ValueError: Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (306404, 7, 7, 512)

    Getting error during[X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

    Keras 2.0.8 with tensorflow
    what is wrong ?

    • Jason Brownlee December 2, 2017 at 8:51 am #

      Not sure, did you copy all of the code exactly?

      Is your numpy and tensorflow also up to date?

      • Christian January 16, 2018 at 10:09 pm #

        It looks like he changed the network used for feature extraction. When using include_top=False and weights=’imagenet’ you get this type of data structure.

  6. Vik December 2, 2017 at 7:16 pm #

    Thank you for the article. It is great to see full pipeline.
    Always following your articles with admiration

  7. Gonzalo Gasca Meza December 4, 2017 at 10:42 am #

    In the prepare data section, if using Python 2.7 there is no str.maketrans method.
    To make this work just comment that line and in line 46 do this:
    desc = [w.translate(None, string.punctuation) for w in desc]
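For readers who need the cleaning step to behave the same on Python 2.7 and Python 3, here is a minimal sketch that avoids both str.maketrans and the two-argument translate. The helper name strip_punctuation is hypothetical and is not part of the tutorial code.

```python
import string

def strip_punctuation(words):
    """Remove ASCII punctuation characters from each word in a list."""
    punctuation = set(string.punctuation)
    # Rebuild each word from the characters that are not punctuation.
    return [''.join(ch for ch in w if ch not in punctuation) for w in words]

desc = ['A', 'dog,', "isn't", 'running!']
print(strip_punctuation(desc))
```

This is slower than translate on large inputs, but it does not depend on which Python version is installed.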

    • Jason Brownlee December 4, 2017 at 4:57 pm #

      Thanks Gonzalo!

    • Bani March 8, 2018 at 4:26 am #

      After using the function to_vocabulary(),
      I am getting a vocabulary of size 24, which is far too small, even though I have followed the code line by line.
      Can you help?

      • Jason Brownlee March 8, 2018 at 6:36 am #

        Are you able to confirm that your Python is version 3.5+ and that you have the latest version of all libraries installed?

  8. Minel December 11, 2017 at 6:17 pm #

    Hi Jason,
    I am using your code step by step. There is a slight mistake:
    you wrote
    # save descriptions
    save_doc(descriptions, ‘descriptions.txt’)

    In fact, the right instruction is
    # save descriptions
    save_descriptions(descriptions, ‘descriptions.txt’)

    as you wrote in the final example

  9. Minel December 11, 2017 at 6:34 pm #

    Hi jason
    Another small detail. I had to write
    from pickle import load
    to run the instruction
    all_features = load(open(filename, ‘rb’))


  10. Minel December 11, 2017 at 9:32 pm #

    Hi Jason,
    I met some trouble running your code. I got a MemoryError on the instruction :
    return array(X1), array(X2), array(y)

    I am using a virtual machine with Linux (Debian), Python 3, with 32 GB of memory.
    Could you tell me what was the size of the memory on the computer you used to check your program ?


  11. Minel December 12, 2017 at 11:34 pm #

    Thanks for the advice. In fact, I upgraded the VM (64 GB, 16 cores) and it worked fine (using 45 GB of memory).

    • Jason Brownlee December 13, 2017 at 5:35 am #

      Nice! Glad to hear it.

      • Vineeth March 3, 2018 at 12:32 am #

        I get the same error even with 64GB VM :/ What to do

        • Jason Brownlee March 3, 2018 at 8:13 am #

          I’m sorry to hear that, perhaps there is something else going on with your workstation?

          I can confirm the example works on workstations and on EC2 instances with and without GPUs.

          • Vineeth March 3, 2018 at 10:06 pm #

            It’s throwing a Value error for input_1 after sometime. I tried everything i can but i am not able to understand. Can you paste the link of your project so i can compare ?

          • Jason Brownlee March 4, 2018 at 6:03 am #

            Are you able to confirm that your Python environment is up to date?

          • Vineeth March 3, 2018 at 10:26 pm #

            And sir, You said the pickle size must be about 127Mb but mine turns out to be above 700MB what did i do wrong ?

          • Jason Brownlee March 4, 2018 at 6:04 am #

            The size may be different on different platforms (macos/linux/windows).

  12. Josh Ash December 17, 2017 at 9:56 pm #

    Hi Jason – hello from Queensland 🙂
    Your tutorials on applied ML in Python are the best on the net hands down, thanks for putting them together!

  13. Madhivarman December 18, 2017 at 7:12 pm #

    Hi Jason. When I run the script my laptop freezes… I don’t know whether it’s training or not. Did anyone face this issue?


  14. Muhammad Awais December 20, 2017 at 3:36 pm #

    Thanks for such a great work. I found an error message when running a code
    FileNotFoundError: [Errno 2] No such file or directory: ‘descriptions.txt’
    Please help

    • Jason Brownlee December 20, 2017 at 3:50 pm #

      Ensure you generate the descriptions file before running the prior model – check the tutorial steps again and ensure you execute each in turn.

  15. Daniel F December 21, 2017 at 4:31 am #

    Hi Jason,

    I’m getting a MemoryError when I try to prepare the training sequences:

    Traceback (most recent call last):
    File “C:\Users\Daniel\Desktop\project\”, line 154, in
    X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
    File “C:\Users\Daniel\Desktop\project\”, line 104, in create_sequences
    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
    File “C:\Program Files\Anaconda3\lib\site-packages\keras\utils\”, line 24, in to_categorical
    categorical = np.zeros((n, num_classes))

    any advice? I have 8GB of RAM.
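A back-of-the-envelope estimate helps explain why the in-memory version exhausts RAM on small machines. The sketch below uses the figures reported earlier in this thread (306404 training sequences, a vocabulary of 7579 words, 4096 photo features); the element sizes are assumptions, namely 8-byte floats for the one-hot targets and 4-byte floats for the photo features.

```python
def array_gib(rows, cols, bytes_per_value):
    """Size in GiB of a dense array with the given shape and element size."""
    return rows * cols * bytes_per_value / 2 ** 30

n_samples = 306404   # number of training sequences reported earlier in the thread
vocab_size = 7579    # vocabulary size from the same error message
n_features = 4096    # photo feature length from the model summary

# Assumption: one-hot targets stored as float64 (8 bytes per value).
y_gib = array_gib(n_samples, vocab_size, 8)
# Assumption: photo features stored as float32 (4 bytes per value).
x1_gib = array_gib(n_samples, n_features, 4)

print('one-hot targets: %.1f GiB' % y_gib)
print('photo features:  %.1f GiB' % x1_gib)
```

Under these assumptions the final arrays alone come to more than 20 GiB, before counting the temporary Python lists that are built up first, which is consistent with the reports here of failures on 8 GB and 16 GB machines.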

  16. zonetrooper32 December 28, 2017 at 3:12 am #

    Hi Jason,

    Thank you for this amazing article about image captioning.

    Currently I am trying to re-implement the whole code, except that I am doing it in pure Tensorflow. I’m curious to see if my re-implementation is working as smooth as yours.

    Also a shower thought: it might be possible to get better vector representations for words by using pretrained word embeddings, for example GloVe 6B or the GoogleNews word2vec vectors. Learning embeddings from scratch with only 8k words might cost some performance.

    Again thank you for putting everything together, it will take quite some time to implement from scratch without your tutorial.

    • Jason Brownlee December 28, 2017 at 5:26 am #

      Try it and see if it lifts model skill. Let me know how you go.

  17. Sasikanth January 8, 2018 at 5:04 pm #

    Hello Jason,
    Is there a R package to perform modeling of images?


  18. Marco January 16, 2018 at 10:08 pm #

    Hi Jason! Thanks for your amazing tutorial! I have a question. I don’t understand the meaning of the number 1 on this line (extract_features):
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

    Can you explain me what reshape does and the meaning of the arguments?

    Thanks in advance.
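To illustrate the reshape line in question: the leading 1 adds a batch dimension, so a single image becomes a batch containing one image, which is the input layout Keras models generally expect. A minimal sketch with NumPy, using an all-zero array as a stand-in for a loaded photo:

```python
import numpy as np

# Stand-in for one loaded photo: height x width x colour channels.
image = np.zeros((224, 224, 3), dtype=np.float32)

# Same data, viewed as a batch that contains exactly one image.
batch = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

print(image.shape)
print(batch.shape)

# np.expand_dims is an equivalent way to add the leading dimension.
same = np.expand_dims(image, axis=0)
```

No pixel values change; only the shape of the array is different.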

  19. junhyung yu January 22, 2018 at 8:54 pm #

    Hi Jason! thank you for your great code.
    but i have one question.

    How long does it take to execute the code below?

    # define the model
    model = define_model(vocab_size, max_length)

    This code has still not finished on the third day.

    I think that “se3 = LSTM(256)(se2)” code in define_model function is causing the problem.

    My computer configuration is like this.

    Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz – 6 core
    Ram 62G
    GeForce GTX TITAN X – 2core

    please help me~~

    • Jason Brownlee January 23, 2018 at 7:55 am #

      Ouch, something is wrong.

      Perhaps try running on AWS?

      Perhaps try other models and test your rig/setup?

      Perhaps try fewer epochs or a smaller model to see if your setup can train the model at all?

      • junhyung yu January 23, 2018 at 3:29 pm #

        1. No. I tried running on my individual Linux server, using Jupyter notebook.

        2. No i am using only your code , no other model, no modify

        3.[X1train, X2train], ytrain, epochs=20, verbose=1, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

        This code has not yet been executed

        so I do not think epoch is a problem.

        • Jason Brownlee January 24, 2018 at 9:50 am #

          Perhaps run from the command line as a background process without notebook?

          Perhaps check memory usage and cpu/gpu utilization?

  20. krishna January 23, 2018 at 10:41 pm #

    ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

    hi sir… I am getting this error above when i run feature extract code.

    • Jason Brownlee January 24, 2018 at 9:55 am #

      Sorry, I have not seen that error.

    • Hiroshi February 26, 2018 at 1:01 pm #

      Hi Krishna,

      I’m also getting this error time to time. Were you able to solve this issue?

  21. Sathiya_Chakra January 28, 2018 at 7:05 am #

    Hi Jason!

    Is it possible to run this neural network on a 8GB RAM laptop with 2GB Graphics card with Intel core i5 processor?

    • Jason Brownlee January 28, 2018 at 8:28 am #


      You might need to adjust it to use progressive loading so that it does not try to hold the entire dataset in RAM.
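The idea behind progressive loading can be sketched as a Python generator that prepares the samples for one photo at a time instead of building every array up front. This is only an outline under stated assumptions: make_samples is a hypothetical callable standing in for whatever code turns one photo and its captions into model inputs and outputs.

```python
def data_generator(descriptions, make_samples):
    """Loop over photos forever, yielding the samples for one photo at a time.

    descriptions maps a photo id to its list of captions.
    make_samples is a hypothetical callable: (photo_id, captions) -> samples.
    """
    while True:
        for photo_id, captions in descriptions.items():
            # Only this photo's samples are held in memory at this point.
            yield make_samples(photo_id, captions)

# Toy usage: the "samples" here are just the photo id and its caption count.
toy = {'photo_a': ['startseq dog endseq'], 'photo_b': ['startseq cat endseq']}
gen = data_generator(toy, lambda photo_id, captions: (photo_id, len(captions)))
print(next(gen))
print(next(gen))
```

Keras 2.x accepted generators of this kind through fit_generator(), with steps_per_epoch telling it how many yields make up one epoch.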

  22. Ajit Tiwari January 29, 2018 at 10:46 pm #

    Hi Jason,
    Can you provide a link for the tokenizer as well as the model file.
    I Cannot train this model in my system but would like to see if I can use it to create an Android app

  23. Soumya February 1, 2018 at 10:19 pm #

    When I am running

    tokenizer = Tokenizer()

    I am getting error,

    Traceback (most recent call last):
    File “”, line 1, in
    NameError: name ‘Tokenizer’ is not defined

    How to solve this. Any idea please.

  24. Marco February 9, 2018 at 12:41 am #

    Hi Jason, thanks for the tutorial! I want to ask you if you could explain (or send me some links), to better understand, how exactly the fitting works.

    Example description: the girl is …

    The LSTM network during fitting takes the beginning of the sequence of my description (startseq) and it produces a vector with all possible subsequent words. This vector is combined with the vector of the input image features and it is passed within an FF layer where we then take the most probable word (with softmax). Is that right?

    At this point how does the fitting go on? Is the new sequence (e.g startseq – the) passed into the LSTM network, predicts all possible next words, etc.? Continuing this way up to endseq?

    If the network incorrectly generates the next word, what happens? How are the weights arranged? The fitting continues by taking in input “startseq – wrong_word” or continues with the correct one (eg startseq – the)?

    Thanks for your help
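On the fitting question, it may help to see how one caption turns into several training samples. In the tutorial's create_sequences approach the input prefix is always taken from the reference caption, not from the model's own earlier predictions, so a wrong prediction only affects the loss for that sample. A minimal sketch (the helper name expand_caption is hypothetical):

```python
def expand_caption(caption):
    """Split one caption into (input prefix, next word) training pairs."""
    words = caption.split()
    # Each prefix of the caption is paired with the word that follows it.
    return [(words[:i], words[i]) for i in range(1, len(words))]

for prefix, target in expand_caption('startseq the girl is endseq'):
    print(' '.join(prefix), '->', target)
```

Each pair is combined with the same photo features, and the weights are updated by backpropagation from the difference between the predicted word distribution and the true next word.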

  25. Sumit Das February 13, 2018 at 6:10 pm #

    Hi Jason, great article on the caption generator, I think the best available online so far. I am a newbie in ML (AI). I extracted the features and stored them in the features.pkl file, but I am getting a memory error in the create_sequences function. I can see you have suggested progressive loading, but I do not understand it properly. Could you suggest how to modify the current code to use progressive loading?

    [‎2/‎13/‎2018 12:34 PM] Sanchawat, Hardik:
    Using TensorFlow backend.
    Dataset: 6000
    Descriptions: train=6000
    Photos: train=6000
    Vocabulary Size: 7579
    Description Length: 34
    Traceback (most recent call last):
    File “C:\Users\hardik.sanchawat\Documents\Scripts\flickr\”, line 154, in
    X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
    File “C:\Users\hardik.sanchawat\Documents\Scripts\flickr\”, line 109, in create_sequences
    return array(X1), array(X2), array(y)

    My system configuration is :

    OS: Windows 10
    Processor: AMD A8 PRO-7150B R5, 10 Compute Cores 4C+6G 1.90 GHz
    Memory(RAM): 16 GB (14.9GB Usable)
    System type: 64-bit OS, x64-based processor

  26. Kavya February 14, 2018 at 8:35 am #

    Hi Jason,

    I am trying to use plot_model, but I am getting an error:

    raise ImportError(‘Failed to import pydot. You must install pydot’

    ImportError: Failed to import pydot. You must install pydot and graphviz for pydotprint to work.

    I tried
    conda install graphviz
    conda install pydotplus

    to install pydot.
    My Python version is 3.x
    Keras version is 2.1.3

    Could you please help me , to solve this problem

    • Jason Brownlee February 14, 2018 at 2:40 pm #

      I’m sorry to hear that.

      Perhaps the installed libraries are not available in your current Python environment?

      Perhaps try posting the error to stackoverflow? I’m not an expert at debugging workstations.

    • Vineeth February 14, 2018 at 5:13 pm #

      If you are on windows go here and install this, 2.38 stable msi file.

      after that, add the graphviz’s bin onto your system PATH variables. Restart your computer and the path should be picked up.

      Then you won’t have that error again.

      • Kavya February 17, 2018 at 2:36 pm #

        Thanks Vineeth,
        I am using Mac. I tried to use pydotplus, but it is still giving the same error.

    • Precious Angrish May 2, 2018 at 10:34 am #


      I am getting the same error, how did you fix it?

      Precious Angrish

    • Sayan May 14, 2018 at 3:05 am #

      Hey Kavya i assume this will surely resolve your error , as it also worked for me as well,

  27. Vineeth February 14, 2018 at 9:02 pm #

    I used Progressive Loading from This tutorial and updated the input layer to inputs1 = Input(shape=(224, 224, 3))

    And i got the error
    ValueError: Error when checking target: expected dense_3 to have 4 dimensions, but got array with shape (13, 4485)

    Then i updated to_categorical function as you mentioned and the error changed to this
    ValueError: Error when checking target: expected dense_3 to have 4 dimensions, but got array with shape (13, 1, 4485)

    Been trying to figure out the exact input shapes of the model since 2 days please help 🙁

    • Srinath Hanumantha Rao March 21, 2018 at 7:58 pm #

      Hey Vineeth!

      Were you able to solve this issue? I am stuck on this for a few days too.

      • Jason Brownlee March 22, 2018 at 6:21 am #

        Are you able to confirm your Python and Keras versions?

  28. Alex February 21, 2018 at 12:30 am #

    Hi Jason, why do you apply dropout to the input instead to applying it to the dense layer?

    • Jason Brownlee February 21, 2018 at 6:40 am #

      I used a little experimentation to come up with the model.

      Try changing it up and see if you can lift skill or reduce training time or model complexity Alex. I’m eager to hear how you go.

  29. Sunny February 28, 2018 at 7:23 am #

    Hi Jason,

    I just wanted to know: when you are loading the training data, you are tokenizing the train descriptions. But when you are working with test data, you are not tokenizing the test descriptions, instead working with the previous tokens. Shouldn’t the test descriptions be tokenized too before passing them to create_sequences for test?
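One way to think about this question: the tokenizer is a word-to-integer mapping learned from the training captions, and the test captions have to be encoded with that same mapping so the integers mean the same thing to the model. The toy sketch below is not the Keras Tokenizer (which orders words by frequency); the functions fit_word_index and encode are hypothetical stand-ins.

```python
def fit_word_index(captions):
    """Assign an integer to each distinct word, in order of first appearance."""
    index = {}
    for caption in captions:
        for word in caption.split():
            index.setdefault(word, len(index) + 1)
    return index

def encode(caption, word_index):
    """Encode a caption with an existing mapping, dropping unknown words."""
    return [word_index[w] for w in caption.split() if w in word_index]

word_index = fit_word_index(['startseq dog runs endseq'])  # learned from training data
print(encode('startseq cat runs endseq', word_index))      # test caption, same mapping
```

Fitting a second mapping on the test captions would assign different integers to the same words, so the trained model would read them incorrectly.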

  30. Hgarrison March 7, 2018 at 8:44 am #

    Hi Jason,

    This tutorial is of great help to us all, I think. I have a question: Does the model eventually learn to predict captions not present in the corpus? I mean, is it possible for the model to output sentences that are never seen before? In the example you give, the model predicted “startseq dog is running across the beach endseq”. Is this sentence found in the training corpus, or did the model make it up based on previous observations? And also, If it is possible for the model to combine sentences, how much training data do you think it needs to do that?

    • Jason Brownlee March 7, 2018 at 3:04 pm #

      The model attempts to generalize beyond what it has seen during training.

      In fact, this is the goal with a machine learning model.

      Nevertheless, the model will be bounded by the types of text and images seen during training, just not the specific combinations.

  31. Giuseppe March 8, 2018 at 12:05 am #

    Hi Jason, I have a question. What exactly is the LSTM used for? During fitting it takes an input (e.g. startseq – girl) and outputs a vector of 256 elements that contain the most probable words after the prefix? Is it trained through backpropagation? The purpose of the fitting is to make sure that given a prefix / input the LSTM gives me back a vector that represents “better” the possible following words (which are then merged with the features, etc …)

    • Jason Brownlee March 8, 2018 at 6:32 am #

      It is used for interpreting the text generated so far, needed to generate the next word.

  32. fatma March 16, 2018 at 8:16 pm #

    Hi Jason,

    for the line:

    features = dict()

    I got syntaxerror: invalid syntax

    How can I fix this error?

    • Jason Brownlee March 17, 2018 at 8:36 am #

      Perhaps double check that you have copied the code while maintaining white space?

      Perhaps confirm Python 3?

  33. fatma March 20, 2018 at 10:21 pm #

    Hi Jason,

    is the following line:

    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)

    means we will save the features of fc2 layer of the vgg16 model?

    • Jason Brownlee March 21, 2018 at 6:33 am #

      We are creating a new model without the last layer.

      • fatma March 21, 2018 at 3:54 pm #

        the new model doesn’t contain any fully connected layer because I read that we can extract the features from the fc2 layers of the pre-trained model also

        • fatma March 21, 2018 at 4:35 pm #

          when I run the line model.summary() I got the last layer is :

          block5_conv4 (Conv2D) (None, 14, 14, 512) 2359808

          but according to the VGG16 it should be

          fc2 (Dense) (None, 4096) 16781312 fc1[0][0]

          I don’t know where is the problem?

        • fatma March 23, 2018 at 9:27 pm #

          Hi Jason,

          how we can feed the saved features in the pickle file (features.pkl) to a linear regression model

          • Jason Brownlee March 24, 2018 at 6:27 am #

            That would be a lot of input features! Sorry, I don’t have a worked example.

  34. Akash March 21, 2018 at 7:04 am #

    ValueError: Error when checking input: expected input_1 to have shape (None, 4096) but got array with shape (0, 1)

    I am getting this error..can anyone help me understand and fix it?

    • Jason Brownlee March 21, 2018 at 3:03 pm #

      Are you able to confirm that you have Python3 and all libs up to date?

      • Akash March 21, 2018 at 9:16 pm #

        Yes, all my libraries are up to date, have checked.
        I solved the problem i posted before….my problem was in the data generator.
        I am using progressive loading.After fixing the problem i checked my inputs using this code:

        generator = data_generator(descriptions, tokenizer, max_length)
        inputs, outputs = next(generator)

        and it’s giving me an output like this:

        (13, 224, 224, 3)
        (13, 28)
        (13, 4485)

        but now it’s showing this error:
        ValueError: Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (8, 224, 224, 3)

        do i have to change the model architecture for progressive loading??

        NOTE:for progressive loading have used this code:

        • Steven March 22, 2018 at 9:57 pm #

          I am stuck with the same issue. The example above runs me into memory problems even when I tried it using AWS EC2 g2.2xlarge instance or a laptop with 16 GB RAM. So I tried the progressive loading example you referred to frequently but I have the same trouble with the input of the model. I tried to use inputs[0] as inputs1 for the define_model function but that returned the error ‘Error when checking input: expected input_13 to have 5 dimensions, but got array with shape (13, 224, 224, 3)’. Do I have to reshape input[0], or is the problem in inputs2?

          • Akash March 23, 2018 at 6:29 pm #

            I think the model architecture needs to be changed for the progressive loading example particularly the input shapes.

    • Harsha April 2, 2018 at 9:27 pm #

      getting the same error for me
      File “”, line 189, in[X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
      File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 1630, in fit
      File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 1476, in _standardize_user_data
      File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 123, in _standardize_input_data
      ValueError: Error when checking input: expected input_1 to have shape (4096,) but got array with shape (1,)

      • Jason Brownlee April 3, 2018 at 6:33 am #

        What version of libs are you using?

        Here’s what I’m running:

  35. Tanisha March 31, 2018 at 5:50 pm #

    Hi Jason,
    Thanks for the article.

    Due to lack of resources I tried running this in small amount of data.Everything worked fine but the generating new description part is giving this error.

    C:\Users\Tanisha\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\h5py\ FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
    from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    2018-03-31 12:07:43.176707: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    2018-03-31 12:07:43.574792: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\] Found device 0 with properties:
    name: GeForce 820M major: 2 minor: 1 memoryClockRate(GHz): 1.25
    pciBusID: 0000:08:00.0
    totalMemory: 2.00GiB freeMemory: 1.65GiB
    2018-03-31 12:07:43.584220: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\] Ignoring visible gpu device (device: 0, name: GeForce 820M, pci bus id: 0000:08:00.0, compute capability: 2.1) with Cuda compute capability 2.1. The minimum required Cuda capability is 3.0.
    Traceback (most recent call last):
    File “”, line 72, in
    description = generate_desc(model, tokenizer, photo, max_length)
    File “”, line 48, in generate_desc
    yhat = model.predict([photo,sequence], verbose=0)
    File “C:\Users\Tanisha\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\keras\engine\”, line 1817, in predict
    File “C:\Users\Tanisha\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\keras\engine\”, line 123, in _standardize_input_data
    ValueError: Error when checking : expected input_2 to have shape (25,) but got array with shape (34,)

    Any idea how can i fix this ?

    • Jason Brownlee April 1, 2018 at 5:46 am #

      Are you able to confirm that your Keras version and TF are up to date?

      Did you copy all of the code as is?

      • Tanisha April 5, 2018 at 11:52 am #

        Yeah those two are updated i just changed “max_length = 34” to “max_length = 25” in the code and now its working.
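The mismatch above arises because the caption length used at prediction time has to equal the one the model was trained with. Rather than hard-coding 34, it can be recomputed from the same descriptions used for training. A minimal sketch, assuming descriptions is a dict mapping each photo id to a list of caption strings (the function name max_caption_length is hypothetical):

```python
def max_caption_length(descriptions):
    """Length, in words, of the longest caption across all photos."""
    return max(
        len(caption.split())
        for captions in descriptions.values()
        for caption in captions
    )

toy = {
    'photo_a': ['startseq dog endseq', 'startseq a dog runs endseq'],
    'photo_b': ['startseq endseq'],
}
print(max_caption_length(toy))
```

When training on a subset of the data, this value can be smaller than the 34 reported for the full training set, as happened here with 25.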

  36. Harsha April 1, 2018 at 2:46 pm #

    i am getting this error
    X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
    File “”, line 109, in create_sequences
    return array(X1), array(X2), array(y)

  37. pramod choudhari April 1, 2018 at 4:07 pm #

    what backend are you using??

  38. anurag vats April 2, 2018 at 3:26 pm #

    can some one give me this file “model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5”
    my pc don’t have enough processing power .

  39. Harsha April 2, 2018 at 6:14 pm #

    File “”, line 189, in[X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
    File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 1522, in fit
    File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 1378, in _standardize_user_data
    File “C:\Users\pranyaram\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\”, line 144, in _standardize_input_data
    ValueError: Error when checking input: expected input_1 to have shape (None, 4096) but got array with shape (0, 1)

    • Jason Brownlee April 3, 2018 at 6:32 am #

      Are you able to confirm that you are using Python 3 and that your version of Keras is up to date?

      • Harsha April 3, 2018 at 2:31 pm #

        which keras version should i use

        • Jason Brownlee April 4, 2018 at 6:04 am #

          The most recent.

          • Harsha April 4, 2018 at 1:42 pm #

            I am still getting the same error. Could you check the model training file? How can I reduce the training size to avoid the memory error?

          • Jason Brownlee April 5, 2018 at 5:52 am #

            You can use progressive loading to reduce the memory requirements for the model.

            Update: I have updated the tutorial to include an example of training using progressive loading (a data generator).

  40. Lazuardi April 3, 2018 at 3:44 am #

    Hello, Jason! Thank you for your tutorial.

    I tried to use pre-trained model and copy-paste the code above to my Anaconda python 3.6 and Keras version of 2.1.5. First, it will run smoothly without any problem, and it begins to crawl on several image files. Unfortunately, after a while, I get this kind of error:

    “OSError: cannot identify image file ‘Flicker8k_Dataset/”

    Any idea what is wrong? I am running it on my laptop with GPU NVIDIA GeForce 1050 Ti with Intel Core i7-7700HQ with Windows 10 OS.

    Thank you in advance!

    • Jason Brownlee April 3, 2018 at 6:40 am #

      Looks like something very strange is going on.

      I have not seen this error. Perhaps try running from the command line; notebooks and IDEs often introduce new and crazy faults of their own.

  41. goutham April 4, 2018 at 1:48 pm #

    Using TensorFlow backend.
    Dataset: 6000
    Descriptions: train=6000
    Photos: train=6000
    Vocabulary Size: 7579
    Description Length: 34
    Traceback (most recent call last):
    File “”, line 154, in
    X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
    File “”, line 109, in create_sequences
    return array(X1), array(X2), array(y)

    how to reduce the training size to avoid this error.

    • Jason Brownlee April 5, 2018 at 5:52 am #

      You can use progressive loading to reduce the memory requirements for the model.

    • Belgaroui April 15, 2018 at 10:22 pm #

      I got the same error “OSError: cannot identify image file ‘Flicker8k_Dataset/desktop.ini'” did you fix it?

      • Jason Brownlee April 16, 2018 at 6:10 am #

        Looks like you have a windows file called desktop.ini in the directory for some reason. Delete it.
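As an alternative to deleting stray files by hand, the directory listing can be filtered so that only image files reach the feature extractor. A minimal sketch; the helper name list_image_files and the set of extensions are assumptions, not part of the tutorial code.

```python
import os

# Assumption: the photos in the dataset directory use one of these extensions.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def list_image_files(directory):
    """Return the image file names in a directory, skipping anything else."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
```

Looping over list_image_files(directory) instead of os.listdir(directory) would skip files such as desktop.ini without touching the folder.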

  42. harsha April 4, 2018 at 5:58 pm #

    Hi, can you provide me the weights file? My laptop has 12 GB RAM and NVIDIA GeForce 820M graphics, with all supported drivers, but I am getting the memory error issue.

    I have also tried progressive loading, but it is not working. It is not saving the weights file even after steps per epoch=70000 is completed. I can’t afford AWS.
    So, I request you to give me the weights file.
    Thanks in advance.

    • Jason Brownlee April 5, 2018 at 5:53 am #

      Sorry, I cannot share the weights file.

      I will schedule time into updating the tutorial to add a progressive loading example.

      Update: I have updated the tutorial to include an example of training using progressive loading (a data generator).

  43. manish April 5, 2018 at 12:58 am #

    I got an error while generating the captions.

    Here is the error:

    Traceback (most recent call last):
    File “”, line 64, in
    tokenizer = load(open(‘descriptions.txt’, ‘rb’))
    _pickle.UnpicklingError: could not find MARK

    • Jason Brownlee April 5, 2018 at 6:09 am #

      I have not seen this error before, sorry. Perhaps try running the code again?
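One plausible cause, not confirmed in this thread, is visible in the traceback itself: pickle's load() is being applied to descriptions.txt, which the tutorial writes with save_descriptions rather than with pickle. Unpickling a plain text file fails with an UnpicklingError. The sketch below shows a small check; try_unpickle is a hypothetical helper.

```python
import pickle

def try_unpickle(path):
    """Return (True, obj) if path holds a pickle, else (False, error message)."""
    try:
        with open(path, 'rb') as handle:
            return True, pickle.load(handle)
    except (pickle.UnpicklingError, EOFError) as err:
        # Raised when the file is plain text or empty rather than a pickle.
        return False, str(err)
```

If the check fails for the file being loaded as the tokenizer, it is worth confirming that the path points at the pickled tokenizer file and not at the text file of descriptions.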

  44. harsha April 5, 2018 at 4:50 am #

    startseq man in red shirt is standing on the street endseq

    caption is generating but it is giving same caption for different images.

  45. manish April 5, 2018 at 2:16 pm #

    val_loss improves for the first 3 epochs only; there is no improvement in further epochs.

    model-ep003-loss3.662-val_loss3.824.h5. This is the last epoch that has improved so far.

  46. SAI April 8, 2018 at 12:49 am #

    File “”, line 1, in
    runfile(‘C:/Users/Owner/.spyder-py3/ML/’, wdir=’C:/Users/Owner/.spyder-py3/ML’)

    File “C:\Users\Owner\Anaconda_3\lib\site-packages\spyder\utils\site\”, line 705, in runfile
    execfile(filename, namespace)

    File “C:\Users\Owner\Anaconda_3\lib\site-packages\spyder\utils\site\”, line 102, in execfile
    exec(compile(, filename, ‘exec’), namespace)

    File “C:/Users/Owner/.spyder-py3/ML/”, line 161, in
    model = define_model(vocab_size, max_length)

    File “C:/Users/Owner/.spyder-py3/ML/”, line 129, in define_model
    plot_model(model, to_file=’model.png’, show_shapes=True)

    File “C:\Users\Owner\Anaconda_3\lib\site-packages\keras\utils\”, line 135, in plot_model
    dot = model_to_dot(model, show_shapes, show_layer_names, rankdir)

    File “C:\Users\Owner\Anaconda_3\lib\site-packages\keras\utils\”, line 56, in model_to_dot

    File “C:\Users\Owner\Anaconda_3\lib\site-packages\keras\utils\”, line 31, in _check_pydot
    raise ImportError(‘Failed to import pydot. You must install pydot’

    ImportError: Failed to import pydot. You must install pydot and graphviz for pydotprint to work.

    getting this even if i installed pydot and graphviz

    • Jason Brownlee April 8, 2018 at 6:22 am #

      Perhaps restart your machine?

      Perhaps comment out the part where you visualize the model?

    • deep_ml April 9, 2018 at 3:23 am #

      getting same error!
      Tried using solution from stackoverflow, upgraded packages..but it ain’t working..

      • Jason Brownlee April 9, 2018 at 6:12 am #

        No problem, just skip that part and proceed. Comment out the plotting of the model.
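Instead of commenting the plotting line out, an optional step like this can be wrapped so that a missing dependency is reported and skipped. This is a generic sketch: call_if_available is a hypothetical helper, and it relies only on the fact, shown in the traceback above, that the failure surfaces as an ImportError.

```python
def call_if_available(func, *args, **kwargs):
    """Run an optional step, skipping it if a dependency cannot be imported."""
    try:
        return func(*args, **kwargs)
    except ImportError as err:
        # Training does not depend on this step, so report and carry on.
        print('Skipping optional step:', err)
        return None
```

With this in place, a call such as call_if_available(plot_model, model, to_file='model.png', show_shapes=True) would either write the diagram or print a message and let the rest of the script continue.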

  47. deep_ml April 9, 2018 at 4:06 pm #

    I have trained the data using progressive loading and I stopped after 4 iterations, with a loss of 3.4952.

    I am unable to understand this part,
    In this simple example we will discard the loading of the development dataset and model checkpointing and simply save the model after each training epoch. You can then go back and load/evaluate each saved model after training to find the one with the lowest loss that you can then use in the next section.

    Do you mean we have to load test set in the same way using progressive loading ?
    Please help me understanding how to load the test set.

    • Jason Brownlee April 10, 2018 at 6:15 am #

      I am suggesting that you may want to load the test data in the existing way and evaluate your model (next section).

  48. Jesia April 11, 2018 at 6:25 pm #

    Error when running “The complete code example is listed below.” in the Loading Data section:

    Message Body:
    Dataset: 6000
    Descriptions: train=6000
    Traceback (most recent call last):
    File “”, line 64, in
    train_features = load_photo_features(‘features.pkl’, train)
    File “”, line 53, in load_photo_features
    features = {k: all_features[k] for k in dataset}
    File “”, line 53, in
    features = {k: all_features[k] for k in dataset}
    KeyError: ‘878758390_dd2cdc42f6’

    • Jason Brownlee April 12, 2018 at 8:35 am #

      Perhaps confirm that you have the full dataset in place?

      • Jesia April 24, 2018 at 11:22 pm #

        Yes, some images were missing.

        Thank you
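
A `KeyError` like the one above means an identifier in the training set has no entry in the extracted features. A quick way to confirm which photos are affected is to collect the missing identifiers instead of failing on the first one. This is a minimal sketch; `select_features` is a hypothetical variant of the dictionary comprehension shown in the traceback, not the tutorial's own function.

```python
def select_features(all_features, dataset):
    """Split dataset identifiers into available features and missing ids."""
    # Keep only the identifiers that actually have extracted features
    features = {k: all_features[k] for k in dataset if k in all_features}
    # Report the rest so the missing images can be downloaded again
    missing = sorted(k for k in dataset if k not in all_features)
    return features, missing
```

If `missing` is not empty, the image folder is probably incomplete and the features need to be extracted again after restoring the files.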

  49. Belgaroui April 12, 2018 at 12:31 am #

    Hello sir, I’m learning from your articles, which I find very informative and educational. I’ve been trying to run this code:
    # extract features from all images
    directory = 'Flicker8k_Dataset'
    features = extract_features(directory)
    print('Extracted Features: %d' % len(features))
    # save to file
    dump(features, open('features.pkl', 'wb'))

    but an error occurred that I don’t understand. Can you help me fix it? Thank you.
    Here’s the error I got:
    PermissionError Traceback (most recent call last)
    in ()
    1 # extract features from all images
    2 directory = ‘Flicker8k_Dataset’
    —-> 3 features = extract_features(directory)
    4 print(‘Extracted Features: %d’ % len(features))
    5 # save to file

    in extract_features(directory)
    13 # load an image from file
    14 filename = directory + ‘/’ + name
    —> 15 image = load_img(filename, target_size=(224, 224))
    16 # convert the image pixels to a numpy array
    17 image = img_to_array(image)

    ~\Anaconda3\envs\envir1\lib\site-packages\keras\preprocessing\ in load_img(path, grayscale, target_size, interpolation)
    360 raise ImportError(‘Could not import PIL.Image. ‘
    361 ‘The use of array_to_img requires PIL.’)
    –> 362 img =
    363 if grayscale:
    364 if img.mode != ‘L’:

    ~\Anaconda3\envs\envir1\lib\site-packages\PIL\ in open(fp, mode)
    2547 if filename:
    -> 2548 fp =, “rb”)
    2549 exclusive_fp = True

    PermissionError: [Errno 13] Permission denied: ‘Flicker8k_Dataset/Flicker8k_Dataset’

    • Jason Brownlee April 12, 2018 at 8:47 am #

      Looks like the dataset is missing or is not available on your workstation.

  50. Seaf April 13, 2018 at 1:33 am #

    Hello sir, thanks for your effort.

    I have trained on the data using progressive loading and my machine restarted after 11 iterations.
    How can I continue training from that checkpoint?

    • Jason Brownlee April 13, 2018 at 6:42 am #

      Load the last saved model, then continue training. As simple as that.

      I doubt more than a handful of epochs is required on this problem.

      • Seaf April 13, 2018 at 12:44 pm #

        thank you !

        I have loaded the last model (‘model_11.h5’), which had a loss of 3.445. Now training continues with a loss of 5.4461. Is that normal?

        • Jason Brownlee April 13, 2018 at 3:32 pm #

          Interesting, that is a little surprising. I wonder if there is a fault or if indeed the model loss has gotten worse.

          Some careful experiments may be required.
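
When resuming, it helps to pick the last saved file by its epoch number rather than by name, because a plain string sort would place `model_9.h5` after `model_11.h5`. This is a minimal sketch that assumes the files follow the `model_<epoch>.h5` pattern mentioned in this thread; `latest_checkpoint` is a hypothetical helper.

```python
import re


def latest_checkpoint(filenames):
    """Return (filename, epoch) of the highest-numbered model_<epoch>.h5 file."""
    pattern = re.compile(r'^model_(\d+)\.h5$')
    found = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            # Compare epochs as integers, not as strings
            found.append((int(match.group(1)), name))
    if not found:
        return None
    epoch, name = max(found)
    return name, epoch
```

The returned file can then be loaded with Keras' `load_model` and training continued from epoch `epoch + 1`, saving under the next file name so earlier checkpoints are not overwritten.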

  51. Belgaroui April 13, 2018 at 3:07 am #

    Thank you, I think so too…

    I have already downloaded Flicker8k_Dataset and extracted it into the same folder where I work with Jupyter Notebook.

    I consulted Google and YouTube to try to fix this error, but in vain…

    Could you be so kind as to direct me and help me fix the problem?
    Thank you very much for your efforts…

    • Jason Brownlee April 13, 2018 at 6:43 am #

      What problem?

      • Belgaroui April 14, 2018 at 12:46 am #

        Hi Jason,
        When I try to run the code that extracts features from all images, I get a “Permission denied” error. You told me earlier that it looks like the dataset is missing or is not available on my workstation. I tried to fix it, but in vain.
        Do you have any idea how I could do that?
        Do I need a user permission or something like that?
        Or maybe I need to reload the dataset?

        The error:
        ~\Anaconda3\envs\envir1\lib\site-packages\PIL\ in open(fp, mode)2546
        2547 if filename:
        -> 2548 fp =, “rb”)
        2549 exclusive_fp = True

        PermissionError: [Errno 13] Permission denied: ‘Flicker8k_Dataset/Flicker8k_Dataset’

        thanks a lot 🙂 🙂

        • Jason Brownlee April 14, 2018 at 6:47 am #

          You appear to have a problem loading the data from your hard drive. Perhaps you stored the data in a location where you/your code does not have permission to read?

          Perhaps you are using a notebook or an IDE as another user?

          Try running from the command line and check file permissions.
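
The path in the error, ‘Flicker8k_Dataset/Flicker8k_Dataset’, is the directory name joined with an entry of the same name. One possible explanation, which is an assumption and not confirmed in this thread, is that the archive was extracted into a nested folder and the code then tried to open that folder as an image; on Windows, opening a directory as a file typically fails with "Permission denied". A minimal sketch of listing only regular image files follows; `list_image_files` is a hypothetical helper, not part of the tutorial.

```python
import os


def list_image_files(directory, extensions=('.jpg', '.jpeg', '.png')):
    """Return sorted paths of regular image files directly inside directory."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip sub-folders (such as a nested dataset folder) and non-images
        if os.path.isfile(path) and name.lower().endswith(extensions):
            paths.append(path)
    return paths
```

If this returns an empty list for `'Flicker8k_Dataset'`, the images are probably one level deeper and the `directory` variable should point at the inner folder.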

  52. @nkish April 14, 2018 at 4:56 pm #

    Thanks Jason. I really appreciate your knowledge and the way you express it to us through your articles, it’s amazing.

  53. Abdallah April 14, 2018 at 7:15 pm #

    Thank you very much, Mr. Jason, but I have a problem when making the model prediction after downloading the pretrained model:

    FailedPreconditionError Traceback (most recent call last)
    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in _do_call(self, fn, *args)
    1349 try:
    -> 1350 return fn(*args)
    1351 except errors.OpError as e:

    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
    1328 feed_dict, fetch_list, target_list,
    -> 1329 status, run_metadata)

    ~/.local/lib/python3.6/site-packages/tensorflow/python/framework/ in __exit__(self, type_arg, value_arg, traceback_arg)
    472 compat.as_text(c_api.TF_Message(self.status.status)),
    –> 473 c_api.TF_GetCode(self.status.status))
    474 # Delete the underlying status object from memory otherwise it stays alive

    FailedPreconditionError: Attempting to use uninitialized value block1_conv2_5/kernel
    [[Node: block1_conv2_5/kernel/read = Identity[T=DT_FLOAT, _class=[“loc:@block1_conv2_5/kernel”], _device=”/job:localhost/replica:0/task:0/device:CPU:0″](block1_conv2_5/kernel)]]

    During handling of the above exception, another exception occurred:

    FailedPreconditionError Traceback (most recent call last)
    in ()
    24 return features
    25 directory = ‘../ProjectPattern/Flickr8k_Dataset/Flicker8k_Dataset’
    —> 26 features =extract_feature(directory)
    27 dump(features,open(“feature.pkl”,”wb”))

    in extract_feature(directory)
    17 img =preprocess_input(img)
    18 #extract feature by make prediction use the pretrained model
    —> 19 feature = model.predict(img,verbose=0)
    20 #extract img_id
    21 img_id = name.split(‘.’)[0]

    ~/.local/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/ in predict(self, x, batch_size, verbose, steps)
    1811 f = self.predict_function
    1812 return self._predict_loop(
    -> 1813 f, ins, batch_size=batch_size, verbose=verbose, steps=steps)
    1815 def train_on_batch(self, x, y, sample_weight=None, class_weight=None):

    ~/.local/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/ in _predict_loop(self, f, ins, batch_size, verbose, steps)
    1306 else:
    1307 ins_batch = _slice_arrays(ins, batch_ids)
    -> 1308 batch_outs = f(ins_batch)
    1309 if not isinstance(batch_outs, list):
    1310 batch_outs = [batch_outs]

    ~/.local/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/ in __call__(self, inputs)
    2551 session = get_session()
    2552 updated =
    -> 2553 fetches=fetches, feed_dict=feed_dict, **self.session_kwargs)
    2554 return updated[:len(self.outputs)]

    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in run(self, fetches, feed_dict, options, run_metadata)
    893 try:
    894 result = self._run(None, fetches, feed_dict, options_ptr,
    –> 895 run_metadata_ptr)
    896 if run_metadata:
    897 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in _run(self, handle, fetches, feed_dict, options, run_metadata)
    1126 if final_fetches or final_targets or (handle and feed_dict_tensor):
    1127 results = self._do_run(handle, final_targets, final_fetches,
    -> 1128 feed_dict_tensor, options, run_metadata)
    1129 else:
    1130 results = []

    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
    1342 if handle is None:
    1343 return self._do_call(_run_fn, self._session, feeds, fetches, targets,
    -> 1344 options, run_metadata)
    1345 else:
    1346 return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

    ~/.local/lib/python3.6/site-packages/tensorflow/python/client/ in _do_call(self, fn, *args)
    1361 except KeyError:
    1362 pass
    -> 1363 raise type(e)(node_def, op, message)
    1365 def _extend_graph(self):

    FailedPreconditionError: Attempting to use uninitialized value block1_conv2_5/kernel
    [[Node: block1_conv2_5/kernel/read = Identity[T=DT_FLOAT, _class=[“loc:@block1_conv2_5/kernel”], _device=”/job:localhost/replica:0/task:0/device:CPU:0″](block1_conv2_5/kernel)]]

    Caused by op ‘block1_conv2_5/kernel/read’, defined at:
    File “/usr/lib/python3.6/”, line 193, in _run_module_as_main
    “__main__”, mod_spec)
    File “/usr/lib/python3.6/”, line 85, in _run_code
    exec(code, run_globals)
    File “/home/abdo96/.local/lib/python3.6/site-packages/”, line 16, in
    File “/home/abdo96/.local/lib/python3.6/site-packages/traitlets/config/”, line 658, in launch_instance
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 478, in start
    File “/home/abdo96/.local/lib/python3.6/site-packages/zmq/eventloop/”, line 177, in start
    super(ZMQIOLoop, self).start()
    File “/home/abdo96/.local/lib/python3.6/site-packages/tornado/”, line 888, in start
    handler_func(fd_obj, events)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tornado/”, line 277, in null_wrapper
    return fn(*args, **kwargs)
    File “/home/abdo96/.local/lib/python3.6/site-packages/zmq/eventloop/”, line 440, in _handle_events
    File “/home/abdo96/.local/lib/python3.6/site-packages/zmq/eventloop/”, line 472, in _handle_recv
    self._run_callback(callback, msg)
    File “/home/abdo96/.local/lib/python3.6/site-packages/zmq/eventloop/”, line 414, in _run_callback
    callback(*args, **kwargs)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tornado/”, line 277, in null_wrapper
    return fn(*args, **kwargs)
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 233, in dispatch_shell
    handler(stream, idents, msg)
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 399, in execute_request
    user_expressions, allow_stdin)
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
    File “/home/abdo96/.local/lib/python3.6/site-packages/ipykernel/”, line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
    File “/home/abdo96/.local/lib/python3.6/site-packages/IPython/core/”, line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
    File “/home/abdo96/.local/lib/python3.6/site-packages/IPython/core/”, line 2850, in run_ast_nodes
    if self.run_code(code, result):
    File “/home/abdo96/.local/lib/python3.6/site-packages/IPython/core/”, line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
    File “”, line 26, in
    features =extract_feature(directory)
    File “”, line 2, in extract_feature
    model = VGG19()
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/applications/”, line 117, in VGG19
    x = Conv2D(64, (3, 3), activation=’relu’, padding=’same’, name=’block1_conv2′)(x)
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/engine/”, line 590, in __call__[0])
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/layers/”, line 138, in build
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/legacy/”, line 91, in wrapper
    return func(*args, **kwargs)
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/engine/”, line 414, in add_weight
    File “/home/abdo96/.local/lib/python3.6/site-packages/keras/backend/”, line 392, in variable
    v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/ops/”, line 229, in __init__
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/ops/”, line 376, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name=”read”)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/ops/”, line 127, in identity
    return gen_array_ops.identity(input, name=name)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/ops/”, line 2134, in identity
    “Identity”, input=input, name=name)
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/framework/”, line 787, in _apply_op_helper
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/framework/”, line 3160, in create_op
    File “/home/abdo96/.local/lib/python3.6/site-packages/tensorflow/python/framework/”, line 1625, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

    FailedPreconditionError (see above for traceback): Attempting to use uninitialized value block1_conv2_5/kernel
    [[Node: block1_conv2_5/kernel/read = Identity[T=DT_FLOAT, _class=[“loc:@block1_conv2_5/kernel”], _device=”/job:localhost/replica:0/task:0/device:CPU:0″](block1_conv2_5/kernel)]]

    • Jason Brownlee April 15, 2018 at 6:25 am #

      Wow. I have not seen this before, sorry.

      Perhaps try searching or posting on stackoverflow?

      • Abdallah April 17, 2018 at 9:18 pm #

        The problem was solved by specifying the weights: instead of None (random initialization), I used the weights pretrained on ‘imagenet’ and set the include_top argument to True.

  54. Abdallah April 15, 2018 at 9:45 am #

    When using merged inputs in the model, the error below is shown.
    Thanks in advance.

    in ()
    29 plot_model(model,to_file=’model.png’,show_shapes=True,show_layer_names=True)
    30 return model
    —> 31 define_model(vocab_size,max_len)

    in define_model(vocab_size, max_length)
    26 model = Model(inputs=[input1,input2],outputs=output)
    —> 28 model.compile(loss=’categorical_crossentropy’,optimizer=’Adam’)(mask)
    29 plot_model(model,to_file=’model.png’,show_shapes=True,show_layer_names=True)
    30 return model

    ~/.local/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/ in compile(self, optimizer, loss, metrics, loss_weights, sample_weight_mode, weighted_metrics, target_tensors, **kwargs)
    680 # Prepare output masks.
    –> 681 masks = self.compute_mask(self.inputs, mask=None)
    682 if masks is None:
    683 masks = [None for _ in self.outputs]

    ~/.local/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/ in compute_mask(self, inputs, mask)
    785 return self._output_mask_cache[cache_key]
    786 else:
    –> 787 _, output_masks = self._run_internal_graph(inputs, masks)
    788 return output_masks

    ~/.local/lib/python3.6/site-packages/tensorflow/python/layers/ in _run_internal_graph(self, inputs, masks)
    897 # Apply activity regularizer if any:
    –> 898 if layer.activity_regularizer is not None:
    899 regularization_losses = [
    900 layer.activity_regularizer(x) for x in computed_tensors

    AttributeError: ‘InputLayer’ object has no attribute ‘activity_regularizer’

    • Jason Brownlee April 16, 2018 at 6:01 am #

      What version of Keras are you using?

      Did you copy all of the code exactly?

      • Abdallah April 16, 2018 at 7:07 pm #

        I used version 2.1.5.
        As for the other question: no, I didn’t copy all of the code exactly. I understood the idea, imitated it in some parts, and wrote other parts on my own.

        • Jason Brownlee April 17, 2018 at 5:56 am #

          Sorry, I cannot help you debug your own modifications.

          • Abdallah April 17, 2018 at 9:10 pm #

            I posted this problem on Stack Overflow but no one answered, so I will try to fix it on my own. Thank you for your answers.

          • Jason Brownlee April 18, 2018 at 8:04 am #

            Hang in there.

  55. prateek bansal April 21, 2018 at 4:03 pm #

    Hi Jason Brownlee, thanks for this fantastic article.
    I am curious to know how the while loop gets stopped in the progressive-loading data generator function.
    Please explain this to me.

    def data_generator(descriptions, photos, tokenizer, max_length):

    # loop for ever over images
    while 1:
    for key, desc_list