How to Prepare a Photo Caption Dataset for Training a Deep Learning Model

By Jason Brownlee on August 7, 2019 in Deep Learning for Natural Language Processing

Automatic photo captioning is a problem where a model must generate a human-readable textual description given a photograph.

It is a challenging problem in artificial intelligence that requires both image understanding from the field of computer vision as well as language generation from the field of natural language processing.

It is now possible to develop your own image caption models using deep learning and freely available datasets of photos and their descriptions.

In this tutorial, you will discover how to prepare photos and textual descriptions ready for developing a deep learning automatic photo caption generation model.

After completing this tutorial, you will know:

About the Flickr8K dataset comprised of more than 8,000 photos and up to 5 captions for each photo.
How to generally load and prepare photo and text data for modeling with deep learning.
How to specifically encode data for two different types of deep learning models in Keras.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Nov/2017: Fixed small typos in the code in the “Whole Description Sequence Model” section. Thanks Moustapha Cheikh and Matthew.
Update Feb/2019: Provided direct links for the Flickr8k_Dataset dataset, as the official site was taken down.

Photo by beverlyislike, some rights reserved.

Tutorial Overview

This tutorial is divided into 8 parts; they are:

Download the Flickr8K Dataset
How to Load Photographs
Pre-Calculate Photo Features
How to Load Descriptions
Prepare Description Text
Whole Description Sequence Model
Word-By-Word Model
Progressive Loading

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed. You can use Python 2, but you may need to change some of the examples.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Download the Flickr8K Dataset

A good dataset to use when getting started with image captioning is the Flickr8K dataset.

The reason is that it is realistic and relatively small so that you can download it and build models on your workstation using a CPU.

The definitive description of the dataset is in the paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics” from 2013.

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

…

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

— Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: “Please do not redistribute the dataset”.

You can use the link below to request the dataset:

Dataset Request Form

Within a short time, you will receive an email that contains links to two files:

Flickr8k_Dataset.zip (1 Gigabyte) An archive of all photographs.
Flickr8k_text.zip (2.2 Megabytes) An archive of all text descriptions for photographs.

UPDATE (Feb/2019): The official site seems to have been taken down (although the form still works). Here are some direct download links from my datasets GitHub repository:

Download the datasets and unzip them into your current working directory. You will have two directories:

Flicker8k_Dataset: Contains 8092 photographs in jpeg format.
Flickr8k_text: Contains a number of files containing different sources of descriptions for the photographs.

Next, let’s look at how to load the images.

How to Load Photographs

In this section, we will develop some code to load the photos for use with the Keras deep learning library in Python.

The image file names are unique image identifiers. For example, here is a sample of image file names:

990890291_afc72be141.jpg
99171998_7cc800ceef.jpg
99679241_adc853a5c0.jpg
997338199_7343367d7f.jpg
997722733_0cb5439472.jpg

Keras provides the load_img() function that can be used to load an image file directly as a PIL image object.

from keras.preprocessing.image import load_img
image = load_img('990890291_afc72be141.jpg')

The pixel data needs to be converted to a NumPy array for use in Keras.

We can use the Keras img_to_array() function to convert the loaded data.

from keras.preprocessing.image import img_to_array
image = img_to_array(image)

We may want to use a pre-defined feature extraction model, such as a state-of-the-art deep image classification network trained on ImageNet.

The Oxford Visual Geometry Group (VGG) model is popular for this purpose and is available in Keras.

If we decide to use this pre-trained model as a feature extractor in our model, we can preprocess the pixel data for the model by using the preprocess_input() function in Keras, for example:

from keras.applications.vgg16 import preprocess_input

# reshape data into a single sample of an image
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
# prepare the image for the VGG model
image = preprocess_input(image)

We may also want to force the loaded photo to have the pixel dimensions expected by the VGG model, which are 224 x 224 pixels. We can do that in the call to load_img(), for example:

image = load_img('990890291_afc72be141.jpg', target_size=(224, 224))

We may want to extract the unique image identifier from the image filename. We can do that by splitting the filename string by the ‘.’ (period) character and retrieving the first element of the resulting list:

image_id = filename.split('.')[0]
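
Splitting on the first period works for the Flickr8K filenames shown above. As a hedged alternative, the standard library's os.path.splitext() strips only the final extension, which may be safer if a filename ever contains more than one period. The helper name image_id_from_filename() is hypothetical and is not part of the tutorial's code.

```python
from os.path import basename, splitext

# hypothetical helper: recover the image identifier from a file path
def image_id_from_filename(filename):
    # drop any directory prefix, then strip only the final extension
    return splitext(basename(filename))[0]

image_id = image_id_from_filename('Flicker8k_Dataset/990890291_afc72be141.jpg')
print(image_id)
```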

We can tie all of this together and develop a function that, given the name of the directory containing the photos, will load and pre-process all of the photos for the VGG model and return them in a dictionary keyed on their unique image identifiers.

from os import listdir
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

def load_photos(directory):
	images = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get image id
		image_id = name.split('.')[0]
		images[image_id] = image
	return images

# load images
directory = 'Flicker8k_Dataset'
images = load_photos(directory)
print('Loaded Images: %d' % len(images))

Running this example prints the number of loaded images. It takes a few minutes to run.

Loaded Images: 8091

If you do not have the RAM to hold all images (about 5GB by my estimation), then you can add an if-statement to break the loop early after 100 images have been loaded, for example:

if (len(images) >= 100):
	break
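
The memory estimate can be sanity-checked with some quick arithmetic. The sketch below assumes each prepared photo is a 224 x 224 x 3 array of 32-bit floats (an assumption: this is what img_to_array() returns under the default Keras configuration) and ignores the overhead of the dictionary itself.

```python
# rough memory estimate for holding all prepared photos in RAM
num_images = 8091                 # photos loaded in the example above
values_per_image = 224 * 224 * 3  # height x width x color channels
bytes_per_value = 4               # assumed float32 pixel values

total_bytes = num_images * values_per_image * bytes_per_value
print('Approximate size: %.2f GB' % (total_bytes / 1e9))
```

Under these assumptions the total comes to a little under 5GB, in line with the estimate above.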

Pre-Calculate Photo Features

It is possible to use a pre-trained model to extract the features from photos in the dataset and store the features to file.

This is an efficiency that means that the language part of the model that turns features extracted from the photo into textual descriptions can be trained standalone from the feature extraction model. The benefit is that the very large pre-trained models do not need to be loaded, held in memory, and used to process each photo while training the language model.

Later, the feature extraction model and language model can be put back together for making predictions on new photos.

In this section, we will extend the photo loading behavior developed in the previous section to load all photos, extract their features using a pre-trained VGG model, and store the extracted features to a new file that can be loaded and used to train the language model.

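The first step is to load the VGG model. This model is provided directly in Keras and can be loaded as follows. Note that the first time it is run, this will download the model weights to your computer, which may take a few minutes. The full model is about 500 megabytes; the download is smaller when the top layers are excluded, as they are here.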

from keras.applications.vgg16 import VGG16
from keras.layers import Input
# load the model without the classifier layers
in_layer = Input(shape=(224, 224, 3))
model = VGG16(include_top=False, input_tensor=in_layer)
print(model.summary())

This will load the VGG 16-layer model.

The two Dense output layers as well as the classification output layer are removed from the model by setting include_top=False. The output from the final pooling layer is taken as the features extracted from the image.

Next, we can walk over all images in the directory of images as in the previous section and call the predict() function on the model for each prepared image to get the extracted features. The features can then be stored in a dictionary keyed on the image id.

The complete example is listed below.

from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.layers import Input

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	in_layer = Input(shape=(224, 224, 3))
	model = VGG16(include_top=False, input_tensor=in_layer)
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
with open('features.pkl', 'wb') as f:
	dump(features, f)

The example may take some time to complete, perhaps one hour.

After all features are extracted, the dictionary is stored in the file ‘features.pkl‘ in the current working directory.

These features can then be loaded later and used as input for training a language model.
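
Loading the saved features later might look like the sketch below. The load_photo_features() helper and its optional ids argument are hypothetical names. So that the snippet runs on its own, it first writes a tiny stand-in dictionary to a temporary file; in practice you would pass the path to the ‘features.pkl‘ file created above.

```python
from os import remove
from os.path import join
from pickle import dump, load
from tempfile import gettempdir

# hypothetical helper: load saved features, optionally for a subset of ids
def load_photo_features(filename, ids=None):
    # load the dictionary of image id -> feature array
    with open(filename, 'rb') as f:
        all_features = load(f)
    # optionally keep only the requested image ids (e.g. a training split)
    if ids is None:
        return all_features
    return {k: all_features[k] for k in ids if k in all_features}

# stand-in data for illustration only, not real VGG features
demo = {'photo_a': [0.1, 0.2], 'photo_b': [0.3, 0.4]}
demo_path = join(gettempdir(), 'demo_features.pkl')
with open(demo_path, 'wb') as f:
    dump(demo, f)

subset = load_photo_features(demo_path, ids={'photo_a'})
print('Loaded features: %d' % len(subset))
# clean up the temporary demo file
remove(demo_path)
```

As with any pickle file, only load files that you created yourself or otherwise trust.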

You could experiment with other types of pre-trained models in Keras.

How to Load Descriptions

It is important to take a moment to talk about the descriptions; there are a number available.

The file Flickr8k.token.txt contains a list of image identifiers (used in the image filenames) and tokenized descriptions. Each image has multiple descriptions.

Below is a sample of the descriptions from the file showing 5 different descriptions for a single image.

1305564994_00513f9a5b.jpg#0 A man in street racer armor be examine the tire of another racer 's motorbike .
1305564994_00513f9a5b.jpg#1 Two racer drive a white bike down a road .
1305564994_00513f9a5b.jpg#2 Two motorist be ride along on their vehicle that be oddly design and color .
1305564994_00513f9a5b.jpg#3 Two person be in a small race car drive by a green hill .
1305564994_00513f9a5b.jpg#4 Two person in race uniform in a street car .

The file ExpertAnnotations.txt indicates which of the descriptions for each image were written by “experts” and which were written by crowdsource workers asked to describe the image.

Finally, the file CrowdFlowerAnnotations.txt provides the frequency of crowd workers that indicate whether captions suit each image. These frequencies can be interpreted probabilistically.

The authors of the paper describe the annotations as follows:

… annotators were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects). We collected multiple captions for each image because there is a considerable degree of variance in the way many images can be described.

— Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.

There are also lists of the photo identifiers to use in a train/test split so that you can compare results reported in the paper.

The first step is to decide which captions to use. The simplest approach is to use the first description for each photograph.

First, we need a function to load the entire annotations file (‘Flickr8k.token.txt‘) into memory. Below is a function to do this called load_doc() that, given a filename, will return the document as a string.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

We can see from the sample of the file above that we need only split each line by white space and take the first element as the image identifier and the rest as the image description. For example:

# split line by white space
tokens = line.split()
# take the first token as the image id, the rest as the description
image_id, image_desc = tokens[0], tokens[1:]

We can then clean up the image identifier by removing the filename extension and the description number.

# remove filename extension and description number from image id
image_id = image_id.split('.')[0]

We can also put the description tokens back together into a string for later processing.

# convert description tokens back to string
image_desc = ' '.join(image_desc)

We can put all of this together into a function.

The load_descriptions() function below takes the loaded file, processes it line-by-line, and returns a dictionary that maps image identifiers to their first description.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# store the first description for each image
		if image_id not in mapping:
			mapping[image_id] = image_desc
	return mapping

filename = 'Flickr8k_text/Flickr8k.token.txt'
doc = load_doc(filename)
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Running the example prints the number of loaded image descriptions.

Loaded: 8092

There are other ways to load descriptions that may turn out to be more accurate for the data.

Use the above example as a starting point and let me know what you come up with.
Post your approach in the comments below.
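
As one possible variation, the sketch below keeps every description for each photo rather than only the first. The function name load_all_descriptions() is hypothetical, and the two sample lines are copied from the excerpt shown earlier so that the snippet runs without the dataset.

```python
# hypothetical variation: map each image id to a list of all its descriptions
def load_all_descriptions(doc):
    mapping = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        # skip blank lines and lines without a description
        if len(tokens) < 2:
            continue
        # drop the '.jpg#N' suffix to recover the image id
        image_id = tokens[0].split('.')[0]
        # collect all descriptions for the image in a list
        mapping.setdefault(image_id, list()).append(' '.join(tokens[1:]))
    return mapping

sample = "\n".join([
    "1305564994_00513f9a5b.jpg#0 A man in street racer armor be examine the tire of another racer 's motorbike .",
    "1305564994_00513f9a5b.jpg#1 Two racer drive a white bike down a road .",
])
all_descriptions = load_all_descriptions(sample)
print(len(all_descriptions['1305564994_00513f9a5b']))
```

Keeping several descriptions per photo gives the model more training text, at the cost of a larger vocabulary to manage.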

Prepare Description Text

The descriptions are tokenized; this means that each description is comprised of tokens separated by white space.

It also means that punctuation marks are separated out as their own tokens, such as periods (‘.’) and the apostrophe-s used for possessives (‘s).

It is a good idea to clean up the description text before using it in a model. Some data cleaning operations we can perform include:

Normalizing the case of all tokens to lowercase.
Removing all punctuation from tokens.
Removing all tokens that contain one or fewer characters (after punctuation is removed), e.g. ‘a’ and hanging ‘s’ characters.

We can implement these simple cleaning operations in a function that cleans each description in the loaded dictionary from the previous section. The clean_descriptions() function below cleans each loaded description in place. It relies on the string module from the standard library.

import string

# clean description text
def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)
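
To see the effect of these steps, the sketch below applies the same operations to one of the raw sample descriptions from earlier. The standalone clean_text() helper is a hypothetical name used only so that the snippet runs on its own.

```python
import string

# hypothetical helper: apply the cleaning steps to a single description
def clean_text(desc):
    # translation table that deletes all ASCII punctuation
    table = str.maketrans('', '', string.punctuation)
    # tokenize and convert to lower case
    words = [word.lower() for word in desc.split()]
    # remove punctuation from each token
    words = [word.translate(table) for word in words]
    # drop tokens with one or fewer characters
    words = [word for word in words if len(word) > 1]
    return ' '.join(words)

raw = "A man in street racer armor be examine the tire of another racer 's motorbike ."
cleaned = clean_text(raw)
print(cleaned)
```

The leading ‘a’, the hanging ‘s’ and the final period are all dropped, leaving only the content words.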

We can then save the clean text to file for later use by our model.

Each line will contain the image identifier followed by the clean description. The save_doc() function below saves the cleaned descriptions to file.

# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

Putting this all together with the loading of descriptions from the previous section, the complete example is listed below.

import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# store the first description for each image
		if image_id not in mapping:
			mapping[image_id] = image_desc
	return mapping

# clean description text
def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)

# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
# save descriptions
save_doc(descriptions, 'descriptions.txt')

Running the example first loads 8,092 descriptions, cleans them, summarizes the vocabulary of 4,484 unique words, then saves them to a new file called ‘descriptions.txt‘.

Loaded: 8092
Vocabulary Size: 4484

Open the new file ‘descriptions.txt‘ in a text editor and review the contents. You should see somewhat readable descriptions of photos ready for modeling.

...
3139118874_599b30b116 two girls pose for picture at christmastime
2065875490_a46b58c12b person is walking on sidewalk and skeleton is on the left inside of fence
2682382530_f9f8fd1e89 man in black shorts is stretching out his leg
3484019369_354e0b88c0 hockey team in red and white on the side of the ice rink
505955292_026f1489f2 boy rides horse

The vocabulary is still relatively large. To make modeling easier, especially the first time around, I would recommend further reducing the vocabulary by removing words that only appear once or twice across all descriptions.
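
A minimal sketch of this kind of vocabulary reduction is shown below, assuming descriptions are stored in a dictionary of image identifier to cleaned description string as above. The function name reduce_vocabulary() and the min_count parameter are hypothetical, and the toy descriptions are for illustration only.

```python
from collections import Counter

# hypothetical helper: drop words that appear fewer than min_count times
def reduce_vocabulary(descriptions, min_count=3):
    # count how often each word appears across all descriptions
    counts = Counter()
    for desc in descriptions.values():
        counts.update(desc.split())
    # keep only the words that appear at least min_count times
    keep = {word for word, n in counts.items() if n >= min_count}
    reduced = dict()
    for key, desc in descriptions.items():
        reduced[key] = ' '.join(w for w in desc.split() if w in keep)
    return reduced

# toy descriptions for illustration only
toy = {
    'img1': 'dog runs on grass',
    'img2': 'dog jumps on grass',
    'img3': 'dog sleeps on sofa',
}
reduced_toy = reduce_vocabulary(toy, min_count=2)
print(reduced_toy)
```

With the default min_count of 3, words that appear only once or twice are removed. The toy call uses a lower threshold because the sample is so small.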

Whole Description Sequence Model

There are many ways to model the caption generation problem.

One naive way is to create a model that outputs the entire textual description in a one-shot manner.

This is a naive model because it puts a heavy burden on the model to both interpret the meaning of the photograph and generate words, then arrange those words into the correct order.

This is not unlike the language translation problem used in an Encoder-Decoder recurrent neural network where the entire translated sentence is output one word at a time given an encoding of the input sequence. Here we would use an encoding of the image to generate the output sentence instead.

The image may be encoded using a pre-trained model used for image classification, such as the VGG model trained on the ImageNet dataset mentioned above.

The output of the model would be a sequence of probability distributions, one for each word position, each over all words in the vocabulary. The sequence would be as long as the longest photo description.

The descriptions would, therefore, need to be first integer encoded where each word in the vocabulary is assigned a unique integer and sequences of words would be replaced with sequences of integers. The integer sequences would then need to be one hot encoded to represent the idealized probability distribution over the vocabulary for each word in the sequence.
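
To make the two encoding steps concrete, here is a small sketch in plain Python on two toy descriptions. It is an illustration of the idea only, not the Keras tooling used in this tutorial: integers are assigned here in order of first appearance, whereas the Keras Tokenizer orders words by frequency. The names word_to_int and one_hot() are hypothetical.

```python
# toy descriptions for illustration only
toy_descriptions = ['boy rides horse', 'boy rides bike fast']

# assign each word a unique integer, reserving 0 for padding
word_to_int = dict()
for desc in toy_descriptions:
    for word in desc.split():
        if word not in word_to_int:
            word_to_int[word] = len(word_to_int) + 1
vocab_size = len(word_to_int) + 1

# replace each sequence of words with a sequence of integers
encoded = [[word_to_int[w] for w in desc.split()] for desc in toy_descriptions]

def one_hot(sequence, vocab_size):
    # one row per word, with a single 1 at the index of that word
    rows = list()
    for index in sequence:
        row = [0] * vocab_size
        row[index] = 1
        rows.append(row)
    return rows

print(encoded)
print(one_hot(encoded[0], vocab_size))
```

Each row of the one hot encoding is the idealized probability distribution for one word position: all of the probability mass sits on the correct word.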

We can use tools in Keras to prepare the descriptions for this type of model.

The first step is to load the mapping of image identifiers to clean descriptions stored in ‘descriptions.txt‘.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))

Running this piece loads the 8,092 photo descriptions into a dictionary keyed on image identifiers. These identifiers can then be used to load each photo file for the corresponding inputs to the model.

Loaded 8092
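
If your copy of the descriptions file ends with a newline or contains blank lines, the tokens[0] lookup in the function above would fail on the empty line. The sketch below shows one possible guard. It parses an in-memory string rather than a file, and the function name parse_descriptions() and the sample identifiers are hypothetical, chosen only for this illustration.

```python
# parse 'image_id description...' lines, skipping blank or incomplete lines
def parse_descriptions(doc):
    descriptions = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        # need at least an identifier and one word of description
        if len(tokens) < 2:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        descriptions[image_id] = ' '.join(image_desc)
    return descriptions

# hypothetical sample text with a blank line and a trailing newline
sample = 'photo_a boy rides horse\n\nphoto_b dog runs on grass\n'
parsed = parse_descriptions(sample)
print('Loaded %d' % len(parsed))
```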

Next, we need to extract all of the description text so we can encode it.

# extract all text
desc_text = list(descriptions.values())

We can use the Keras Tokenizer class to consistently map each word in the vocabulary to an integer. First, the object is created, then is fit on the description text. The fit tokenizer can later be saved to file for consistent decoding of the predictions back to vocabulary words.

from keras.preprocessing.text import Tokenizer
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
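
To see roughly what the fitted tokenizer holds, the sketch below builds a similar word-to-integer mapping by hand: words are counted, then numbered from 1 in order of decreasing frequency. This is a simplified stand-in written for this example, not the Keras implementation (the real class also lowercases text and strips punctuation by default). It shows why 1 is added to the vocabulary size: index 0 is never assigned to a word, which leaves it free for padding.

```python
from collections import Counter

# toy descriptions standing in for desc_text
texts = ['boy rides horse', 'dog chases boy', 'boy runs']

# count how often each word occurs
counts = Counter(word for text in texts for word in text.split())

# number the words from 1, most frequent first
ranked = sorted(counts, key=lambda word: counts[word], reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# index 0 is reserved, so the one hot vectors need one extra position
vocab_size = len(word_index) + 1
print(word_index['boy'])
print('Vocabulary Size: %d' % vocab_size)
```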

Next, we can use the fit tokenizer to encode the photo descriptions into sequences of integers.

# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)

The model will require all output sequences to have the same length for training. We can achieve this by padding all encoded sequences to have the same length as the longest encoded sequence. We can pad the sequences with 0 values after the list of words. Keras provides the pad_sequences() function to pad the sequences.

from keras.preprocessing.sequence import pad_sequences
# pad all sequences to a fixed length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
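
For intuition, post-padding can be sketched in a few lines of plain Python. The helper pad_post() below is a hand-written illustration that assumes no sequence is longer than maxlen; it is not how pad_sequences() is implemented.

```python
# pad integer sequences with zeros at the end, up to a fixed length
def pad_post(sequences, maxlen):
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

# toy integer encoded descriptions of different lengths
toy_sequences = [[1, 2, 3], [4, 1], [1, 5, 6, 7]]
toy_max_length = max(len(s) for s in toy_sequences)
toy_padded = pad_post(toy_sequences, toy_max_length)
for row in toy_padded:
    print(row)
```

Every row now has the same length, with zeros filling the positions after the last real word.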

Finally, we can one hot encode the padded sequences to have one sparse vector for each word in the sequence. Keras provides the to_categorical() function to perform this operation.

from keras.utils import to_categorical
# one hot encode
y = to_categorical(padded, num_classes=vocab_size)

Once encoded, we can ensure that the sequence output data has the right shape for the model.

y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)

Putting all of this together, the complete example is listed below.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# pad all sequences to a fixed length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
# one hot encode
y = to_categorical(padded, num_classes=vocab_size)
y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)

Running the example first prints the number of loaded image descriptions (8,092 photos), the dataset vocabulary size (4,485 words), the length of the longest description (28 words), then finally the shape of the data for fitting a prediction model in the form [samples, sequence length, features].

Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(8092, 28, 4485)
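
The shape also gives a feel for how large this representation is. As a back-of-the-envelope check, and assuming each value is stored as a 32-bit float (an assumption; the actual data type depends on your configuration), the size of the array can be estimated as follows.

```python
# shape of the one hot encoded output reported above
samples, max_length, vocab_size = 8092, 28, 4485

# assumption: 4 bytes per value (32-bit floats)
bytes_per_value = 4

n_values = samples * max_length * vocab_size
n_bytes = n_values * bytes_per_value
print('Values: %d' % n_values)
print('Size: %.2f GB' % (n_bytes / 1e9))
```

Under that assumption the output array alone holds about a billion values, or roughly 4 GB.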

As mentioned, outputting the entire sequence may be challenging for the model.

We will look at a simpler model in the next section.

Word-By-Word Model

A simpler model for generating a caption for photographs is to generate one word given both the image as input and the last word generated.

This model would then have to be called recursively to generate each word in the description with previous predictions as input.

Using the previously generated word as input gives the model a forced context for predicting the next word in the sequence.

This is the model used in prior research, such as:

Show and Tell: A Neural Image Caption Generator, 2015.

A word embedding layer can be used to represent the input words. Like the feature extraction model for the photos, this too can be pre-trained either on a large corpus or on the dataset of all descriptions.

The model would take a full sequence of words as input; the length of the sequence would be the maximum length of descriptions in the dataset.

The model must be started with something. One approach is to surround each photo description with special tags to signal the start and end of the description, such as ‘STARTDESC’ and ‘ENDDESC’.

For example, the description:

boy rides horse

Would become:

STARTDESC boy rides horse ENDDESC

And would be fed to the model with the same image input to result in the following input-output word sequence pairs:

Input (X),                       Output (y)
STARTDESC,                       boy
STARTDESC, boy,                  rides
STARTDESC, boy, rides,           horse
STARTDESC, boy, rides, horse,    ENDDESC

The data preparation would begin much the same as was described in the previous section.

Each description must be integer encoded. After encoding, the sequences are split into multiple input and output pairs and only the output word (y) is one hot encoded. This is because the model is only required to predict the probability distribution of one word at a time.
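
Before moving to integers, it may help to see the splitting on words. The sketch below wraps a description in the start and end tags and lists the resulting input-output pairs. The helper names wrap_description() and make_pairs() are made up for this example.

```python
# wrap a description in start and end tokens
def wrap_description(desc):
    return 'STARTDESC ' + desc + ' ENDDESC'

# split a list of words into (input words, output word) pairs
def make_pairs(words):
    pairs = list()
    for i in range(1, len(words)):
        pairs.append((words[:i], words[i]))
    return pairs

words = wrap_description('boy rides horse').split()
pairs = make_pairs(words)
for in_words, out_word in pairs:
    print(in_words, '->', out_word)
```

A wrapped description of five tokens gives four pairs, and the last pair asks the model to predict the end tag.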

The code is the same up to the point where we calculate the maximum length of sequences.

...
descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)

Next, we split each integer encoded sequence into input and output pairs.

Let’s step through a single sequence called seq at the i’th word in the sequence, where i >= 1.

First, we take the first i words (seq[:i]) as the input sequence and the word that follows them (seq[i]) as the output word.

# split into input and output pair
in_seq, out_seq = seq[:i], seq[i]

Next, the input sequence is padded to the maximum length of the input sequences. Pre-padding is used (the default) so that new words appear at the end of the sequence, instead of the beginning of the input.

# pad input sequence
in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
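
The effect of pre-padding is easiest to see on a growing input sequence. The helper pad_pre() below is a plain Python sketch written for this example (it assumes the sequence is no longer than maxlen) and is not the Keras implementation.

```python
# pad one integer sequence with zeros at the front, up to a fixed length
def pad_pre(seq, maxlen):
    return [0] * (maxlen - len(seq)) + seq

toy_max_length = 5
# growing input sequences for one description
for in_seq in ([7], [7, 2], [7, 2, 9]):
    print(pad_pre(in_seq, toy_max_length))
```

The most recent word always sits in the last position, and the zeros shrink from the left as the sequence grows.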

The output word is one hot encoded, much like in the previous section.

# encode output sequence
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]

We can put all of this together into a complete example to prepare description data for the word-by-word model.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)

X, y = list(), list()
for img_no, seq in enumerate(sequences):
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# split into input and output pair
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		X.append(in_seq)
		y.append(out_seq)

# convert to numpy arrays
X, y = array(X), array(y)
print(X.shape)
print(y.shape)

Running the example prints the same statistics, followed by the shapes of the resulting encoded input and output sequences.

Note that the image inputs must follow exactly the same ordering, with the same photo provided for every example drawn from a single description. One way to do this would be to load the photo and store a copy of it for each example prepared from a single description.

Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(66456, 28)
(66456, 4485)
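
The first dimension of both arrays is the total number of input-output pairs. Because the inner loop runs over range(1, len(seq)), a description of n words contributes n - 1 pairs. The sketch below checks that count on toy data; the helper name count_pairs() is made up for this example.

```python
# number of input-output pairs produced by looping over range(1, len(seq))
def count_pairs(sequences):
    return sum(max(len(seq) - 1, 0) for seq in sequences)

# toy integer encoded descriptions of length 3, 2 and 4
toy_sequences = [[1, 2, 3], [4, 1], [1, 5, 6, 7]]
total = count_pairs(toy_sequences)
print(total)
```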

Progressive Loading

The Flickr8K dataset of photos and descriptions can fit into RAM if you have a lot of it (e.g. 8 gigabytes or more), and most modern systems do.

This is fine if you want to fit a deep learning model using the CPU.

Alternatively, if you want to fit a model using a GPU, then you will not be able to fit the data into the memory of an average GPU video card.

One solution is to progressively load the photos and descriptions as-needed by the model.
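
A quick estimate shows why loading everything at once becomes a problem. The sketch below assumes photos are stored as 224x224 RGB arrays of 32-bit floats, as prepared for the VGG model, and uses the photo and example counts reported earlier in this tutorial; treat the results as rough figures only.

```python
# assumption: 224x224 RGB pixels stored as 32-bit floats
values_per_photo = 224 * 224 * 3
bytes_per_value = 4

# one copy of every described photo in the dataset
n_photos = 8092
all_photos_bytes = n_photos * values_per_photo * bytes_per_value
print('All photos: %.1f GB' % (all_photos_bytes / 1e9))

# one copy of the photo per input-output pair (word-by-word model)
n_pairs = 66456
all_pairs_bytes = n_pairs * values_per_photo * bytes_per_value
print('One copy per example: %.1f GB' % (all_pairs_bytes / 1e9))
```

Storing one copy of each photo is manageable, but storing a copy for every word-by-word example is not, which is the case progressive loading is meant to handle.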

Keras supports progressively loaded datasets by using the fit_generator() function on the model. A generator is the term used to describe a function used to return batches of samples for the model to train on. This can be as simple as a standalone function, the name of which is passed to the fit_generator() function when fitting the model.

As a reminder, a model is fit for multiple epochs, where one epoch is one pass through the entire training dataset, such as all photos. One epoch is comprised of multiple batches of examples where the model weights are updated at the end of each batch.

A generator must create and yield one batch of examples. For example, the average sentence length in the dataset is 11 words; that means that each photo will result in 11 examples for fitting the model and two photos will result in about 22 examples on average. A good default batch size for modern hardware may be 32 examples, so that is about 2-3 photos worth of examples.

We can write a custom generator to load a few photos and return the samples as a single batch.

Let’s assume we are working with a word-by-word model described in the previous section that expects a sequence of words and a prepared image as input and predicts a single word.

Let’s design a data generator that, given a loaded dictionary of image identifiers to clean descriptions, a trained tokenizer, and a maximum sequence length, will load one image's worth of examples for each batch.

A generator must loop forever and yield each batch of samples. If generators and yield are new concepts for you, consider reading this article:

Python Generators
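
The outer infinite loop matters because training asks the generator for batches across many epochs; without it, the generator would be exhausted after a single pass over the photos. The sketch below shows the pattern on a plain list. The function name cycle_forever() and the item names are made up for this example.

```python
from itertools import islice

# a generator that cycles over a list of items forever
def cycle_forever(items):
    while True:
        for item in items:
            yield item

gen = cycle_forever(['photo_a', 'photo_b', 'photo_c'])
# take five items; the generator wraps around after the third
first_five = list(islice(gen, 5))
print(first_five)
```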

We can loop forever with a while loop and within this, loop over each image in the image directory. For each image filename, we can load the image and create all of the input-output sequence pairs from the image’s description.

Below is the data generator function.

# data generator; relies on listdir from os and on the load_photo() and create_sequences() functions
def data_generator(mapping, tokenizer, max_length):
	# loop forever over images
	directory = 'Flicker8k_Dataset'
	while 1:
		for name in listdir(directory):
			# load an image from file
			filename = directory + '/' + name
			image, image_id = load_photo(filename)
			# create word sequences
			desc = mapping[image_id]
			in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
			yield [[in_img, in_seq], out_word]

You could extend it to take the name of the dataset directory as a parameter.
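
One possible shape for that extension is sketched below. It only covers the file-listing part: the directory is passed in as a parameter, and files that have no entry in the descriptions mapping are skipped rather than raising a KeyError. The function name iter_described_files() and the placeholder file names are hypothetical; in the real generator, this loop would sit inside the while loop and feed load_photo() and create_sequences().

```python
import tempfile
from os import listdir
from os.path import join, splitext

# yield (image_id, description, path) for each file that has a description
def iter_described_files(directory, mapping):
    for name in sorted(listdir(directory)):
        image_id = splitext(name)[0]
        # skip files that are not in the descriptions mapping
        if image_id not in mapping:
            continue
        yield image_id, mapping[image_id], join(directory, name)

# small demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['photo_a.jpg', 'photo_b.jpg', 'notes.txt']:
        open(join(tmp, name), 'w').close()
    mapping = {'photo_a': 'boy rides horse', 'photo_b': 'dog runs'}
    found = [(image_id, desc) for image_id, desc, _ in iter_described_files(tmp, mapping)]
print(found)
```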

The generator yields a list containing the inputs (X) and output (y) for the model. The input is itself a list with two items: the input images and the encoded word sequences. The outputs are one hot encoded words.

You can see that it calls a function called load_photo() to load a single photo and return the pixels and image identifier. This is a simplified version of the photo loading function developed at the beginning of this tutorial.

# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)[0]
	# get image id
	image_id = filename.split('/')[-1].split('.')[0]
	return image, image_id
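
The image identifier is simply the file name without its directory and extension. As an alternative sketch, the same result can be obtained with the standard os.path helpers, which split on the last dot rather than the first. The function name image_id_from_path() and the example file name are made up for this illustration.

```python
from os.path import basename, splitext

# derive the image identifier from a path: drop the directory and the extension
def image_id_from_path(filename):
    return splitext(basename(filename))[0]

example = 'Flicker8k_Dataset/example_photo.jpg'
print(image_id_from_path(example))
```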

Another function named create_sequences() is called to create sequences of images, input sequences of words, and output words that we then yield to the caller. This is a function that includes everything discussed in the previous section, and also creates copies of the image pixels, one for each input-output pair created from the photo’s description.

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store one copy of the image for each input-output pair
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return [Ximages, XSeq, y]

Prior to preparing the model that uses the data generator, we must load the clean descriptions, prepare the tokenizer, and calculate the maximum sequence length. All three must be passed to the data_generator() as parameters.

We use the same load_clean_descriptions() function developed previously and a new create_tokenizer() function that simplifies the creation of the tokenizer.

Tying all of this together, the complete data generator is listed below, ready for use to train a model.

from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# store
		descriptions[image_id] = ' '.join(image_desc)
	return descriptions

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = list(descriptions.values())
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)[0]
	# get image id
	image_id = filename.split('/')[-1].split('.')[0]
	return image, image_id

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return [Ximages, XSeq, y]

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, tokenizer, max_length):
	# loop for ever over images
	directory = 'Flicker8k_Dataset'
	while 1:
		for name in listdir(directory):
			# load an image from file
			filename = directory + '/' + name
			image, image_id = load_photo(filename)
			# create word sequences
			desc = descriptions[image_id]
			in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
			yield [[in_img, in_seq], out_word]

# load mapping of ids to descriptions
descriptions = load_clean_descriptions('descriptions.txt')
# integer encode sequences of words
tokenizer = create_tokenizer(descriptions)
# pad to fixed length
max_length = max(len(s.split()) for s in list(descriptions.values()))
print('Description Length: %d' % max_length)

# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

A data generator can be tested by calling the next() function.

We can test the generator as follows.

# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

Running the example prints the shape of the input and output example for a single batch (e.g. 13 input-output pairs):

(13, 224, 224, 3)
(13, 28)
(13, 4485)

The generator can be used to fit a model by calling the fit_generator() function on the model (instead of fit()) and passing in the generator.

We must also specify the number of steps, or batches, per epoch. Because this generator yields one batch per photo, a reasonable value is the number of photos in the training dataset, e.g. 7,000 if 7,000 images are used for training.

# define model
# ...
# fit model
model.fit_generator(data_generator(descriptions, tokenizer, max_length), steps_per_epoch=7000, ...)

Summary

In this tutorial, you discovered how to prepare photos and textual descriptions ready for developing an automatic photo caption generation model.

Specifically, you learned:

About the Flickr8K dataset comprised of more than 8,000 photos and up to 5 captions for each photo.
How to generally load and prepare photo and text data for modeling with deep learning.
How to specifically encode data for two different types of deep learning models in Keras.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

78 Responses to How to Prepare a Photo Caption Dataset for Training a Deep Learning Model

Adel November 15, 2017 at 9:13 am #

Is this topic included in your new book ?

- Jason Brownlee November 15, 2017 at 9:59 am #
  
  Yes, I have a suite of chapters on developing a caption generation model.
  
  - Shaurya Pratap Singh November 9, 2018 at 4:02 pm #
    
    Hi Jason, awesome content, but I am not able to understand why you used while 1: on line 78 of the full code.
    Wouldn't it work the same way without using while 1?
    
    Thanks!
    def data_generator(descriptions, tokenizer, max_length):
        # loop for ever over images
        directory = 'Flicker8k_Dataset'
        while 1:  # <-- line of doubt
            for name in listdir(directory):
                # load an image from file
                filename = directory + '/' + name
                image, image_id = load_photo(filename)
                # create word sequences
                desc = descriptions[image_id]
                in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
                yield [[in_img, in_seq], out_word]
    
    Reply
    - Jason Brownlee November 10, 2018 at 5:56 am #
      
      It is a Python generator, you can learn more about generators here:
      https://wiki.python.org/moin/Generators
      
      Reply
Emil November 15, 2017 at 8:50 pm #

This is brilliant!!! Thanks for putting this together – thoroughly appreciated!

Reply
- Jason Brownlee November 16, 2017 at 10:27 am #
  
  You’re welcome, I’m glad it helped!
  
  Reply
  - Ankit February 10, 2018 at 7:48 pm #
    
    Hi Jason, I find your work very helpful. Have you also implemented the bottom-up approach (dense captioning) to generating image captions?
    
    Reply
    - Jason Brownlee February 11, 2018 at 7:55 am #
      
      Does this help:
      https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
      
      Reply
      - Ankit April 10, 2018 at 6:24 pm #
        
        Hi Jason, this is the top-down approach. The bottom-up approach is different and is also called dense captioning, where we identify objects in an image and then combine them to form a description.
      - Jason Brownlee April 11, 2018 at 6:32 am #
        
        Thanks.
Joe Weber November 17, 2017 at 7:14 am #

Hi Jason, isn't the data generator function supposed to call load_photo() instead of load_image()?

Reply
- Jason Brownlee November 17, 2017 at 9:30 am #
  
  In the full example, the data generator does call load_photo() on line 82.
  
  Reply
Irjam November 18, 2017 at 1:54 am #

Hi Jason, I am a newbie in Python and CNN. Can I have testing source code in which I input an image and it gives output with the caption?

Reply
- Jason Brownlee November 18, 2017 at 10:20 am #
  
  Yes, I will have some on the blog soon and in my new book on deep learning for NLP to be released soon.
  
  Reply
Moustapha Cheikh November 18, 2017 at 9:05 am #

Enjoyed reading your articles. You really explain everything in detail; it couldn't be written much better! The article is very interesting and effective.

Note: There is a typing error the first time you mention load_clean_descriptions

mapping[image_id] = ' '.join(image_desc)

should be

descriptions[image_id] = ' '.join(image_desc)

Thanks for sharing such interesting blog.

Reply
- Jason Brownlee November 18, 2017 at 10:28 am #
  
  Fixed, thanks!
  
  Reply
Matthew November 28, 2017 at 12:49 pm #

Hi,
enjoy following your blog.

I’m seeing an error here

def save_doc(descriptions, filename):
    lines = list()
    for key, desc in mapping.items():

this threw me for a bit until I saw mapping is returned from
def load_descriptions(doc)

fix below

def save_doc(descriptions, filename):
    lines = list()
    for key, desc in descriptions.items():

replaces
    for key, desc in mapping.items():

Reply
- Jason Brownlee November 29, 2017 at 8:16 am #
  
  Ouch, I’ve fixed that example too, cheers.
  
  Reply
Ranjith December 4, 2017 at 3:31 am #

Hi Jason, when I run the descriptions, I am getting the following error,
FileNotFoundError: [Errno 2] No such file or directory: ‘Flickr8k_text/Flickr8k.token.txt’, can you please help me with this please, I am very new to deep learning.

Reply
- Jason Brownlee December 4, 2017 at 7:59 am #
  
  You must download the dataset and place it in the same directory as the code.
  
  Try running from the command line, sometimes IDEs and Notebooks can mask or introduce errors.
  
  Reply
Felix Fu January 2, 2018 at 9:03 pm #

Hi Jason, thanks for this awesome post, I really enjoyed reading it. By the way, I think there is a typo when you talk about the number of steps per epoch. I think it should read “perhaps 70,000 if 7,000 images are used for training.”.

Reply
- Jason Brownlee January 3, 2018 at 5:35 am #
  
  Thanks, fixed.
  
  Reply
Joan January 8, 2018 at 8:03 am #

Hi Jason,
When I run dump(features, open('features.pkl', 'wb')), I get the following error: "feature.pkl is not UTF-8 encoded"
I also tried to dump the output of the predict function using only the first image.
It looked like this:
{‘667626_18933d713e’: array([[[[ 0. , 0. , 0. , …, 0. ,
10.62594891, 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
…,
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
9.41605377, 0. ]],

[[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 5.36805296],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
…,
[ 0. , 0. , 0. , …, 1.45877278,
0. , 39.37923431],
[ 0. , 0. , 0. , …, 0. ,
0. , 1.39090693],
[ 0. , 0. , 0. , …, 0. ,
3.93747687, 0. ]],

[[ 0. , 0. , 0. , …, 0. ,
18.81423187, 0. ],
[ 0. , 0. , 0. , …, 7.79979277,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 9.14055347],
…,
[ 0. , 0. , 0. , …, 48.84911346,
0. , 12.12792015],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
2.0710113 , 0. ]],

…,
[[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 3.75439334, 0. , …, 0. ,
0. , 0. ],
[ 3.71412587, 0. , 0. , …, 0. ,
0. , 0. ],
…,
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
18.80825424, 0. ],
[ 0. , 0. , 0. , …, 0. ,
13.0358696 , 0. ]],

[[ 0. , 0. , 0. , …, 0. ,
4.03412676, 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
…,
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0. ,
7.99308109, 0. ],
[ 0. , 0. , 0. , …, 0. ,
32.52854919, 0. ]],

[[ 0. , 0. , 0. , …, 0. ,
33.73991013, 0. ],
[ 0. , 0. , 0. , …, 0. ,
14.52160454, 0. ],
[ 0. , 0. , 0. , …, 0. ,
4.05761242, 0. ],
…,
[ 0. , 0. , 0. , …, 0. ,
0. , 0. ],
[ 0. , 0. , 0. , …, 0.90452403,
0. , 0. ],
[ 0. , 0. , 0. , …, 29.89839745,
38.23991394, 0. ]]]], dtype=float32)}
And I was confused about whether this result correct or not.
Could you help me with this?
I really don’t know why my feature.pkl can save successfully.
Thank you so much.

.

Reply
- Jason Brownlee January 8, 2018 at 3:52 pm #
  
  Sorry, I have not seen that error before. Perhaps post the full error message to Stack Overflow?
  
  Reply
  - Joan January 9, 2018 at 1:54 pm #
    
    Thank you for the prompt reply.I’ll try.
    
    Reply
Karthik January 19, 2018 at 9:45 pm #

Hello Jason,

I’m able to download dataset, the link is unavailable. Could you please help me here.

Thanks,
Karthik

Reply
- Karthik January 19, 2018 at 9:45 pm #
  
  *not able to
  
  Reply
  - Karthik January 19, 2018 at 9:54 pm #
    
    Jason, I am able to download it now. Please ignore my comments above.
    
    Thanks,Karthik
    
    Reply
    - Jason Brownlee January 20, 2018 at 8:19 am #
      
      No problem.
      
      Reply
- Jason Brownlee January 20, 2018 at 8:19 am #
  
  You must fill out this form:
  https://forms.illinois.edu/sec/1713398
  
  Reply
Karthik February 7, 2018 at 6:37 pm #

Jason,

I got model-ep005-loss3.517-val_loss4.012 .

Another good article.

Thanks ,
Karthik

Reply
- Jason Brownlee February 8, 2018 at 8:24 am #
  
  Very Nice!
  
  Reply
- Harsha April 1, 2018 at 3:14 pm #
  
  karthik can u send me a link to download model file
  
  Reply
  - deep_ml April 9, 2018 at 5:10 pm #
    
    Karthik can you provide link to download model file ?
    
    Reply
rhinorn February 10, 2018 at 2:24 am #

Hi, Jason
I am using GPU to fit the model, but it takes too loooooooooooooong time!
More or less than 9300 seconds for each epoch.
My hardware: NVIDIA GTX 850M (compute capability 5.0), GPU memory 4GiB
and my computer Memory is 8GiB
OS: Ubuntu 16.04
If i use the cpu mode, I got the Memory Error:
================= Error ===============
Traceback (most recent call last):
File “ICmodel.py”, line 217, in
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions,train_features)
File “ICmodel.py”, line 162, in create_sequences
return array(X1),array(X2),array(y)
MemoryError
===============End of Error==============
So I have to use my gpu to run the training program. Here is my code after modifying yours above, is there any incorrect modification?
==================== Code ====================
def data_generator(mapping, tokenizer, max_length, features):
    # loop for ever over images
    directory = 'Flickr8k_Dataset'
    while 1:
        for name in listdir(directory):
            # load an image from file
            filename = directory + '/' + name
            image_id = name.split('.')[0]
            # create word sequences
            if image_id not in mapping:
                continue
            desc_list = mapping[image_id]
            img_feature = features[image_id][0]
            in_img, in_seq, out_word = create_sequences4list(tokenizer, max_length, desc_list, img_feature)
            yield [[in_img, in_seq], out_word]

# create sequences of feature, input sequences and output words for an image
def create_sequences4list(tokenizer, max_length, desc_list, photo):
    Xfe, XSeq, y = list(), list(), list()
    vocab_size = len(tokenizer.word_index) + 1
    # integer encode the description
    for desc in desc_list:
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # select
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            Xfe.append(photo)
            XSeq.append(in_seq)
            y.append(out_seq)
    Xfe, XSeq, y = array(Xfe), array(XSeq), array(y)
    return [Xfe, XSeq, y]
======================End of Code =================
Time-consuming running is disaster, could you give me some advice? thx.

Reply
- Jason Brownlee February 10, 2018 at 8:58 am #
  
  You might need more RAM. Perhaps change the code to use progressive loading?
  
  Reply
rhinorn February 10, 2018 at 2:56 pm #

Thank you for your reply.
Does progressive loading mean using a Python generator? What I posted above is exactly the generator function and the create_sequences function adapted for the generator. Sorry for the disappeared indents…
What confuses me is whether I need to yield one description at a time or yield all five descriptions for one photo at once.

Reply
- Jason Brownlee February 11, 2018 at 7:52 am #
  
  Good question, I think you could yield every few descriptions. Even experiment a little to see what your hardware can handle.
  
  Reply
Vineeth February 14, 2018 at 5:34 pm #

Hey Jason Brownlee, I used this progressive Loading with this https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/#comment-429456 tutorial.

and i’m getting this error. Can you please tell me how to define model for this particular generator ?

ValueError: Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (13, 224, 224, 3)

I’m new to machine learning, Thanks for your wonderful tutorial !

Reply
- Vineeth February 14, 2018 at 6:29 pm #
  
  I’ve managed to fix that one by adding inputs1 = Input(shape=(224, 224, 3)) and now have different error. Please help
  
  ValueError: Error when checking target: expected dense_3 to have 4 dimensions, but got array with shape (13, 4485)
  
  Reply
Vineeth February 14, 2018 at 8:50 pm #

Please help on the model part. I am unable to run this. And I don’t yet have the understanding required to calculate the numbers myself

Reply
- Srinath Hanumantha Rao March 20, 2018 at 5:52 pm #
  
  were you able to solve the issue? I am stuck with the same error
  
  Reply
Vishnu February 15, 2018 at 10:44 pm #

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Error in downloading VGG16 Model.

Can You please help me to fix it out..?

Reply
- Jason Brownlee February 16, 2018 at 8:34 am #
  
  Sorry to hear that, sounds like an internet connection issue. Perhaps try again?
  
  Reply
Akash March 22, 2018 at 12:37 am #

Can anyone show me how to compile the VGG 16 model for the progressive loading in this example?
Thanx in advance.

Reply
Kaustub March 22, 2018 at 5:27 pm #

Please help me to define the model ,i have used the data generator which is working fine but having trouble defining the model

Reply
- Jason Brownlee March 23, 2018 at 6:03 am #
  
  Perhaps you can summarize your problem in a few lines?
  
  Reply
Kaustub March 23, 2018 at 4:19 pm #

I need a code for define model which is used before model fitting in the code:
# define model
# …
# fit model
model.fit_generator(data_generator(descriptions, tokenizer, max_length), steps_per_epoch=70000, …)

Reply
- Akash March 23, 2018 at 6:34 pm #
  
  Same here jason, i have been going over your ebooks to find some solution but getting no where…could u please give the code to define the model used in the progressive loading example such that we can use it with this :
  
  model.fit_generator(data_generator(descriptions, tokenizer, max_length), steps_per_epoch=70000, …)
  
  Reply
  - Harsha April 1, 2018 at 3:16 pm #
    
    same problem here please help
    
    Reply
- Ishani March 31, 2018 at 2:25 am #
  
  Have you figured it out? If yes, can you please explain?
  
  Reply
Akash March 23, 2018 at 6:34 pm #

Thanx in advance.

Reply
Harsha April 1, 2018 at 2:40 pm #

After progressive loading how to evaluate the model and how to generate captions for new images.

Reply
- Jason Brownlee April 2, 2018 at 5:18 am #
  
  See this post:
  https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
  
  Reply
Divyanshu Kapoor April 1, 2018 at 4:10 pm #

Hello Jason,
I have one question regarding your discussion. You said that steps_per_epoch will be 10 * training data size, i.e. 70,000, so what will happen if I set steps_per_epoch to 70 instead of 70,000?
Does increasing the number of steps_per_epoch result in a better model?

Reply
- Jason Brownlee April 2, 2018 at 5:21 am #
  
  Slower training. Perhaps worse model skill given the large increase in weight update frequency.
  
  Reply
Moha Ali June 26, 2018 at 6:34 am #

Would you say that the Whole Description Sequence Model and Word-By-Word model are RNN based?

Reply
- Jason Brownlee June 26, 2018 at 6:44 am #
  
  Sure.
  
  Reply
Saurav October 29, 2018 at 4:38 am #

Hello Jason ,
I am trying to run the code for extracting features from the photos in the flickr dataset, provided by you , but it showing following error:
‘AttributeError: ‘InputLayer’ object has no attribute ‘outbound_nodes’

Reply
- Jason Brownlee October 29, 2018 at 6:02 am #
  
  I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Ajay Dhamija November 16, 2018 at 6:43 am #

have you written tutorial on VQA. Can you suggest any python source where we can learn this.

Reply
- Jason Brownlee November 16, 2018 at 1:55 pm #
  
  What is VQA?
  
  Reply
Chen Mei November 19, 2018 at 1:30 pm #

anyone know how to solve this error ?

ValueError: Error when checking input: expected input_1 to have 2 dimensions, but got array with shape (61317, 7, 7, 512)

Reply
- Jason Brownlee November 19, 2018 at 2:19 pm #
  
  I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
ABHINAY RK January 6, 2019 at 2:09 pm #

Can this code be used to prepare the image data for keras when ever I am using transfer learning?

Reply
- Jason Brownlee January 7, 2019 at 6:26 am #
  
  Yes, somewhat.
  
  Reply
khattak February 13, 2019 at 12:06 am #

Respected Sir,

I am facing the following error:

python3.7/site-packages/keras/engine/training_utils.py”, line 102, in standardize_input_data
str(len(data)) + ‘ arrays: ‘ + str(data)[:200] + ‘…’)

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 2 arrays: [array([[[[ 66.061 , 106.221 , 112.32 ],
[ 63.060997 , 97.221 , 111.32 ],
[ 57.060997 , 96.221 , 105.32 ],
…,
[ 43.060997 , 92.221 ,…

Kindly guide me…

Reply
- Jason Brownlee February 13, 2019 at 8:00 am #
  
  I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Bon May 28, 2019 at 8:58 pm #

Hi Jason, I have collected the Flicker 8k dataset and have done some translation to local languages, but now, I want to expand my dataset. Is there any similarity between the Flicker 8k and Flicker 30k dataset like 8k is a subset of 30k. As, it can be seen the file naming in 8k and 30k are different. Do you have any idea regarding that?

Reply
- Jason Brownlee May 29, 2019 at 8:41 am #
  
  Sorry. I don’t know.
  
  Reply
mostafa November 3, 2020 at 6:12 am #

Hi Jason, I want to do image captioning for clothes and need a dataset for it.
If you have a dataset for this, please share it, or I would be glad if you could help me create a captioned dataset for it.
Thanks

Reply
- Jason Brownlee November 3, 2020 at 6:56 am #
  
  This may help:
  https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___
  
  Reply
Kiran March 3, 2021 at 1:02 am #

should i insert the flicker8k dataset into jupyter notebook??

Reply
- Jason Brownlee March 3, 2021 at 5:37 am #
  
  I recommend not using a notebook:
  https://machinelearningmastery.com/faq/single-faq/why-dont-use-or-recommend-notebooks
  
  Reply
Azaz Butt June 5, 2021 at 9:38 pm #

Hi Jason.

I’m trying to fit the model using data generator, but getting this error:

ValueError: in user code:

/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:830 train_function *
return step_function(self, iterator)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:813 run_step *
outputs = model.train_step(data)
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:770 train_step *
y_pred = self(x, training=True)
/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py:989 __call__ *
input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py:197 assert_input_compatibility *
raise ValueError(‘Layer ‘ + layer_name + ‘ expects ‘ +

ValueError: Layer model_15 expects 2 input(s), but it received 3 input tensors. Inputs received: [, , ]

Can you please help me in this regard?

Reply
- Jason Brownlee June 6, 2021 at 5:50 am #
  
  Sorry to hear that, these tips may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Pranav March 2, 2022 at 12:50 am #

how can we evaluate the imagecaption model in one go…

i.e All metric score calculated in one go

Reply
- James Carmichael March 2, 2022 at 12:26 pm #
  
  Hi Pranav…The following may be of interest to you:
  
  https://machinelearningmastery.com/how-to-evaluate-pixel-scaling-methods-for-image-classification/
  
  Reply
