How to Prepare a French-to-English Dataset for Machine Translation

By Jason Brownlee on April 30, 2020 in Deep Learning for Natural Language Processing 56

Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.

Neural machine translation systems such as encoder-decoder recurrent neural networks are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on source and target language.

Standard datasets are required to develop, explore, and familiarize yourself with how to develop neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

The Europarl dataset comprised of the proceedings from the European Parliament in a host of 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

How to Prepare a French-to-English Dataset for Machine Translation
Photo by Giuseppe Milo, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

Europarl Machine Translation Dataset
Download French-English Dataset
Load Dataset
Clean Dataset
Reduce Vocabulary

Python Environment

This tutorial assumes you have a Python SciPy environment installed with Python 3 installed.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Need help with Deep Learning for Text Data?

Take my free 7-day email crash course now (with code).

Click to sign-up and also get a free PDF Ebook version of the course.

Europarl Machine Translation Dataset

The Europarl is a standard dataset used for statistical machine translation, and more recently, neural machine translation.

It is comprised of the proceedings of the European Parliament, hence the name of the dataset as the contraction Europarl.

The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.

It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union

— Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was lead by Philipp Koehn, author of the book “Statistical Machine Translation.”

The dataset was made available for free to researchers on the website “European Parliament Proceedings Parallel Corpus 1996-2011,” and often appears as a part of machine translation challenges, such as the Machine Translation task in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprised of data from 1996 to 2011.

Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

Sentences: 2,007,723
French words: 51,388,643
English words: 50,196,035

You can download the dataset from here:

Parallel corpus French-English (194 Megabytes)

Once downloaded, you should have the file “fr-en.tgz” in your current working directory.

You can unzip this archive file using the tar command, as follows:

tar zxvf fr-en.tgz

1	tar zxvf fr-en.tgz

You will now have two files, as follows:

English: europarl-v7.fr-en.en (288M)
French: europarl-v7.fr-en.fr (331M)

Below is a sample of the English file.

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

Resumption of the session

I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

You have requested a debate on this subject in the course of the next few days, during this part-session.

In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

Below is a sample of the French file.

Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.

Reprise de la session

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.

Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.

Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.

En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.

Load Dataset

Let’s start off by loading the data files.

We can load each file as a string. Because the files contain unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8 that will easily handle the unicode characters in both files.

The function below, named load_doc(), will load a given file and return it as a blob of text.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load doc into memory

def load_doc(filename):

# open the file as read only

file = open(filename, mode='rt', encoding='utf-8')

# read all text

text = file.read()

# close the file

file.close()

return text

Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file by new line characters. The function to_sentences() below will split a loaded document.

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# split a loaded document into sentences

def to_sentences(doc):

return doc.strip().split('\n')

When preparing our model later, we will need to know the length of sentences in the dataset. We can write a short function to calculate the shortest and longest sentences.

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)

# shortest and longest sentence lengths

def sentence_lengths(sentences):

lengths = [len(s.split()) for s in sentences]

return min(lengths), max(lengths)

We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load doc into memory

def load_doc(filename):

# open the file as read only

file = open(filename, mode='rt', encoding='utf-8')

# read all text

text = file.read()

# close the file

file.close()

return text

# split a loaded document into sentences

def to_sentences(doc):

return doc.strip().split('\n')

# shortest and longest sentence lengths

def sentence_lengths(sentences):

lengths = [len(s.split()) for s in sentences]

return min(lengths), max(lengths)

# load English data

filename = 'europarl-v7.fr-en.en'

doc = load_doc(filename)

sentences = to_sentences(doc)

minlen, maxlen = sentence_lengths(sentences)

print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load French data

filename = 'europarl-v7.fr-en.fr'

doc = load_doc(filename)

sentences = to_sentences(doc)

minlen, maxlen = sentence_lengths(sentences)

print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

Running the example summarizes the number of lines or sentences in each file and the length of the longest and shortest lines in each file.

English data: sentences=2007723, min=0, max=668
French data: sentences=2007723, min=0, max=693

1 2	English data: sentences=2007723, min=0, max=668 French data: sentences=2007723, min=0, max=693

Importantly, we can see that the number of lines 2,007,723 matches the expectation.

Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Looking at some samples of text, some minimal text cleaning may include:

Tokenizing text by white space.
Normalizing case to lowercase.
Removing punctuation from each word.
Removing non-printable characters.
Converting French characters to Latin characters.
Removing words that contain non-alphabetic characters.

These are just some basic operations as a starting point; you may know of or require more elaborate data cleaning operations.

The function clean_lines() below implements these cleaning operations. Some notes:

We use the unicode API to normalize unicode characters, which converts French characters to Latin equivalents.
We use an inverse regex match to retain only those characters in words that are printable.
We use a translation table to translate characters as-is, but exclude all punctuation characters.

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars form each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned

# clean a list of lines

def clean_lines(lines):

cleaned = list()

# prepare regex for char filtering

re_print = re.compile('[^%s]' % re.escape(string.printable))

# prepare translation table for removing punctuation

table = str.maketrans('', '', string.punctuation)

for line in lines:

# normalize unicode characters

line = normalize('NFD', line).encode('ascii', 'ignore')

line = line.decode('UTF-8')

# tokenize on white space

line = line.split()

# convert to lower case

line = [word.lower() for word in line]

# remove punctuation from each token

line = [word.translate(table) for word in line]

# remove non-printable chars form each token

line = [re_print.sub('', w) for w in line]

# remove tokens with numbers in them

line = [word for word in line if word.isalpha()]

# store as string

cleaned.append(' '.join(line))

return cleaned

Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later and in the future.

Reusing the loading and splitting functions developed in the previous sections, the complete example is listed below.

import string
import re
from pickle import dump
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars form each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
# spot check
for i in range(10):
	print(sentences[i])

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
# spot check
for i in range(10):
	print(sentences[i])

import string

import re

from pickle import dump

from unicodedata import normalize

# load doc into memory

def load_doc(filename):

# open the file as read only

file = open(filename, mode='rt', encoding='utf-8')

# read all text

text = file.read()

# close the file

file.close()

return text

# split a loaded document into sentences

def to_sentences(doc):

return doc.strip().split('\n')

# clean a list of lines

def clean_lines(lines):

cleaned = list()

# prepare regex for char filtering

re_print = re.compile('[^%s]' % re.escape(string.printable))

# prepare translation table for removing punctuation

table = str.maketrans('', '', string.punctuation)

for line in lines:

# normalize unicode characters

line = normalize('NFD', line).encode('ascii', 'ignore')

line = line.decode('UTF-8')

# tokenize on white space

line = line.split()

# convert to lower case

line = [word.lower() for word in line]

# remove punctuation from each token

line = [word.translate(table) for word in line]

# remove non-printable chars form each token

line = [re_print.sub('', w) for w in line]

# remove tokens with numbers in them

line = [word for word in line if word.isalpha()]

# store as string

cleaned.append(' '.join(line))

return cleaned

# save a list of clean sentences to file

def save_clean_sentences(sentences, filename):

dump(sentences, open(filename, 'wb'))

print('Saved: %s' % filename)

# load English data

filename = 'europarl-v7.fr-en.en'

doc = load_doc(filename)

sentences = to_sentences(doc)

sentences = clean_lines(sentences)

save_clean_sentences(sentences, 'english.pkl')

# spot check

for i in range(10):

print(sentences[i])

# load French data

filename = 'europarl-v7.fr-en.fr'

doc = load_doc(filename)

sentences = to_sentences(doc)

sentences = clean_lines(sentences)

save_clean_sentences(sentences, 'french.pkl')

# spot check

for i in range(10):

print(sentences[i])

After running, the clean sentences are saved in english.pkl and french.pkl files respectively.

As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.

English:

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago

resumption of the session

i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period

although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful

you have requested a debate on this subject in the course of the next few days during this partsession

in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union

please rise then for this minute s silence

the house rose and observed a minute s silence

madam president on a point of order

you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka

one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago

French:

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine

reprise de la session

je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances

comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles

vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session

en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches

je vous invite a vous lever pour cette minute de silence

le parlement debout observe une minute de silence

madame la presidente cest une motion de procedure

vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka

lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine

My reading of French is very limited, but at least as the English is concerned, further improves could be made, such as dropping or concatenating hanging ‘s‘ characters for plurals.

Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task is proportional to the size of the vocabularies, which in turn impacts model training time and the size of a dataset required to make the model viable.

In this section, we will reduce the vocabulary of both the English and French text and mark all out of vocabulary (OOV) words with a special token.

We can start by loading the pickled clean lines saved from the previous section. The load_clean_sentences() function below will load and return a list for a given filename.

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# load a clean dataset

def load_clean_sentences(filename):

return load(open(filename, 'rb'))

Next, we can count the occurrence of each word in the dataset. For this we can use a Counter object, which is a Python dictionary keyed on words and updates a count each time a new occurrence of each word is added.

The to_vocab() function below creates a vocabulary for a given list of sentences.

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab

# create a frequency table for all words

def to_vocab(lines):

vocab = Counter()

for line in lines:

tokens = line.split()

vocab.update(tokens)

return vocab

We can then process the created vocabulary and remove all words from the Counter that have an occurrence below a specific threshold.

The trim_vocab() function below does this and accepts a minimum occurrence count as a parameter and returns an updated vocabulary.

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)

# remove all words with a frequency below a threshold

def trim_vocab(vocab, min_occurance):

tokens = [k for k,c in vocab.items() if c >= min_occurance]

return set(tokens)

Finally, we can update the sentences, remove all words not in the trimmed vocabulary and mark their removal with a special token, in this case, the string “unk“.

The update_dataset() function below performs this operation and returns a list of updated lines that can then be saved to a new file.

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines

# mark all OOV with "unk" for all lines

def update_dataset(lines, vocab):

new_lines = list()

for line in lines:

new_tokens = list()

for token in line.split():

if token in vocab:

new_tokens.append(token)

else:

new_tokens.append('unk')

new_line = ' '.join(new_tokens)

new_lines.append(new_line)

return new_lines

We can tie all of this together and reduce the vocabulary for both the English and French dataset and save the results to new data files.

We will use a min occurrence of 5, but you are free to explore other min occurrence counts suitable for your application.

The complete code example is listed below.

from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
	tokens = [k for k,c in vocab.items() if c >= min_occurance]
	return set(tokens)

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

from pickle import load

from pickle import dump

from collections import Counter

# load a clean dataset

def load_clean_sentences(filename):

return load(open(filename, 'rb'))

# save a list of clean sentences to file

def save_clean_sentences(sentences, filename):

dump(sentences, open(filename, 'wb'))

print('Saved: %s' % filename)

# create a frequency table for all words

def to_vocab(lines):

vocab = Counter()

for line in lines:

tokens = line.split()

vocab.update(tokens)

return vocab

# remove all words with a frequency below a threshold

def trim_vocab(vocab, min_occurance):

tokens = [k for k,c in vocab.items() if c >= min_occurance]

return set(tokens)

# mark all OOV with "unk" for all lines

def update_dataset(lines, vocab):

new_lines = list()

for line in lines:

new_tokens = list()

for token in line.split():

if token in vocab:

new_tokens.append(token)

else:

new_tokens.append('unk')

new_line = ' '.join(new_tokens)

new_lines.append(new_line)

return new_lines

# load English dataset

filename = 'english.pkl'

lines = load_clean_sentences(filename)

# calculate vocabulary

vocab = to_vocab(lines)

print('English Vocabulary: %d' % len(vocab))

# reduce vocabulary

vocab = trim_vocab(vocab, 5)

print('New English Vocabulary: %d' % len(vocab))

# mark out of vocabulary words

lines = update_dataset(lines, vocab)

# save updated dataset

filename = 'english_vocab.pkl'

save_clean_sentences(lines, filename)

# spot check

for i in range(10):

print(lines[i])

# load French dataset

filename = 'french.pkl'

lines = load_clean_sentences(filename)

# calculate vocabulary

vocab = to_vocab(lines)

print('French Vocabulary: %d' % len(vocab))

# reduce vocabulary

vocab = trim_vocab(vocab, 5)

print('New French Vocabulary: %d' % len(vocab))

# mark out of vocabulary words

lines = update_dataset(lines, vocab)

# save updated dataset

filename = 'french_vocab.pkl'

save_clean_sentences(lines, filename)

# spot check

for i in range(10):

print(lines[i])

First, the size of the English vocabulary is reported followed by the updated size. The updated dataset is saved to the file ‘english_vocab.pkl‘ and a spot check of some updated examples with out of vocabulary words replace with “unk” are printed.

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl

English Vocabulary: 105357

New English Vocabulary: 41746

Saved: english_vocab.pkl

We can see that the size of the vocabulary was shrunk by about half to a little over 40,000 words.

resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago

resumption of the session

i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period

although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful

you have requested a debate on this subject in the course of the next few days during this partsession

please rise then for this minute s silence

the house rose and observed a minute s silence

madam president on a point of order

you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka

one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago

The same procedure is then performed on the French dataset, saving the result to the file ‘french_vocab.pkl‘.

French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl

French Vocabulary: 141642

New French Vocabulary: 58800

Saved: french_vocab.pkl

We see a similar shrinking of the size of the French vocabulary.

reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine

reprise de la session

je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances

vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session

je vous invite a vous lever pour cette minute de silence

le parlement debout observe une minute de silence

madame la presidente cest une motion de procedure

vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka

lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine

Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data ready for modeling.

Specifically, you learned:

The Europarl dataset comprised of the proceedings from the European Parliament in a host of 11 languages.
How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

56 Responses to How to Prepare a French-to-English Dataset for Machine Translation

Gerrit Govaerts January 8, 2018 at 6:54 pm #

A bit off topic , but very sharp observations about the remarkable success of recurrent and convolutional neural nets and why basic multi layer perceptrons are probably not worth the effort : http://www.stochasticlifestyle.com/algorithm-efficiency-comes-problem-information/

Reply
- Jason Brownlee January 9, 2018 at 5:26 am #
  
  Thanks for sharing.
  
  Reply
Klaas January 9, 2018 at 6:20 am #

This is again outstanding work Jason. Thanks so much for sharing. This is really really helpful. Especially for people like me who lack the scientific / mathematic background but are very interested in learning this none the less.
Highly appreciate your work!
Best regards

Reply
- Jason Brownlee January 9, 2018 at 3:17 pm #
  
  Thanks, I’m glad it helped.
  
  Reply
  - riyaj atar December 10, 2020 at 3:27 am #
    
    thanks Jason . once again its great tutorial.
    i have one question regarding creating dataset in tfds format for parallel corpus for machine translation task.
    can you give few steps how to create dataset in that format for our own dataset ?
    i appreciate your time and efforts.
    thanks .stay healthy.
    
    Reply
    - Jason Brownlee December 10, 2020 at 6:30 am #
      
      Thanks.
      
      Sorry, I don’t undertand your question. Can you please elaborate the problem you’re having.
      
      Reply
Vidyush Bakshi January 9, 2018 at 9:38 pm #

Great work again , really good explaination!!

Reply
- Jason Brownlee January 10, 2018 at 5:25 am #
  
  Thanks.
  
  Reply
Canbey Bilgili January 13, 2018 at 1:25 am #

Great article. It is a good source for preparing data. Thank you!

Reply
- Jason Brownlee January 13, 2018 at 5:34 am #
  
  I’m glad to hear that.
  
  Reply
LeeX January 22, 2018 at 2:28 am #

China Researchers very appreciate your tutorials!

Reply
- Jason Brownlee January 22, 2018 at 4:46 am #
  
  Thanks!
  
  Reply
Nixon February 7, 2018 at 4:13 am #

Hi brother i am new learner how study machine learning to easy way please help me

Reply
- Jason Brownlee February 7, 2018 at 9:27 am #
  
  Start right here:
  https://machinelearningmastery.com/start-here/
  
  Reply
mzeid February 26, 2018 at 11:48 am #

Hi Jason,

This is a wonderful article indeed. I am trying to follow your guide on English>Arabic data, but the function ‘clean_lines(lines)’ when used with Arabic text, doesn’t yield any results. Any idea how to fix this in Arabic?

Thanks in advance!

Reply
- Jason Brownlee February 26, 2018 at 2:55 pm #
  
  Sorry, I have not worked with Arabic. Perhaps the function needs to be updated to support unicode chars?
  
  Reply
machine_translator April 6, 2018 at 9:19 pm #

Thanks a lot for this very clear tutorial on how to prepare data for machine translation. What would be the next steps? Are similar tutorials available for those steps?

Reply
- Jason Brownlee April 7, 2018 at 6:31 am #
  
  Yes, I cover working through a whole project in my book:
  https://machinelearningmastery.com/deep-learning-for-nlp/
  
  Reply
  - Rokaya May 16, 2022 at 4:37 pm #
    
    i am working on multilingual translation model can you help me even your book where i can buy it please!
    
    Reply
    - James Carmichael May 17, 2022 at 9:55 am #
      
      Hi Rokaya…You may find the following resource very helpful:
      
      https://machinelearningmastery.com/deep-learning-for-nlp/
      
      Reply
Zayed April 11, 2018 at 8:45 am #

Great and useful tutorial.

I would like to save the files to plain text files ‘.txt’ in UTF-8 format and I don’t need pickle files.

What do I need to change in the code above to make it output text files?

Reply
- Jason Brownlee April 11, 2018 at 4:15 pm #
  
  Perhaps you could save the vocab with one line per word.
  
  You could save the translations one per line.
  
  To do this, you could write a function to save a list to an ASCII file using the standard Python API and call it instead of the pickle function.
  
  Reply
  - Zayed April 12, 2018 at 2:53 am #
    
    Thanks Jason for your reply. I don’t need even the vocab. I am stuck at this function.
    
    # save a list of clean sentences to file
    def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, ‘wb’))
    print(‘Saved: %s’ % filename)
    
    I tried this and I can see the spot check result (10 lines of each language) in PyCharm, but nothing is written to the files.
    
    def save_clean_sentences(sentences, filename):
    f = open(filename, ‘r+’)
    for line in f:
    f.write(line[i], ‘r+’)
    f.write(‘\n’)
    
    What am I missing?
    
    Thanks again for taking the time to support me.
    
    Reply
    - Jason Brownlee April 12, 2018 at 8:49 am #
      
      Perhaps ensure to close the file after writing?
      
      Reply
Zayed April 14, 2018 at 3:38 am #

I have a question about removing punctuation from data. In your example above, you see sentences like this:

“please rise then for this minute s silence”
“the house rose and observed a minute s silence”

As you can see, apostrophe is removed from sentences. So, does this mean that I try to translate the same sentence, but with the apostrophe “please rise then for this minute’s silence”, the neural decoder won’t be able to pick the correct French translation or the translation would be different since now the source is a bit different.

Would the translation be different, if the same source sentence has a period at the end or starts with a capital letter? For example:

Please rise then for this minute’s silence
Please rise then for this minute’s silence.

Is removing punctuation from training data the standard? Does it improve the overall quality? Any pointers please!

Reply
- Jason Brownlee April 14, 2018 at 6:50 am #
  
  I removed it (or it was removed from the training data prior, I don’t recall) to simplify the problem.
  
  I would recommend adding it back in (not stripping it from the training data, if it is present or get data with punctuation) to learn the translation with punctuation.
  
  It is standard when focusing on the translation part, but not for a working real-world model.
  
  Alternately, you could develop a model to add punctuation back in.
  
  Reply
  - Zayed April 14, 2018 at 7:17 am #
    
    Thanks Jason! This makes sense.
    
    I have another question about lower-casing. If you lower-case all training data, would the neural decoder be able to capitalize the beginning of the target sentence or keep unknown words that were not seen in the training data as is? Or is it going to lower-case all words during decoding/translation? Say for example we have this sentence:
    
    IBM is providing AI services.
    
    Would the neural decoder be able to render IBM and AI as is if it was trained on lower-cased data only?
    
    Thanks again for your support and I hope you don’t mind my frequent questions. Please also let me know which one of your books cover neural machine translation in detail? I am mainly interested in creating neural machine translation systems and neural spell-checker. Do you cover neural-based spell-checking in any of your books?
    
    Thanks again!
    
    Reply
    - Jason Brownlee April 15, 2018 at 6:17 am #
      
      If all training data is lower case, then the model only knows about lower case.
      
      If case is important, you can train with case preserved or train a model to add case to lower case strings, or other clever ideas…
      
      Reply
Dominique Lahaix September 20, 2018 at 7:50 am #

Hi Jason – got a question maybe you can help?

we’ve build a system using ML that automatically categorize short documents. we’ve done it in English and we need to do the same for French. We did supervised learning using a very large corpus of manually annotated documents.

Unfortunately our French training set is way smaller … so I was wondering whether we could:

– translate the training set and use (complement) this a sa training set for French
– translate the model itself (I even don’t know whether this is an option)

Have you heard about people using translated training set to build model? does it work OK?
Thanks

Reply
- Jason Brownlee September 20, 2018 at 8:11 am #
  
  Sound like great ideas!
  
  Maybe generate new data to train with, as augmented versions of your existing documents.
  
  Also, when using smaller datasets, consider regularization methods to ensure you do not overfit the training data.
  
  Reply
simran December 26, 2018 at 3:44 pm #

greetings sir,
i am doing my Ph.D in Corpus linguistics using machine learning. i need help for developing a preprocessing algorithm for corpus before translation.

Reply
- Jason Brownlee December 27, 2018 at 5:39 am #
  
  It is not an algorithm, instead it is a sequence of preprocessing steps that are most appropriate for your specific dataset.
  
  Reply
Prashant Kumar Singh March 14, 2019 at 9:45 pm #

Hi Jason, Is it possible to translate language from English to other language i.e. French.

My project example is as below;

I’m working in existdb and generating the PDF files which is published globally on the webpage. BUT I want to change that PDF content language country wise. So, is it possible to do with your blog.

The changeling task is how to club python (currently in your blog) within existdb (Open source database). OR any other way to do this. Please help me to understand.

Thanks,
Prashant

Reply
- Jason Brownlee March 15, 2019 at 5:30 am #
  
  Tanks for the suggestion.
  
  Reply
  - Prashant Kumar Singh March 20, 2019 at 4:07 pm #
    
    Hi Jason,
    
    Can you please answer me on below point which will be helpful for me;
    
    Question: Is it possible to connect ML model to my webpage, which is based on exist (XML) database content? And please suggest me the steps to follow.
    
    Thanks,
    Prashant
    
    Reply
    - Jason Brownlee March 21, 2019 at 7:58 am #
      
      I don’t see why not.
      
      It sounds like an engineering question and will depend on the specifics of your production environment. I don’t have a worked example, sorry.
      
      Reply
anvesh June 10, 2019 at 7:27 pm #

can we use a english to french pretained model to train on my small different dataset then translate english to any other language

Reply
- Jason Brownlee June 11, 2019 at 7:52 am #
  
  It might help as a starting point, but further training will be required.
  
  Reply
Dani Gross June 15, 2019 at 6:44 pm #

Hi Jason,
thank you for your tutorials!

How do you handle sentence alignment with this corpus, given that it contains empty strings?

Reply
- Jason Brownlee June 16, 2019 at 7:12 am #
  
  Sorry, I don’t have a tutorial on sentence alignment, I cannot give you good off the cuff advice.
  
  Reply
Rishai August 27, 2019 at 9:07 am #

Hi Jason, thanks for all these tutorials. Would you have a tutorial on how to go the next step, of converting tokens/words to integer vectors, so that they can be passed into an Embedding layer?

Reply
- Jason Brownlee August 27, 2019 at 2:07 pm #
  
  Yes, a good place to get started would be here:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
Sreenivas Kashyap October 26, 2019 at 3:45 am #

Hi Jason,
Good Work , I’m Developing Translation Model From Kannada to English But The tokenizer doesn’t work while split the text.
OUTPUT IS LIKE THIS:

Saved: english-kannada.pkl
[tom woke up] => []
[give me half] => []
[we needed it] => []
[tom liked you] => []
[just go inside] => []
[do you remember] => []
[i just got back] => []
[see you at] => []
Can you suggest me in solving the above problem..

Reply
- Jason Brownlee October 26, 2019 at 4:41 am #
  
  Sorry to hear that, I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Nithin December 5, 2019 at 5:58 am #

Hello Jason,

Thanks for the tutorial and a good explanation.
I would greatly appreciate if you can clarify the following doubts:
1) How will tokenizing the less frequent words with ‘unk’ affect the model accuracy, because now the token ‘unk’ might be a significant proportion of the data.
2) Do you have any comments on using character level encoding vs work level encodings.
3) With a larger vocabulary size (around 0.1M), are there techniques and softwares to use the sparse representation of words, instead of one-hot encoding, to reduce the memory requirements while training.

Thanks

Reply
- Jason Brownlee December 5, 2019 at 6:44 am #
  
  Remove those words from the vocab, then mark words not in the vocab as unk when pre-processing text for modeling.
  
  Modeling the world level is more efficient for now, as far as I know.
  
  100K vocab is modest. Don’t worry about it.
  
  Reply
  - Nithin December 7, 2019 at 4:55 am #
    
    Hello Jason,
    
    Thanks for the answers.
    We are working on using encoder decoder model on europarl. We are using a 2 GRU layers with 128 cells and a time distributed layer. The vocabulary for French after following the tutorial is around 50K (with 5 as threshold). We want to train this on a single GPU with 12 GB of GPU memory, but with batch sizes of 16 or 32, the GPU memory fills up and gives the memory full error.
    The most probable reason that we suspect for this is because of the one hot representation of each word as a vector of 50K dimension.
    Also the time distributed layer has shape (None, 528 (max size of French sentence), 50K (output vector size = vocabulary of French).
    We wanted to know if there is any way by which we can avoid this (for ex. like libSVM) or more efficient representations for RNN with large vocabulary size to help in training with large batch sizes.
    
    Thank you
    
    Reply
    - Jason Brownlee December 7, 2019 at 5:41 am #
      
      Perhaps use a generator to achieve progressive loading of batches?
      
      Reply
      - Nithin December 7, 2019 at 7:59 am #
        
        Yes, we are using generator. The module fails saying that it faced out of memory while creating a tensor of size [19136,58802] ([(max French sentence length)*batch_size, French vocabulary size]), for max_french_sentence_size around 600 and batch size 1024. Which seems correct, as this matrix would be around 8GB for 8byte float representation. So we were wondering if there is any way to solve this issue and what are the state-of-the-art methods to solve these issues.
        
        Thank you
      - Jason Brownlee December 8, 2019 at 6:03 am #
        
        Perhaps try using a small vocab?
        Perhaps try using a smaller batch size?
        Perhaps try using a smaller sentence length?
        Perhaps try training on a machine with more memory?
Jane January 16, 2020 at 1:28 pm #

Is this considered as a seq2seq model?

Reply
- Jason Brownlee January 16, 2020 at 1:34 pm #
  
  Yes.
  
  Reply
S.Gowri pooja May 17, 2020 at 11:09 am #

Hi,jason . Thanks for sharing this article. How this pre process model vary for different languages?

Reply
- Jason Brownlee May 18, 2020 at 6:07 am #
  
  Good question, I don’t know off hand.
  
  Reply
ARABA AMAN December 24, 2021 at 6:12 pm #

I working on machine translation between Amharic and Afaan Oromo
I preprare dataset on diffrent sheet with its corresponding target languages

so how can I train from two file name for neural machine translation?
that means how to feed this cleaned sentence to train model

Reply
- James Carmichael February 18, 2022 at 1:07 pm #
  
  Hi Araba…I would suggest a seq2seq model as discussed here:
  
  https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
  
  Reply

Navigation

How to Prepare a French-to-English Dataset for Machine Translation

Tutorial Overview

Python Environment

Need help with Deep Learning for Text Data?

Europarl Machine Translation Dataset

Download French-English Dataset

Load Dataset

Clean Dataset

Reduce Vocabulary

Further Reading

Summary

Develop Deep Learning models for Text Data Today!

Develop Your Own Text models in Minutes

Finally Bring Deep Learning to your Natural Language Processing Projects

More On This Topic

56 Responses to How to Prepare a French-to-English Dataset for Machine Translation

Leave a Reply Click here to cancel reply.