How to Prepare a French-to-English Dataset for Machine Translation

Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.

Neural machine translation systems, such as encoder-decoder recurrent neural networks, are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on source and target language text.

Standard datasets are required to develop, explore, and become familiar with neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

  • The Europarl dataset is comprised of the proceedings of the European Parliament in 11 languages.
  • How to load and clean the parallel French and English transcripts so they are ready for modeling in a neural machine translation system.
  • How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.

Let’s get started.


Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Europarl Machine Translation Dataset
  2. Download French-English Dataset
  3. Load Dataset
  4. Clean Dataset
  5. Reduce Vocabulary

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:


Europarl Machine Translation Dataset

Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.

It is comprised of the proceedings of the European Parliament; hence the name “Europarl,” a contraction of “European Parliament.”

The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.

It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union

Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was led by Philipp Koehn, author of the book “Statistical Machine Translation.”

The dataset was made available for free to researchers on the website “European Parliament Proceedings Parallel Corpus 1996-2011,” and often appears as a part of machine translation challenges, such as the Machine Translation task in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprised of data from 1996 to 2011.

Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

  • Sentences: 2,007,723
  • French words: 51,388,643
  • English words: 50,196,035

You can download the dataset from here:

Once downloaded, you should have the file “fr-en.tgz” in your current working directory.

You can unzip this archive file using the tar command, as follows:
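Assuming a Unix-like shell, one typical invocation is:

# extract the archive into the current working directory
tar zxvf fr-en.tgz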

You will now have two files, as follows:

  • English: europarl-v7.fr-en.en (288M)
  • French: europarl-v7.fr-en.fr (331M)

Below is a sample of the English file.

Below is a sample of the French file.

Load Dataset

Let’s start off by loading the data files.

We can load each file as a string. Because the files contain unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8, which will easily handle the unicode characters in both files.

The function below, named load_doc(), will load a given file and return it as a blob of text.
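A minimal sketch, assuming read-only text mode and the UTF-8 encoding discussed above:

# load doc into memory as a blob of text
def load_doc(filename):
    # open the file as read-only text with UTF-8 encoding
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text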

Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file by new line characters. The function to_sentences() below will split a loaded document.
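A minimal sketch, splitting on newline characters and stripping surrounding whitespace:

# split a loaded document into sentences (one per line)
def to_sentences(doc):
    return doc.strip().split('\n')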

When preparing our model later, we will need to know the length of sentences in the dataset. We can write a short function to calculate the lengths of the shortest and longest sentences.
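One possible helper, here called sentence_lengths() (the name is illustrative), measures length in space-separated tokens:

# calculate the shortest and longest sentence lengths (in words)
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)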

We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.
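A sketch of that complete example, combining the functions above and using the file names from the download step:

# load and summarize the English and French data files

# load doc into memory as a blob of text
def load_doc(filename):
    file = open(filename, mode='rt', encoding='utf-8')
    text = file.read()
    file.close()
    return text

# split a loaded document into sentences (one per line)
def to_sentences(doc):
    return doc.strip().split('\n')

# calculate the shortest and longest sentence lengths (in words)
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# summarize the English data
doc = load_doc('europarl-v7.fr-en.en')
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# summarize the French data
doc = load_doc('europarl-v7.fr-en.fr')
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))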

Running the example summarizes the number of lines or sentences in each file and the length of the longest and shortest lines in each file.

Importantly, we can see that the number of lines (2,007,723) matches the expectation.

Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Looking at some samples of text, some minimal text cleaning may include:

  • Tokenizing text by white space.
  • Normalizing case to lowercase.
  • Removing punctuation from each word.
  • Removing non-printable characters.
  • Converting French characters to Latin characters.
  • Removing words that contain non-alphabetic characters.

These are just some basic operations as a starting point; you may know of or require more elaborate data cleaning operations.

The function clean_lines() below implements these cleaning operations. Some notes:

  • We use Python’s unicodedata API to normalize unicode characters, which converts accented French characters to their Latin (ASCII) equivalents.
  • We use an inverse regex match to retain only those characters in words that are printable.
  • We use a translation table to remove all punctuation characters from each token.
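A sketch of clean_lines() along these lines is shown below; the specific normalization and regex calls are one reasonable way to perform the operations listed above:

import re
import string
import unicodedata

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # regex for matching non-printable characters
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters (maps accented French characters to ASCII)
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # normalize case to lowercase
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable characters from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens that are not purely alphabetic
        line = [word for word in line if word.isalpha()]
        # store the cleaned tokens as a single string
        cleaned.append(' '.join(line))
    return cleaned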

Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later and in the future.
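For example, a small helper using the pickle dump() function (a minimal sketch):

from pickle import dump

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)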

Reusing the loading and splitting functions developed in the previous sections, the complete example is listed below.
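A sketch, with the cleaned output saved to english.pkl and french.pkl:

import re
import string
import unicodedata
from pickle import dump

# load doc into memory as a blob of text
def load_doc(filename):
    file = open(filename, mode='rt', encoding='utf-8')
    text = file.read()
    file.close()
    return text

# split a loaded document into sentences (one per line)
def to_sentences(doc):
    return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters, tokenize, lowercase, strip punctuation
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        line = line.split()
        line = [word.lower() for word in line]
        line = [word.translate(table) for word in line]
        line = [re_print.sub('', w) for w in line]
        line = [word for word in line if word.isalpha()]
        cleaned.append(' '.join(line))
    return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# prepare and spot check the English data
doc = load_doc('europarl-v7.fr-en.en')
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
for i in range(10):
    print(sentences[i])

# prepare and spot check the French data
doc = load_doc('europarl-v7.fr-en.fr')
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
for i in range(10):
    print(sentences[i])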

After running, the clean sentences are saved in english.pkl and french.pkl files respectively.

As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.

English:

French:

My reading of French is very limited, but at least as far as the English is concerned, further improvements could be made, such as dropping or concatenating the hanging ‘s’ characters left over from plurals and possessives.

Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task is proportional to the size of the vocabularies, which in turn impacts model training time and the size of a dataset required to make the model viable.

In this section, we will reduce the vocabulary of both the English and French text and mark all out of vocabulary (OOV) words with a special token.

We can start by loading the pickled clean lines saved from the previous section. The load_clean_sentences() function below will load and return a list for a given filename.
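A minimal sketch using the pickle load() function:

from pickle import load

# load a clean dataset from file
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))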

Next, we can count the occurrence of each word in the dataset. For this, we can use a Counter object, which is a Python dictionary keyed on words that updates a count each time a new occurrence of a word is added.

The to_vocab() function below creates a vocabulary for a given list of sentences.
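A minimal sketch:

from collections import Counter

# create a frequency table (vocabulary) for all words in a list of sentences
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab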

We can then process the created vocabulary and remove all words from the Counter that have an occurrence below a specific threshold.

The trim_vocab() function below does this and accepts a minimum occurrence count as a parameter and returns an updated vocabulary.
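A minimal sketch, returning the retained words as a set for fast membership tests:

# remove all words with an occurrence below a minimum count
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)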

Finally, we can update the sentences, removing all words not in the trimmed vocabulary and marking their removal with a special token: in this case, the string “unk”.

The update_dataset() function below performs this operation and returns a list of updated lines that can then be saved to a new file.
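A minimal sketch:

# replace all words not in the vocabulary with the token "unk"
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_lines.append(' '.join(new_tokens))
    return new_lines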

We can tie all of this together and reduce the vocabulary for both the English and French dataset and save the results to new data files.

We will use a minimum occurrence count of 5, but you are free to explore other minimum occurrence counts suitable for your application.

The complete code example is listed below.
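A sketch of that complete example follows; the reduce_and_save() helper is illustrative (it is not named above) and simply applies the same steps to each language:

from pickle import load, dump
from collections import Counter

# load a clean dataset from file
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# create a frequency table (vocabulary) for all words in a list of sentences
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab

# remove all words with an occurrence below a minimum count
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)

# replace all words not in the vocabulary with the token "unk"
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_lines.append(' '.join(new_tokens))
    return new_lines

# illustrative helper: trim the vocabulary of a dataset and save the result
def reduce_and_save(in_filename, out_filename, min_occurrence=5):
    lines = load_clean_sentences(in_filename)
    vocab = to_vocab(lines)
    print('Vocabulary: %d' % len(vocab))
    vocab = trim_vocab(vocab, min_occurrence)
    print('New Vocabulary: %d' % len(vocab))
    lines = update_dataset(lines, vocab)
    save_clean_sentences(lines, out_filename)
    # spot check a few updated examples
    for i in range(10):
        print(lines[i])

# English
reduce_and_save('english.pkl', 'english_vocab.pkl')
# French
reduce_and_save('french.pkl', 'french_vocab.pkl')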

First, the size of the English vocabulary is reported, followed by the updated size. The updated dataset is saved to the file ‘english_vocab.pkl’ and a spot check of some updated examples, with out-of-vocabulary words replaced with “unk”, is printed.

We can see that the size of the vocabulary was shrunk by about half to a little over 40,000 words.

The same procedure is then performed on the French dataset, saving the result to the file ‘french_vocab.pkl‘.

We see a similar shrinking of the size of the French vocabulary.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data ready for modeling.

Specifically, you learned:

  • The Europarl dataset is comprised of the proceedings of the European Parliament in 11 languages.
  • How to load and clean the parallel French and English transcripts so they are ready for modeling in a neural machine translation system.
  • How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




24 Responses to How to Prepare a French-to-English Dataset for Machine Translation

  1. Gerrit Govaerts January 8, 2018 at 6:54 pm #

    A bit off topic, but very sharp observations about the remarkable success of recurrent and convolutional neural nets and why basic multi-layer perceptrons are probably not worth the effort: http://www.stochasticlifestyle.com/algorithm-efficiency-comes-problem-information/

  2. Klaas January 9, 2018 at 6:20 am #

    This is again outstanding work, Jason. Thanks so much for sharing. This is really, really helpful, especially for people like me who lack the scientific/mathematical background but are very interested in learning this nonetheless.
    Highly appreciate your work!
    Best regards

  3. Vidyush Bakshi January 9, 2018 at 9:38 pm #

    Great work again, really good explanation!!

  4. Canbey Bilgili January 13, 2018 at 1:25 am #

    Great article. It is a good source for preparing data. Thank you!

  5. LeeX January 22, 2018 at 2:28 am #

    Researchers in China really appreciate your tutorials!

  6. Nixon February 7, 2018 at 4:13 am #

    Hi brother, I am a new learner. What is the easiest way to study machine learning? Please help me.

  7. mzeid February 26, 2018 at 11:48 am #

    Hi Jason,

    This is a wonderful article indeed. I am trying to follow your guide on English>Arabic data, but the function ‘clean_lines(lines)’, when used with Arabic text, doesn’t yield any results. Any idea how to fix this for Arabic?

    Thanks in advance!

    • Jason Brownlee February 26, 2018 at 2:55 pm #

      Sorry, I have not worked with Arabic. Perhaps the function needs to be updated to support unicode chars?

  8. machine_translator April 6, 2018 at 9:19 pm #

    Thanks a lot for this very clear tutorial on how to prepare data for machine translation. What would be the next steps? Are similar tutorials available for those steps?

  9. Zayed April 11, 2018 at 8:45 am #

    Great and useful tutorial.

    I would like to save the files to plain text files ‘.txt’ in UTF-8 format and I don’t need pickle files.

    What do I need to change in the code above to make it output text files?

    • Jason Brownlee April 11, 2018 at 4:15 pm #

      Perhaps you could save the vocab with one line per word.

      You could save the translations one per line.

      To do this, you could write a function to save a list to an ASCII file using the standard Python API and call it instead of the pickle function.

      • Zayed April 12, 2018 at 2:53 am #

        Thanks Jason for your reply. I don’t need even the vocab. I am stuck at this function.

        # save a list of clean sentences to file
        def save_clean_sentences(sentences, filename):
            dump(sentences, open(filename, 'wb'))
            print('Saved: %s' % filename)

        I tried this and I can see the spot check result (10 lines of each language) in PyCharm, but nothing is written to the files.

        def save_clean_sentences(sentences, filename):
            f = open(filename, 'r+')
            for line in f:
                f.write(line[i], 'r+')
                f.write('\n')

        What am I missing?

        Thanks again for taking the time to support me.

  10. Zayed April 14, 2018 at 3:38 am #

    I have a question about removing punctuation from data. In your example above, you see sentences like this:

    “please rise then for this minute s silence”
    “the house rose and observed a minute s silence”

    As you can see, the apostrophe is removed from the sentences. So, does this mean that if I try to translate the same sentence, but with the apostrophe, “please rise then for this minute’s silence”, the neural decoder won’t be able to pick the correct French translation, or that the translation would be different since the source is now a bit different?

    Would the translation be different, if the same source sentence has a period at the end or starts with a capital letter? For example:

    Please rise then for this minute’s silence
    Please rise then for this minute’s silence.

    Is removing punctuation from training data the standard? Does it improve the overall quality? Any pointers please!

    • Jason Brownlee April 14, 2018 at 6:50 am #

      I removed it (or it was removed from the training data prior, I don’t recall) to simplify the problem.

      I would recommend adding it back in (not stripping it from the training data if it is present, or getting data with punctuation) to learn the translation with punctuation.

      It is standard when focusing on the translation part, but not for a working real-world model.

      Alternately, you could develop a model to add punctuation back in.

      • Zayed April 14, 2018 at 7:17 am #

        Thanks Jason! This makes sense.

        I have another question about lower-casing. If you lower-case all training data, would the neural decoder be able to capitalize the beginning of the target sentence or keep unknown words that were not seen in the training data as is? Or is it going to lower-case all words during decoding/translation? Say for example we have this sentence:

        IBM is providing AI services.

        Would the neural decoder be able to render IBM and AI as is if it was trained on lower-cased data only?

        Thanks again for your support, and I hope you don’t mind my frequent questions. Please also let me know which one of your books covers neural machine translation in detail. I am mainly interested in creating neural machine translation systems and a neural spell-checker. Do you cover neural-based spell-checking in any of your books?

        Thanks again!

        • Jason Brownlee April 15, 2018 at 6:17 am #

          If all training data is lower case, then the model only knows about lower case.

          If case is important, you can train with case preserved or train a model to add case to lower case strings, or other clever ideas…
