How to Develop a Neural Machine Translation System from Scratch

Develop a Deep Learning Model to Automatically
Translate from German to English in Python with Keras, Step-by-Step.

Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.

Neural machine translation is the use of deep neural networks for the problem of machine translation.

In this tutorial, you will discover how to develop a neural machine translation system for translating German phrases to English.

After completing this tutorial, you will know:

  • How to clean and prepare data ready to train a neural machine translation system.
  • How to develop an encoder-decoder model for machine translation.
  • How to use a trained model for inference on new input phrases and evaluate the model skill.

Let’s get started.

  • Update Apr/2019: Fixed bug in the calculation of BLEU score (Zhongpu Chen).
  • Update Oct/2020: Added direct link to original dataset.
How to Develop a Neural Machine Translation System in Keras

Photo by Björn Groß, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. German to English Translation Dataset
  2. Preparing the Text Data
  3. Train Neural Translation Model
  4. Evaluate Neural Translation Model

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have NumPy and Matplotlib installed.

If you need help with your environment, see this post:

A GPU is not required for this tutorial; nevertheless, you can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

Let’s dive in.

German to English Translation Dataset

In this tutorial, we will use a dataset of German to English terms used as the basis for flashcards for language learning.

The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. The dataset is comprised of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.

The page provides a list of many language pairs, and I encourage you to explore other languages:

Note: the original dataset has since changed and, if used directly, it will break this tutorial and result in an error:

As such you can download the original dataset in the correct format directly from here:

Download the dataset file to your current working directory.

You will have a file called deu.txt that contains 152,820 pairs of English to German phrases, one pair per line with a tab separating the two languages.

For example, the first 5 lines of the file look as follows:

We will frame the prediction problem as follows: given a sequence of words in German as input, translate or predict the sequence of words in English.

The model we will develop will be suitable for some beginner German phrases.

Preparing the Text Data

The next step is to prepare the text data ready for modeling.

If you are new to cleaning text data, see this post:

Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.

For example, here are some observations I note from reviewing the raw data:

  • There is punctuation.
  • The text contains uppercase and lowercase.
  • There are special characters in the German.
  • There are duplicate phrases in English with different translations in German.
  • The file is ordered by sentence length with very long sentences toward the end of the file.

Did you note anything else that could be important?
Let me know in the comments below.

A good text cleaning procedure may handle some or all of these observations.

Data preparation is divided into two subsections:

  1. Clean Text
  2. Split Text

1. Clean Text

First, we must load the data in a way that preserves the Unicode German characters. The function below called load_doc() will load the file as a blob of text.
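
A minimal sketch of load_doc() might look like this. It assumes the file is UTF-8 encoded, which is what keeps the German characters intact; the encoding is an assumption on my part rather than something the dataset page states.

```python
def load_doc(filename):
    """Load a text file into memory as a single string."""
    # open as text with an explicit encoding (UTF-8 is assumed here)
    with open(filename, mode='rt', encoding='utf-8') as file:
        return file.read()
```

Calling load_doc('deu.txt') would then return the whole file as one string, ready to be split.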

Each line contains a single pair of phrases, first English and then German, separated by a tab character.

We must split the loaded text by line and then by phrase. The function to_pairs() below will split the loaded text.
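
As a sketch, to_pairs() can split on newlines and then on tabs. Keeping only the first two fields of each line is my own assumption; it guards against files that carry extra columns, such as the attribution column a reader describes in the comments. The sample text is illustrative and not copied from the dataset.

```python
def to_pairs(doc):
    """Split loaded text into a list of [english, german] pairs."""
    lines = doc.strip().split('\n')
    # keep only the first two tab-separated fields on each line
    return [line.split('\t')[:2] for line in lines]

# illustrative input only, not taken from deu.txt
sample = 'Good morning.\tGuten Morgen.\nThank you.\tDanke.'
print(to_pairs(sample))
```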

We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:

  • Remove all non-printable characters.
  • Remove all punctuation characters.
  • Normalize all Unicode characters to ASCII (e.g. Latin characters).
  • Normalize the case to lowercase.
  • Remove any remaining tokens that are not alphabetic.

We will perform these operations on each phrase for each pair in the loaded dataset.

The clean_pairs() function below implements these operations.
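
One way to write clean_pairs() using only the standard library is sketched here. The regular expression matches the one a reader quotes in the comments; returning a plain list of lists (rather than an array) is an assumption made to keep the sketch dependency-free, and the raw pair at the end is illustrative.

```python
import re
import string
from unicodedata import normalize

def clean_pairs(pairs):
    """Clean every phrase in a list of [english, german] pairs."""
    # matches any character that is not in string.printable
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # translation table that deletes punctuation characters
    table = str.maketrans('', '', string.punctuation)
    cleaned = []
    for pair in pairs:
        clean_pair = []
        for line in pair:
            # decompose accented characters, then drop anything non-ASCII
            line = normalize('NFD', line).encode('ascii', 'ignore').decode('utf-8')
            # lowercase and tokenize on white space
            words = line.lower().split()
            # strip punctuation and non-printable characters from each token
            words = [w.translate(table) for w in words]
            words = [re_print.sub('', w) for w in words]
            # keep purely alphabetic tokens only
            words = [w for w in words if w.isalpha()]
            clean_pair.append(' '.join(words))
        cleaned.append(clean_pair)
    return cleaned

# illustrative raw pair
print(clean_pairs([['I wear glasses.', 'Ich bin Brillenträger.']]))
# [['i wear glasses', 'ich bin brillentrager']]
```

Note that normalizing to ASCII discards information for languages that rely on non-Latin or accented characters, so this step may need rethinking for other language pairs.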

Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file ready for use.

The function save_clean_data() uses the pickle API to save the list of clean text to file.
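
A minimal sketch of save_clean_data(), assuming the cleaned pairs are an ordinary picklable Python object:

```python
from pickle import dump

def save_clean_data(sentences, filename):
    """Pickle the cleaned phrase pairs to a file."""
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)
```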

Pulling all of this together, the complete example is listed below.

Running the example creates a new file in the current working directory with the cleaned text called english-german.pkl.

Some examples of the clean text are printed for us to evaluate at the end of the run to confirm that the clean operations were performed as expected.

2. Split Text

The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.

This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary.

Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.

You can explore developing a model on the fuller dataset as an extension; I would love to hear how you do.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will then take the first 9,000 of those as examples for training and the remaining 1,000 examples to test the fit model.
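
The reduction and split described above can be sketched with plain slicing. split_dataset() is a hypothetical helper name, and the placeholder pairs stand in for the cleaned data.

```python
def split_dataset(pairs, n_sentences=10000, n_train=9000):
    """Return (both, train, test) subsets of the cleaned pairs."""
    # keep the first n_sentences pairs (the shortest phrases in the file)
    both = pairs[:n_sentences]
    # first n_train pairs for training, the rest for testing
    return both, both[:n_train], both[n_train:]

# placeholder data standing in for the cleaned phrase pairs
pairs = [['en %d' % i, 'de %d' % i] for i in range(12000)]
both, train, test = split_dataset(pairs)
print(len(both), len(train), len(test))
```

Because the file is ordered by length, an unshuffled split puts the longest of these phrases in the test set; shuffling the 10,000 pairs before splitting is a variation worth considering.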

Below is the complete example of loading the clean data, splitting it, and saving the split portions of data to new files.

Running the example creates three new files: the english-german-both.pkl that contains all of the train and test examples that we can use to define the parameters of the problem, such as max phrase lengths and the vocabulary, and the english-german-train.pkl and english-german-test.pkl files for the train and test dataset.

We are now ready to start developing our translation model.

Train Neural Translation Model

In this section, we will develop the neural translation model.

If you are new to neural translation models, see the post:

This involves both loading and preparing the clean text data ready for modeling and defining and training the model on the prepared data.

Let’s start off by loading the datasets so that we can prepare the data. The function below named load_clean_sentences() can be used to load the train, test, and both datasets in turn.
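
A sketch of load_clean_sentences(), assuming the files were written with pickle in the data preparation step. As with any pickle file, only load files you created yourself.

```python
from pickle import load

def load_clean_sentences(filename):
    """Load a pickled list of cleaned phrase pairs."""
    with open(filename, 'rb') as file:
        return load(file)
```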

We will use the “both” or combination of the train and test datasets to define the maximum length and vocabulary of the problem.

This is for simplicity. Alternately, we could define these properties from the training dataset alone and truncate examples in the test set that are too long or have words that are out of the vocabulary.

We can use the Keras Tokenizer class to map words to integers, as needed for modeling. We will use a separate tokenizer for the English sequences and the German sequences. The function below named create_tokenizer() will train a tokenizer on a list of phrases.
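
To make the word-to-integer mapping concrete, here is a plain-Python stand-in for what a fitted tokenizer provides. It is a sketch, not the Keras class: build_word_index() is a hypothetical helper that ranks words by frequency starting at 1, leaving 0 free for padding, which mirrors the convention of the Keras word index.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # index 0 is left unused so it can serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

lines = ['tom turned pale', 'tom is here']   # illustrative phrases
word_index = build_word_index(lines)
vocab_size = len(word_index) + 1             # +1 accounts for the reserved 0
print(word_index, vocab_size)
```

The "+ 1" on the vocabulary size matters later: the embedding and output layers must have room for index 0 as well as every word index.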

Similarly, the function named max_length() below will find the length of the longest sequence in a list of phrases.
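
A sketch of max_length(), assuming phrases are already cleaned so that splitting on white space yields the words:

```python
def max_length(lines):
    """Return the number of words in the longest phrase."""
    return max(len(line.split()) for line in lines)

print(max_length(['tom turned pale', 'i need first aid']))
```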

We can call these functions with the combined dataset to prepare tokenizers, vocabulary sizes, and maximum lengths for both the English and German phrases.

We are now ready to prepare the training dataset.

Each input and output sequence must be encoded to integers and padded to the maximum phrase length. This is because we will use a word embedding for the input sequences and one hot encode the output sequences. The function below named encode_sequences() will perform these operations and return the result.
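
The two operations, integer encoding and padding, can be sketched in plain Python. This version takes a word-to-integer dictionary instead of a Keras tokenizer, pads with zeros at the end of each sequence, and skips words that are not in the vocabulary; all three choices are assumptions made for illustration.

```python
def encode_sequences(word_index, length, lines):
    """Integer encode each phrase and pad it with zeros to a fixed length."""
    encoded = []
    for line in lines:
        # integer encode, silently skipping words missing from the vocabulary
        ids = [word_index[w] for w in line.split() if w in word_index]
        # guard against over-long input, then pad at the end with 0
        ids = ids[:length]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

word_index = {'ich': 1, 'bin': 2, 'hier': 3}   # illustrative vocabulary
print(encode_sequences(word_index, 5, ['ich bin hier', 'ich bin']))
# [[1, 2, 3, 0, 0], [1, 2, 0, 0, 0]]
```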

The output sequence needs to be one-hot encoded. This is because the model will predict the probability of each word in the vocabulary as output.

The function encode_output() below will one-hot encode English output sequences.
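
What one-hot encoding does to the padded sequences can be shown with nested lists. This is a dependency-free sketch rather than an array-based implementation; the result has the same three-dimensional layout (examples, time steps, vocabulary size) as the reshape a reader quotes in the comments.

```python
def encode_output(sequences, vocab_size):
    """One-hot encode integer sequences: one vocab-sized vector per word."""
    return [[[1 if i == word_id else 0 for i in range(vocab_size)]
             for word_id in sequence]
            for sequence in sequences]

# one phrase of three time steps over a vocabulary of four indices
print(encode_output([[1, 2, 0]], vocab_size=4))
# [[[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]]
```

This layout is also why memory use grows quickly: every word in every output phrase becomes a vector as long as the vocabulary.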

We can make use of these two functions and prepare both the train and test dataset ready for training the model.

We are now ready to define the model.

We will use an encoder-decoder LSTM model on this problem. In this architecture, the input sequence is encoded by a front-end model called the encoder, then decoded word by word by a backend model called the decoder.

The function define_model() below defines the model and takes a number of arguments used to configure the model, such as the size of the input and output vocabularies, the maximum length of input and output phrases, and the number of memory units used to configure the model.
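
To see how these arguments shape the network, the sketch below traces the output shape of each layer. It assumes the layer stack that readers quote in the comments (Embedding, LSTM, RepeatVector, LSTM returning sequences, TimeDistributed Dense); trace_shapes() is a hypothetical helper that builds no model, and the sizes passed in are illustrative.

```python
def trace_shapes(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units, batch=1):
    """Return (layer, output shape) pairs for the assumed encoder-decoder stack."""
    # src_vocab only sets the size of the embedding table, not any output shape
    return [
        ('input', (batch, src_timesteps)),
        ('Embedding', (batch, src_timesteps, n_units)),
        # the encoder LSTM keeps only its final state: one vector per phrase
        ('LSTM (encoder)', (batch, n_units)),
        # that vector is copied once per output time step
        ('RepeatVector', (batch, tar_timesteps, n_units)),
        ('LSTM (decoder)', (batch, tar_timesteps, n_units)),
        # a softmax over the target vocabulary at every output time step
        ('TimeDistributed(Dense)', (batch, tar_timesteps, tar_vocab)),
    ]

# illustrative sizes, not the values printed by the tutorial
for name, shape in trace_shapes(3000, 2000, 10, 5, 256):
    print('%-24s %s' % (name, shape))
```

The trace shows why the repeat step is needed: the encoder emits a single vector, while the decoder needs one input per output word.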

The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical cross-entropy loss function because we have framed the prediction problem as multi-class classification.

The model configuration was not optimized for this problem, meaning that there is plenty of opportunity for you to tune it and lift the skill of the translations. I would love to see what you can come up with.

For more advice on configuring neural machine translation models, see the post:

Finally, we can train the model.

We train the model for 30 epochs with a batch size of 64 examples.

We use checkpointing to ensure that each time the model skill on the test set improves, the model is saved to file.
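
The checkpointing rule itself is simple, and the sketch below illustrates it without any deep learning library. It assumes that "skill improves" means the loss on the held-out data reaches a new minimum; in Keras this behaviour is usually obtained with the ModelCheckpoint callback and its save_best_only option. The helper name and the loss values are made up for illustration.

```python
def epochs_that_checkpoint(val_losses):
    """Return the 1-based epochs at which a best-only checkpoint would save."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        # save only when the monitored loss reaches a new minimum
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved

# made-up validation losses for five epochs
print(epochs_that_checkpoint([3.1, 2.6, 2.7, 2.4, 2.5]))
# [1, 2, 4]
```

The file on disk therefore always holds the weights from the best epoch seen so far, not necessarily the last one.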

We can tie all of this together and fit the neural translation model.

The complete working example is listed below.

Running the example first prints a summary of the parameters of the dataset such as vocabulary size and maximum phrase lengths.

Next, a summary of the defined model is printed, allowing us to confirm the model configuration.

A plot of the model is also created providing another perspective on the model configuration.

Plot of Model Graph for NMT

Next, the model is trained.

Each epoch takes about 30 seconds on modern CPU hardware; no GPU is required.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

During the run, the model will be saved to the file model.h5, ready for inference in the next step.

Evaluate Neural Translation Model

We will evaluate the model on the train and the test dataset.

The model should perform very well on the train dataset and ideally will have generalized to perform well on the test dataset.

Ideally, we would use a separate validation dataset to help with model selection during training instead of the test set. You can try this as an extension.

The clean datasets must be loaded and prepared as before.

Next, the best model saved during training must be loaded.

Evaluation involves two steps: first generating a translated output sequence, and then repeating this process for many input examples and summarizing the skill of the model across multiple cases.

Starting with inference, the model can predict the entire output sequence in a one-shot manner.

This will be a sequence of integers that we can enumerate and look up in the tokenizer to map back to words.

The function below, named word_for_id(), will perform this reverse mapping.
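
A sketch of the reverse lookup, written against a plain word-to-integer dictionary. With a fitted Keras tokenizer, the equivalent dictionary is its word_index attribute; taking the dictionary directly is an assumption made to keep the example self-contained.

```python
def word_for_id(integer, word_index):
    """Return the word mapped to an integer, or None if there is none."""
    for word, index in word_index.items():
        if index == integer:
            return word
    # 0 (padding) and unknown integers have no word
    return None

word_index = {'tom': 1, 'turned': 2, 'pale': 3}   # illustrative vocabulary
print(word_for_id(2, word_index))   # turned
print(word_for_id(0, word_index))   # None
```

For large vocabularies, building an inverted dictionary once would be faster than scanning on every lookup.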

We can perform this mapping for each integer in the translation and return the result as a string of words.

The function predict_sequence() below performs this operation for a single encoded source phrase.
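
The decoding step can be sketched as follows: take the most probable word at each output time step and stop at the first integer that has no word, which is how padding shows up. The sketch takes a word-to-integer dictionary rather than a tokenizer, and StubModel is a stand-in with fixed probabilities so the example runs without a trained model; both are assumptions for illustration.

```python
def word_for_id(integer, word_index):
    """Return the word mapped to an integer, or None if there is none."""
    for word, index in word_index.items():
        if index == integer:
            return word
    return None

def predict_sequence(model, word_index, source):
    """Translate one encoded source phrase into a string of words."""
    # one row of word probabilities per output time step
    prediction = model.predict(source, verbose=0)[0]
    words = []
    for row in prediction:
        # index of the most probable word at this time step
        best = max(range(len(row)), key=lambda j: row[j])
        word = word_for_id(best, word_index)
        if word is None:
            break
        words.append(word)
    return ' '.join(words)

class StubModel:
    """Stand-in for a trained model; always returns the same probabilities."""
    def predict(self, source, verbose=0):
        return [[[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]

word_index = {'hello': 1, 'world': 2}   # illustrative vocabulary
print(predict_sequence(StubModel(), word_index, source=[[1, 2, 0]]))
# hello world
```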

Next, we can repeat this for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.

We can print some of these comparisons to screen to get an idea of how the model performs in practice.

We will also calculate the BLEU scores to get a quantitative idea of how well the model has performed.

You can learn more about the BLEU score here:

The evaluate_model() function below implements this, calling the above predict_sequence() function for each phrase in a provided dataset.
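
To show what the scoring step computes, here is a simplified corpus-level BLEU in plain Python. It is not the NLTK implementation: corpus_bleu_simple() is a hypothetical helper that assumes exactly one reference per prediction, uniform n-gram weights, and no smoothing. NLTK's corpus_bleu, by contrast, expects a list of references for each prediction.

```python
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_simple(references, hypotheses, max_n=4):
    """Cumulative BLEU-max_n over a corpus, one reference per hypothesis."""
    clipped = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # clip each n-gram count by its count in the reference
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0
    # geometric mean of the modified n-gram precisions
    log_precision = sum(log(c / t) for c, t in zip(clipped, totals)) / max_n
    # brevity penalty for predictions shorter than their references
    brevity = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
    return brevity * exp(log_precision)

expected = ['i need first aid'.split()]
predicted = ['i need them you'.split()]
print('BLEU-1: %f' % corpus_bleu_simple(expected, predicted, max_n=1))
# BLEU-1: 0.500000
```

Without smoothing, a single missing n-gram order drives the score to zero, which is one reason BLEU-4 on short phrases tends to be low.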

We can tie all of this together and evaluate the loaded model on both the training and test datasets.

The complete code listing is provided below.

Running the example first prints examples of source text, expected and predicted translations, as well as scores for the training dataset, followed by the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Looking at the results for the training dataset first, we can see that the translations are readable and mostly correct.

For example: “ich bin brillentrager” was correctly translated to “i wear glasses”.

We can also see that the translations were not perfect, with “hab ich nicht recht” translated to “am i fat” instead of the expected “am i wrong”.

We can also see the BLEU-4 score of about 0.45, which provides an upper bound on what we might expect from this model.

Looking at the results on the test set, we do see readable translations, which is not an easy task.

For example, we see “tom erblasste” correctly translated to “tom turned pale”.

We also see some poor translations and a good case that the model could benefit from further tuning, such as “ich brauche erste hilfe” translated as “i need them you” instead of the expected “i need first aid”.

A BLEU-4 score of about 0.153 was achieved, providing a baseline skill to improve upon with further improvements to the model.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Data Cleaning. Different data cleaning operations could be performed on the data, such as not removing punctuation or normalizing case, or perhaps removing duplicate English phrases.
  • Vocabulary. The vocabulary could be refined, perhaps by removing words used fewer than 5 or 10 times in the dataset and replacing them with “unk”.
  • More Data. The dataset used to fit the model could be expanded to 50,000, 100,000 phrases, or more.
  • Input Order. The order of input phrases could be reversed, which has been reported to lift skill, or a Bidirectional input layer could be used.
  • Layers. The encoder and/or the decoder models could be expanded with additional layers and trained for more epochs, providing more representational capacity for the model.
  • Units. The number of memory units in the encoder and decoder could be increased, providing more representational capacity for the model.
  • Regularization. The model could use regularization, such as weight or activation regularization, or the use of dropout on the LSTM layers.
  • Pre-Trained Word Vectors. Pre-trained word vectors could be used in the model.
  • Recursive Model. A recursive formulation of the model could be used where the next word in the output sequence could be conditional on the input sequence and the output sequence generated so far.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to develop a neural machine translation system for translating German phrases to English.

Specifically, you learned:

  • How to clean and prepare data ready to train a neural machine translation system.
  • How to develop an encoder-decoder model for machine translation.
  • How to use a trained model for inference on new input phrases and evaluate the model skill.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Note: This post is an excerpt chapter from: “Deep Learning for Natural Language Processing”. Take a look, if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.

631 Responses to How to Develop a Neural Machine Translation System from Scratch

  1. Klaas January 10, 2018 at 7:53 am #

    amazing work again. One Question. Do you have a seperate tutorial where you explain the LSTM layers (Timedistributed, Repeatvector,…)?

      • Mayank July 29, 2018 at 11:15 pm #

        Hi, Jason your work is amazing. I am having one issue . How to convert large dataset in one-hot vectors as it will take more memory??

        • Jason Brownlee July 30, 2018 at 5:50 am #

          Perhaps progressively load the dataset and convert it?
          Perhaps use a smaller data sample?
          Perhaps use a machine with more ram?
          Perhaps use a big data pipeline like hadoop?

      • mira January 3, 2020 at 7:38 pm #

        translation = model.predict(source, verbose=0) i cant working this. I get error. Source is not defined. how can i solve?

      • afifa February 26, 2020 at 5:43 am #

        suppose, i have two files- 1st one has eng- germany text and the 2nd one has eng-spanish text. now i can i translate from germany to spain?

        • Jason Brownlee February 26, 2020 at 8:27 am #

          Why? The question seems flawed/incomplete.

        • Rodriq August 3, 2021 at 4:26 pm #

          Extract the German text and the corresponding Spanish text to form a new file, then use it to train the model. I guess 🙂

      • Anuj Kumar March 26, 2020 at 3:50 pm #

        Hello Jason, in python there is nothing like re_print , can you please guide me here.

      • sharath June 11, 2021 at 2:31 am #

        Hello Jason Iam getting three different elements after cleaning the data i can’t understand what the third element in this list means could you explain ?
        array([[‘theres nothing left to eat at home’,

        ‘es ist nichts zu essen mehr im haus’,

        ‘ccby france attribution tatoebaorg shekitten pfirsichbaeumchen’]

    • AGENT_24 January 20, 2020 at 5:04 am #

      how to translate new english text to german using predicted results?

  2. Mohamed January 10, 2018 at 1:51 pm #

    Your tutorials are amazing indeed. Thank you!
    Hope you will have the time to work on the Extensions lists above. This will complete this amazing tutorial.

    Thanks again!

  3. Richard January 12, 2018 at 5:52 am #

    Brilliant, thanks Jason. I’m looking forward to giving this a try.

  4. Parul January 14, 2018 at 7:47 am #

    hey i want to know one thing that if we are giving english to german translations to the model for training 9000 and for testing 1000.. then what is the encoder decoder model is actually doing ..as we are giving everything to the model at the time of testing.

    • Jason Brownlee January 15, 2018 at 6:54 am #

      The model is not given the answer, it must translate new examples.

      Perhaps I don’t follow your question?

      • Barnabas March 13, 2019 at 10:18 pm #

        Then how do i enter the example? on which line are you picking it

  5. abkul orto January 15, 2018 at 5:38 pm #

    Hi Jason,

    I am regular reader of your articles and purchased books.i want to work on translation of a local language to english.kindly advice on the steps.

    thanks you

  6. kannu January 20, 2018 at 4:50 am #

    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))

    can u please explain me the meaning of this code for ex what is string.printable actually doing and what is the meaning of (‘[^%s]’

    • Jason Brownlee January 20, 2018 at 8:24 am #

      I am selecting “not the printable characters”.

      You can learn more about regex from a good book on Python.

  7. Harish Yadav January 20, 2018 at 9:22 pm #

    Excellent explanation i would say!!!! damn good !!!looking to develop text-phonemes with your model !!!

  8. Drishty January 23, 2018 at 8:28 pm #

    Hi , Jason your wok is amazing and while i was doing this code i found this and i want to know i it’s required ti reshape the sequence ? and what sequence.shape[0],sequence.shape[1] is doing.
    and why we need the vocab size ?
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)

  9. Drishty January 23, 2018 at 8:29 pm #

    *want to know why it’s required to reshape the sequence ? and what

    • Jason Brownlee January 24, 2018 at 9:55 am #

      We must ensure that the data is the correct shape that is expected by the model, e.g. 2d for MLPs, 3D for LSTMs, etc.

  10. firoz January 24, 2018 at 4:41 am #

    hi ,

    i wanted to ask tyou why we have not done one-hot encoding for text in german.?

    • Jason Brownlee January 24, 2018 at 9:58 am #

      The input data is integer encoded and passed through a word embedding. No need to one hot encode in this case.

  11. ravi January 25, 2018 at 4:59 am #

    hello sir,

    over here the load_model is not defined .

    thank you .

    • Jason Brownlee January 25, 2018 at 5:58 am #

      from keras.models import load_model

    • ravi January 25, 2018 at 6:17 am #

      can please tell me where the

      translation = model.predict(source, verbose=0)

      error: source is not deifined

      • Jason Brownlee January 25, 2018 at 9:07 am #

        Sorry, I have not seen that error. Perhaps try copying the entire example at the end of the post?

  12. asheesh January 25, 2018 at 6:36 am #

    while running above code i am facing memory error in to_categorical function. I am doing translation for english to hindi. Pls give any suggestion.

    • Jason Brownlee January 25, 2018 at 9:09 am #

      Perhaps try updating Keras?
      Perhaps try modifying the code to use progressive loading?
      Perhaps try running on AWS with an instance that has more RAM?

  13. Harish Yadav January 25, 2018 at 11:20 pm #

    please do a model on attention with gru and beam search

  14. Harish Yadav January 30, 2018 at 4:13 pm #

    i have used bidirectional lstm,got a better result…i want to improve more …but i dont know how to implement attention layer in keras…could you please help me out…

  15. hayet January 31, 2018 at 9:48 pm #

    Hi, I want know why you use model.add(RepeatVector(tar_timesteps))?

  16. hayet February 2, 2018 at 12:11 am #

    is it possible to calculate the NMT model score with this method

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    scores = model.evaluate(testX,testY)

    • Jason Brownlee February 2, 2018 at 8:20 am #

      It will estimate accuracy and loss, but it will not give you any insight into the skill of the NMT on text data.

  17. Darren February 20, 2018 at 5:03 am #

    Hi Jason, brilliant article!

    Just a quick question, when you configure the encoder-decoder model, there seems no inference model as you mentioned in your previous articles? If this model has achieved what inference model did, in which layer? If not, how does it compare to the suite of train model, inference-encoder model and inference-decoder model? Thank you so much!

  18. Jakobe February 25, 2018 at 4:45 am #

    Does text_to_sequences encode data ?
    according to the documentation it just transform texts to a list of sequences

    • Jason Brownlee February 25, 2018 at 7:45 am #

      Yes, it encodes words in text to integers.

      • Jakobe March 6, 2018 at 9:38 am #

        Could you verify This documentation. It is mentionned that text_to_sequences return STR.
        I am confusing right now.
        https://keras.io/preprocessing/text/

        • Jason Brownlee March 6, 2018 at 2:55 pm #

          For “texts_to_sequences” on Tokenizer it says:

          “Return: list of sequences (one per text input).”

  19. Emil March 6, 2018 at 10:41 am #

    ImportError: cannot import name ‘corpus_bleu’
    Did anyone have an idea about this error.

  20. Dirck March 10, 2018 at 8:54 pm #

    By following your tutorial, I was able to find BLEU scores on test dataset as follow :
    BLEU-1: 0.069345
    BLEU-2: 0.255634
    BLEU-3: 0.430785
    BLEU-4: 0.490818

    So we can notice that they are very close to the scores on train dataset.
    Is it about overfitting or it is a normal behavior ?

    • Jason Brownlee March 11, 2018 at 6:25 am #

      Nice work!

      Similar scores on train and test is a sign of a stable model. If the skill is poor, it might be a stable but underfit model.

  21. vikas dixit March 10, 2018 at 11:12 pm #

    Hello sir, you are using test data as validation data. This means model has seen test data during training phase only. I think test data is kept separated. Am I right?? If yes please explain logic behind it.

  22. sindhu reddy March 20, 2018 at 2:32 am #

    Hello sir, great explanation. everything works well with the given corpus.when i am using the own corpus it says .pkl file is not encoded in utf-8.

    can you please share the the encoding of the text files used for the above project?

    It is giving following error
    —————————————————————————
    IndexError Traceback (most recent call last)
    in ()
    65 # spot check
    66 for i in range(100):
    ---> 67 print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

    IndexError: too many indices for array

    Kindly help

    • Jason Brownlee March 20, 2018 at 6:26 am #

      Perhaps double check you are using Python 3?

      • sindhu reddy March 20, 2018 at 6:30 pm #

        yes i am using python 3.5

        • Jason Brownlee March 21, 2018 at 6:31 am #

          Are you able to confirm that all other libs are up to date and that you copied all of the code from the example?

  23. sindhu reddy March 21, 2018 at 5:06 pm #

    yes jason i have updated all the libraries. it is working completely fine for the deu,txt file .
    when ever i use my own text file it is giving the following error.

    can you kindly tell what formatting is used in text file.

    Thanks

    • Jason Brownlee March 22, 2018 at 6:19 am #

      As stated in the post, the format is “Tab-delimited Bilingual Sentence Pairs”.

  24. Jigyasa Sakhuja March 24, 2018 at 3:47 am #

    hi Jason i am a fan of yours and i have implemented this machine translation and it was awesome i got all the results perfectly .. now i wanted to generate code using natural language by using RNN.. and when i am reading my file which is of declartaion and docstrings it is not showing as it is the ouput .. like it should show the declarations but it is showing something like x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/x00/

    but it should show
    if cint(frappe.db.get_single_value(u’System DCSP Settings’, u’setup_complete’)):

  25. sasi March 28, 2018 at 5:59 pm #

    In your data x is English and y is german… but in the code x is German, and y is english… why that difference????????????

    • Jason Brownlee March 29, 2018 at 6:31 am #

      We are translating from German (X) to English (Y).

      You can learn the reverse if you prefer. I chose not to because my english is better than my german.

  26. Kam March 29, 2018 at 8:48 pm #

    Hi,
    I am trying to use pre trained word embeddings to make translation.
    But, after making some researrch I found that pre-trained word embeddings are just only user for initialize encoder and decoder and also we nedd only the src embeddings.
    So, for the moment I am confused.
    Normally, must we provide source and target embeddings to the algorithme ?
    Please if they are some documentation or links about this topic.

    • Jason Brownlee March 30, 2018 at 6:37 am #

      Not sure I follow, what do you mean exactly?

      You can use a pre-trained embedding. This is separate from needing to have input and output data pairs to train the model.

  27. Sindhura April 4, 2018 at 3:57 am #

    Regarding recursive model in extensions, isn’t it already implemented in the current code? Because the decoder part is lstm and is lstm output of one unit is fed to the next unit.

  28. Max b April 17, 2018 at 3:55 am #

    “be stolen returned” is my systems translation of “vielen dank jason”, which ist supposed to mean: Thank you so much Jason!

    This post helped me a lot and I’ll now continue to tune it. Keep up the awesome work!

  29. suraj April 17, 2018 at 7:38 pm #

    In machine translation why we need vocabulary with the english text and german text …?

    • Jason Brownlee April 18, 2018 at 8:02 am #

      We need to limit the number of words that we model, it cannot be unbounded, at least in the way I’m choosing to model the problem.

      • michael April 20, 2018 at 12:24 am #

        That suggests that it can be unbounded if you model it in a different way.

  30. AlgoP April 24, 2018 at 11:42 pm #

    Hi Jason,
    I have just tested the clean_pairs method against ENG-PL set provided on the same website.One of the characters does not print on the screen( ‘all the other non ASCII chars are converted correctly), it is ignored as per this line I guess:

    I did an experiment with replacing the above with line = normalize(‘NFD’, line).encode(‘utf-8’, ‘ignore’), but there is no difference between these two in results.I am not sure why this is happening as it is only one letter.Also,( I assume your chose was ascii as you built a German to English translator am I correct?).Could you plase share your thoughts, if possible?

    • Jason Brownlee April 25, 2018 at 6:33 am #

      Perhaps you’re able to inspect the text or search the text for non-ascii chars to see what the offending characters are?

      This might give you insight into what is going on.

    • AlgoP April 25, 2018 at 6:44 am #

      I am working on it -it looks like it may be the issue with re.escape method rather than with encoding itself.

  31. Johny May 1, 2018 at 9:49 pm #

    Does removing punctuation not preventing the model to be used to predict a paragraph? How can you evaluate it with one sentence or paragraph not in the test set?

    • Jason Brownlee May 2, 2018 at 5:39 am #

      You can provide data to the model and make a prediction.

      call the predict_sequence() function we wrote above.

  32. Umesh May 1, 2018 at 10:53 pm #

    From Keras. Proprocessing. Text import Tokenizer
    ..
    Does not woking after installing keras..
    ..
    It’s says that no module named tensorflow
    ..
    I have windows 32 it machine.
    ..
    Your article very good…!
    .
    But I can’t process ahead due to this problem!

  33. Jundong May 4, 2018 at 9:53 am #

    Thank you for your article, Jason!

    I have one question about the difference between your implementation and the Keras tutorial “https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html”. It seems to me that there is a ‘teacher forcing’ element in the Keras tutorial, which uses the “target” (offset by one step) as the “decoder input data”. This element is not present in your model. My question is: is it necessary? Or do you just use “RepeatVector” and “TimeDistributed” to implement a similar function?

    Thank you!

  34. Beay May 5, 2018 at 9:08 pm #

    Great help Jason, thank you one more time. I want to ask you:

    How can I implement a bidirectional LSTM for further improvements? Below is what I did in the code; please fix it with your knowledge.

    def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(Bidirectional(LSTM(n_units)))
        model.add(RepeatVector(tar_timesteps))
        model.add(Bidirectional(LSTM(n_units, return_sequences=True)))
        model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
        return model

  35. Beay May 6, 2018 at 1:05 am #

    In the code below:

    # remove non-printable chars from each token
    line = [re_print.sub('', w) for w in line]

    I get errors like these with Turkish words, for example:

    “kaç” -> “kac”, “koş” -> “kos”

    How can I fix it?

    Thank you.

    • Jason Brownlee May 6, 2018 at 6:31 am #

      I don’t follow sorry. What is the problem exactly?

  36. Beay May 6, 2018 at 7:25 am #

    I have used this code on a Turkish-English corpus file and some Turkish characters are missing (ç,ğ,ü,ğ,Ö,Ğ,Ü,İ,ı).

    thank you.

    • Jason Brownlee May 7, 2018 at 6:45 am #

      Missing after the conversion?

      Perhaps normalizing to Latin characters is not the best approach for your specific problem?
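
      A minimal sketch of what the normalization step does to Turkish words, assuming the cleaning code decomposes with NFD and then encodes to ASCII with errors ignored, as quoted elsewhere in this thread. The helper names to_ascii and keep_unicode are hypothetical, and keep_unicode is only one possible alternative, not the tutorial's method.

```python
from unicodedata import normalize

def to_ascii(text):
    # Decompose accented letters, then silently drop every non-ASCII byte.
    return normalize('NFD', text).encode('ascii', 'ignore').decode('UTF-8')

def keep_unicode(text):
    # Alternative (an assumption): compose to NFC and keep every letter.
    return normalize('NFC', text)

for word in ['kaç', 'koş', 'ılık']:
    # Letters with a decomposition lose their accent; the dotless i has no
    # ASCII base letter, so it disappears completely.
    print(word, '->', to_ascii(word), '|', keep_unicode(word))
```

      If the Unicode-preserving variant is used, any later filter that keeps only ASCII printable characters would also have to be relaxed, otherwise the same letters are removed one step later.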

  37. Sai May 18, 2018 at 4:55 am #

    Thank you very much. Could you please help me find a good dataset for Thai to English? The Thai dataset available from the ManyThings.org website has too little data. I am trying to use this approach to build something similar for Thai.

    • Jason Brownlee May 18, 2018 at 6:27 am #

      Sorry, I don’t know off hand.

    • Sai May 18, 2018 at 10:39 pm #

      Please ignore my query, i have searched and got the dataset. Thank you for these articles

  38. pep May 18, 2018 at 7:35 pm #

    Once the model is trained, could it be used to predict in both directions, I mean English-German and German-English?

  39. Meghna May 23, 2018 at 9:10 pm #

    Hi Jason, thank you for the amazing tutorial. It really helped me. I implemented the above code and understood each function. Further, I want to implement Neural conversation model as given in https://arxiv.org/pdf/1506.05869.pdf on dialogue data. So, I have 2 questions, first is how to make pairing in dialogue data and second is how to feed previous conversations as input to the decoder model.

    • Jason Brownlee May 24, 2018 at 8:11 am #

      Sorry, I don’t have an example of a dialog system. I hope to cover it in the future.

  40. Ahmad Ahmad May 24, 2018 at 6:30 pm #

    G.M Mr Jason …

    In my model , I find BLEU scores on train dataset as follow :

    BLEU-1: 0.736022
    BLEU-2: 0.717377
    BLEU-3: 0.710192
    BLEU-4: 0.692681

    So we can notice that they are higher from the scores on train dataset.
    Is it normal behavior or is it bad ?

  41. maitha May 28, 2018 at 1:07 pm #

    Hi Jason,
    Great and helpful work. I am trying the code to translate Arabic to English, but the first step (Clean Text) gives me an empty [ ]. How can I solve this?
    [hi] => []
    [run] => []
    [help] => []

  42. Sastry May 28, 2018 at 11:24 pm #

    Hi Jason,

    Thanks for sharing a easy and simple approach for translations.

    I tried your code to work with Indian languages and found Hindi data set in the same location from where you shared the German dataset.

    The following normalize code removes the Hindi characters from the line. I have tried NFC and still face the same problem. If I skip this line, then the line that removes non-printable characters drops the Hindi text.

    print('Before: ', line)
    # normalize unicode characters
    line = normalize('NFD', line).encode('ascii', 'ignore')
    print('After: ', line)

    Before: Go.
    After: b'Go.'
    Before: जा.
    After: b'.'

    Does skipping these two lines of code affect the training in any way?

    Thanks,
    Sastry

    • Jason Brownlee May 29, 2018 at 6:26 am #

      Yes, the code example expects to work with Latin characters.
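
      One possible way to clean text without transliterating to Latin characters is sketched below. It is an assumption-laden variant, not the tutorial's code: clean_line is a hypothetical helper, the Devanagari danda is added to the punctuation table as an example, and str.isprintable() replaces the ASCII-only printable filter.

```python
import string
from unicodedata import normalize

def clean_line(line):
    # Compose characters instead of encoding them to ASCII.
    line = normalize('NFC', line)
    # Delete ASCII punctuation plus the Devanagari danda (U+0964).
    table = str.maketrans('', '', string.punctuation + '\u0964')
    words = [w.translate(table).lower() for w in line.split()]
    # Keep non-empty tokens made only of printable characters.
    words = [w for w in words if w and w.isprintable()]
    return ' '.join(words)

print(clean_line('Go.'))
print(clean_line('\u091c\u093e.'))  # the Hindi example from the comment above
```

      If your copy of the cleaning code also filters tokens with isalpha(), that filter would need to change as well, since Devanagari vowel signs are combining marks rather than letters.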

    • kamal deep garg October 1, 2018 at 12:49 pm #

      Hi Sastry sir

      Was your problem with the Hindi data resolved?

  43. kamal deep garg May 29, 2018 at 3:43 pm #

    Hello sir

    what is minimum Hardware requirement to train nmt using keras?

  44. Srijan Verma May 31, 2018 at 6:31 pm #

    Hi Jason,

    This post is really helpful. Thanks for this.

    I am working on building a translator which translates from English to Hindi (or any other Indian language). But I am facing a problem while cleaning the data.
    The normalize code does not work for Indian languages, and if I skip that line of code then I am not getting any output after training my data.

    Is there a way to use the same code on your post and some other way to clean the data for Indian languages to get the desired output..? Like are there any python modules/Libraries that i should install so as to use them for Indian Languages.?

    Thanks!

    • Jason Brownlee June 1, 2018 at 8:17 am #

      You may have to research how to prepare hindi data for NLP.

      Perhaps converting to Latin chars is not the best approach.

  45. lakshm June 1, 2018 at 3:02 pm #

    Hello,

    Aren’t we supposed to pass the English data along with the encoded data to the decoder? As per my understanding, only the encoded German data has been passed to the decoder, right?

  46. Sai June 5, 2018 at 6:57 pm #

    Hi Jason,

    I have now progressed up to training the model. Cleaning and tokenizing the dataset took time as I used a different language, but it was a good learning experience.

    I wanted to know the significance of “30 epochs and a batch size of 64 examples” in your example. Are these in any way related to the total vocabulary or the total number of trainable parameters?

    Also, could you please point me to any article of yours where I can learn more about what epochs are, what the BLEU score is, what loss is, etc.?

    Thank you

  47. Sai June 7, 2018 at 9:43 pm #

    Hi Jason,

    I have a silly question, but wanted to seek clarification.

    In step “Train Neural Translation Model” :- have used 10,000 rows from the dataset, and established the model in file model.h5 for xxx number of vocabularies.
    If I extract next 10,000 rows from data and continue to train the model using the same lines of code above, would it use the previously established model from model.h5 or would it be overwritten and start as fresh data being used to train ?

    Thank you,

    • Jason Brownlee June 8, 2018 at 6:11 am #

      Yes, the model will be trained using the existing model as a starting point.

  48. Sai June 8, 2018 at 3:02 pm #

    Hi Jason,

    ok, understood.

    Referred to your article https://machinelearningmastery.com/check-point-deep-learning-models-keras/ and understood that, before compiling the model using model.compile(), i have to load the model from file, to use existing model as starting point in training.

    Thank you very much.

    • Jason Brownlee June 9, 2018 at 6:45 am #

      Glad it helped.

    • Deeksha May 8, 2019 at 5:19 am #

      Did you try using model.fit_generator?

  49. Paul June 8, 2018 at 3:19 pm #

    Hi Jason,
    Can Word2Vec be used as the input embedding to boost the LSTM model ? Or say that pre-trained word vector by Word2Vec as input of the model can get better?

    Thanks!

  50. Raghavendra June 12, 2018 at 11:06 am #

    Hello Jason,
    Excellently written article with intricate concepts explained in such a simple manner. However, it would be great if you could add an attention layer for handling longer sentences.

    I tried to add an attention layer to the code above by referring to the one below.
    https://github.com/keras-team/keras/issues/4962

    I am unable to add the attention layer. I have read your previous blog post on adding attention:

    https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/

    But the vocabulary at the output end is too large to be processed, so this does not solve the problem.

    It would be great if you added attention (Bahdanau’s or Luong’s) to your code above to solve the problem of longer sentences.

    Thanking you !

    • Jason Brownlee June 12, 2018 at 2:27 pm #

      Thanks, I hope to develop some attention tutorials once it is officially supported by Keras.

      • Raghavendra June 12, 2018 at 3:23 pm #

        How about including the attention snippet as you did in the later case? This code is working fine for me, except that attention is needed to handle longer sentences, and this is where I am facing issues. I was actually asking about adding attention to the above code as you did in the later case.

        • Jason Brownlee June 13, 2018 at 6:15 am #

          Sorry, I cannot create a custom example for you.

          I hope to give more examples of attention when Keras officially supports attention.

  51. Aparajita June 21, 2018 at 9:55 pm #

    Hi, I want to translate from English to German. What changes are required? I made a few changes, but it didn’t work. Please help me: how can I reverse it?

    • Jason Brownlee June 22, 2018 at 6:08 am #

      It should be straightforward. Sorry, I don’t have the capacity to prepare an example for you.

  52. ricky June 22, 2018 at 5:48 pm #

    Hello sir, how can I modify this project to use the existing model (.h5) in the next run without training again, so that I just use the model?

  53. Basil June 23, 2018 at 5:21 am #

    Jason – What’s your next tutorial, would be waiting for the next one eagerly, how would i get notified about your next one?

  54. Alex J July 3, 2018 at 4:47 pm #

    Hi Jason! Thanks for your amazing tutorial! Very clear and easy to understand. One question came up while I was reproducing your code: the console warns that “The hypothesis contains 0 counts of 2-gram, 3-gram and 4-gram overlaps”, which leads to BLEU-2 to BLEU-4 being 0. I can’t find the reason, because I copied your code exactly and it still doesn’t work. Can you help me with that? Thank you!

  55. Hani July 4, 2018 at 3:12 am #

    Hi,

    Could you please help me to convert a German word to a sequence of numbers?

  56. sree harsha July 5, 2018 at 2:04 am #

    Hi,
    Amazing article! Here we encode the sequences (into one hot vectors) and then give them as input to the encoder LSTM, and this is passed on to the decoder LSTM. Is my understanding correct? How can I give an input to the hidden states of an LSTM?

    • Jason Brownlee July 5, 2018 at 8:00 am #

      No, we do not one hot encode the input, we provide sequences of integers to the word embedding.

  57. Hani July 5, 2018 at 7:33 am #

    Hi,

    thank you for answering. I have another question. How can I use one hot encoding for the sequences in which it returns a 2D array not a 3D?

  58. Jack July 6, 2018 at 6:19 am #

    Really amazing post! Was surprised by the accuracy and limited training time. I have tried the model with a different dataset (two columns of sentences), but get a problem in the code for loading the clean data, splitting it, and saving the split portions of data to new files. line 20:
    dataset = raw_dataset[:n_sentences, :]

    IndexError: too many indices for array

    For print(raw_dataset) with your deu.txt, I get:
    [['Sentence A' 'Sentence a'] ['Sentence B' 'Sentence b'] etc. ]

    But for print(raw_dataset) with my file, I get:
    [ list(['sentence A', 'sentence a']) list(['sentence B', 'sentence b']) etc.]

    Any tips what I could do about this?
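
    One plausible cause, offered as a guess rather than a confirmed diagnosis: NumPy falls back to a 1D array of list objects when the rows do not all have the same number of fields, and such an array cannot be indexed with [:n, :]. The sketch below assumes a tab-separated file and uses a hypothetical n_columns parameter to force every row to the same length before the array is built.

```python
def to_pairs(doc, n_columns=2):
    # Split the file into lines, then split each line on tabs.
    rows = [line.split('\t') for line in doc.strip().split('\n')]
    # Keep the first n_columns fields and skip rows that are too short,
    # so that every remaining row has exactly n_columns entries.
    return [row[:n_columns] for row in rows if len(row) >= n_columns]

doc = "sentence A\tsentence a\nsentence B\tsentence b\textra\nbroken line"
pairs = to_pairs(doc)
print(pairs)
```

    With rows of equal length, converting the result to an array gives a regular 2D array again, and the slice in the split script should work as in the tutorial.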

  59. Josh Reid July 8, 2018 at 12:17 am #

    Hey Jason, amazing article, this helped immensely improve my understanding of how NMT works in the background!

    I experienced the same issue as Alex J: in the evaluation portion of the code, the BLEU-2, 3 and 4 scores are all 0 and warnings like this are thrown:
    “The hypothesis contains 0 counts of 2-gram overlaps.
    Therefore the BLEU score evaluates to 0, independently of
    how many N-gram overlaps of lower order it contains.
    Consider using lower n-gram order or use SmoothingFunction()”

    I’m not sure if something within nltk.bleu_score.corpus_bleu changed since you created this script but it looks like you need an additional list around each entry in actual. This is fixed by changing line 60 in that script from:
    actual.append(raw_target.split())
    to:
    actual.append([raw_target.split()])

    • Jason Brownlee July 8, 2018 at 6:23 am #

      Thanks Josh.

      • Karim November 27, 2018 at 4:02 am #

        Yes, indeed it works with:
        actual.append([raw_target.split()])
        The reference for each sentence should be a list of different correct sentences.
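
        A small sketch of the nesting described above, using made-up sentences. The import is guarded only so the snippet still runs where NLTK is not installed; the sentences and variable values are illustrative, not results from the tutorial.

```python
# One hypothesis per sentence, each a list of tokens.
predicted = [['tom', 'is', 'lonely'], ['thanks', 'for', 'the', 'warning']]
raw_targets = ['tom is lonely', 'thanks for the warning']

# corpus_bleu expects, for every sentence, a LIST of reference translations,
# each of which is itself a list of tokens. Hence the extra brackets.
actual = [[target.split()] for target in raw_targets]

try:
    from nltk.translate.bleu_score import corpus_bleu
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
except ImportError:
    print('nltk is not installed; skipping the score')
```

        Without the inner list, each token of the target is treated as a separate reference, which would explain scores that look wrong even when the predictions are reasonable.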

  60. Jack July 8, 2018 at 10:21 pm #

    Dear Jason, would it also be possible to use this model to do ‘translations’ within one language? For example, to use duplicate sentences as pairs such as:

    [‘The distance from the earth to the moon is 384.400 km’ ‘The moon is located 384.400 km away from the earth’]

    Given enough good examples, do you think this would work? I have tried it but get lousy results. Perhaps doing something wrong.

    • Jason Brownlee July 9, 2018 at 6:35 am #

      With enough training data, yes, you could do this.

      • Jack July 19, 2018 at 12:56 am #

        Dear Jason, I have just replaced the deu.txt dataset with a dataset containing two columns of English sentences and get the following (strange) predictions. Any suggestions what might cause this?

        src=[the best apps for increasing vocabulary are], target=[what are the best apps for increasing vocabulary], predicted=[and and and and and and and and and and and does does el el el el el]
        BLEU-1: 0.027778
        BLEU-2: 0.166667
        BLEU-3: 0.341279
        BLEU-4: 0.408248

        • Jason Brownlee July 19, 2018 at 7:54 am #

          Perhaps confirm that you are loading the dataset as you expect.

          You may then have to tune the model to this new dataset.

        • Remi June 11, 2019 at 11:59 pm #

          Hi,
          I’m currently doing something similar as I am trying to translate grammatically wrong french to correct french. Thing is, I also get some strange results like yours
          I’m not sure you will see this message but have you solved your problem? 🙂

          • Jason Brownlee June 12, 2019 at 8:04 am #

            Perhaps try tuning the model?

            Perhaps try more data?

            Perhaps try a different model architecture?

          • Raghav Sood June 17, 2019 at 7:39 pm #

            “There are duplicate phrases in English with different translations in German”. What problems does having duplicate phrases cause? What if I want a model to learn sentences similar in meaning to the input sentence( i.e. multiple possible outputs for the same input)? Which model would you recommend for such a situation?

          • Jason Brownlee June 18, 2019 at 6:37 am #

            It can be confusing to the model and result in lower skill.

            Simplify the problem for the model whenever possible.

  61. Sayantika Dey July 12, 2018 at 8:44 am #

    How much time does it take to print the BLEU score?
    Actually, that part of the code is not working for me and it’s not printing the BLEU score. Also, when I try to plot the model, it tells me to install Graphviz, but I already have it.

    • Jason Brownlee July 12, 2018 at 3:28 pm #

      It depends on your hardware, but it should not take excessively long.

      If you are getting strange results, ensure you have the latest versions of all of the libraries and that you have copied all of the code required.

  62. C M Khaled Saifullah July 18, 2018 at 5:38 am #

    First of all thanks for the tutorial, it helps me a lot.

    If I would like to incorporate an attention mechanism and beam search in the decoder, which part of the code needs to be changed?

    From the basic understanding I got from your following tutorial:

    https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/

    I need to replace the following code:

    def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(LSTM(n_units))
        model.add(RepeatVector(tar_timesteps))
        model.add(LSTM(n_units, return_sequences=True))
        model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
        return model

    into

    def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(LSTM(n_units))
        model.add(AttentionDecoder(n_units, n_features))
        return model

    After writing the custom attention layer code given in that post.

    I am not sure about the parameter n_features for this problem. Can you clarify it? Besides, can you help me find an implementation of beam search?

    Thanks for your time.

  63. Parul Singla July 18, 2018 at 3:58 pm #

    Sir, I’m using an English-Hindi translation dataset. When printing the saved file, the code shows output like…

    [has tom left] => []
    [he is french] => []
    [i am at home] => []
    [i cant move] => []
    [i dont know] => []

    Why am I not able to see the Hindi text? Is there any requirement to encode and decode again?

    • Jason Brownlee July 19, 2018 at 7:46 am #

      Sorry, I don’t know. I don’t have any examples working with Hindi text.

    • Aman Saini September 7, 2022 at 1:23 am #

      Use these steps and you will get Hindi text.

      import string
      import numpy as np

      def clean_pairs(lines):
          cleaned = list()
          # translation table that deletes ASCII punctuation
          table = str.maketrans('', '', string.punctuation)
          for pair in lines:
              clean_pair = list()
              for line in pair:
                  # tokenize on white space
                  line = line.split()
                  # convert to lowercase
                  line = [word.lower() for word in line]
                  # remove punctuation from each token
                  line = [word.translate(table) for word in line]
                  # store as string
                  clean_pair.append(' '.join(line))
              cleaned.append(clean_pair)
          return np.array(cleaned)

  64. Souraj July 22, 2018 at 11:24 pm #

    Hello Jason,

    Would it be possible to include a diagram or visualization to show how the dimensions match up in the layers used? I am having a hard time figuring out what the network looks like exactly. For example, why is the repeat vector necessary? Thanks in advance.

    • Jason Brownlee July 23, 2018 at 6:11 am #

      Yes, you can summarize what the model expects:

      And you can review your data:

  65. Nitin July 30, 2018 at 12:57 am #

    After saving and loading the model, I want to translate just one arbitrary line. How can I do that?

    • Jason Brownlee July 30, 2018 at 5:50 am #

      model.predict(…)

      • Rishav February 9, 2020 at 9:53 pm #

        How do I check with my own custom input instead of the test dataset?

        • Jason Brownlee February 10, 2020 at 6:30 am #

          Prepare the new data in the same way as training (cleaning and tokenization) then provide it to the model the same as we do in the last section of the above tutorial.
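
          A library-free sketch of the preparation step, to show the idea only. The word_index dictionary is a hypothetical stand-in for the vocabulary of the tokenizer fitted on the training data, and clean and encode are illustrative helper names; with Keras, the fitted Tokenizer's texts_to_sequences() and pad_sequences() do this job.

```python
import string

def clean(sentence):
    # Lowercase and strip punctuation, mirroring the training-time cleaning.
    table = str.maketrans('', '', string.punctuation)
    return [w.translate(table).lower() for w in sentence.split()]

def encode(sentence, word_index, length):
    # Map known words to their integer ids; unknown words are skipped here.
    ids = [word_index[w] for w in clean(sentence) if w in word_index]
    # Simplification: keep the first `length` ids, then pad with zeros at the end.
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

word_index = {'ich': 1, 'bin': 2, 'hier': 3}  # hypothetical vocabulary
print(encode('Ich bin hier!', word_index, 5))
```

          The encoded row is then reshaped into a batch of one and passed to the model, and the predicted integers are mapped back to words with the target-language tokenizer.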

  66. kamalika August 3, 2018 at 11:01 pm #

    Hi Jason,
    Thanks for this tutorial.
    I was trying to translate from Chinese to English and, looking at the clean_pairs function, I think it can’t be applied to Chinese characters.
    Can you give me some pointers on how to generate the clean text for the translation model?
    I am using the dataset from manythings.org.

    • Jason Brownlee August 4, 2018 at 6:10 am #

      You may have to update the example to work with Unicode text instead of ASCII chars.

  67. Rohit August 29, 2018 at 4:18 pm #

    Hello Jason, it was a great article. I implemented it for German to English and it worked fine. But when I implement it for Korean to English, I get junk output:

    src=[경고 고마워], target=[thanks for the warning], predicted=[i i the]
    src=[입조심해라], target=[watch your language], predicted=[i i you]
    src=[없다], target=[there arent any], predicted=[i i you]
    src=[톰은 외롭고 불행해], target=[tom is lonely and unhappy], predicted=[i i the]
    src=[그녀의 신앙심은 굳건하다], target=[her faith in god is unshaken], predicted=[i i the to]
    src=[세계는 너를 중심으로 돌아가지 않는다], target=[the world doesnt revolve around you], predicted=[i i i to to]
    src=[못 믿겠는데], target=[i can hardly believe it], predicted=[i i the]
    src=[그 약은 효과가 있었다], target=[that medicine worked], predicted=[i i]
    src=[모두 그녀를 사랑한다], target=[everybody loves her], predicted=[i i]

    I used training data from manythings.org with 773 lines (600 lines for training, 173 lines for testing).

    Can you please guide me on what the issue could be?

    • Jason Brownlee August 30, 2018 at 6:26 am #

      Perhaps the Korean characters need special handling?

      Perhaps the model needs further tuning?

  68. Ajita September 10, 2018 at 9:33 pm #

    Hey Jason, thanks for such awesome content. I have a question about why it is necessary to convert Unicode to ASCII when preparing the dataset, and why the NFD form is used specifically.

    • Jason Brownlee September 11, 2018 at 6:29 am #

      It is not required, it just made my example simpler.

  69. Bhimasen September 26, 2018 at 3:34 pm #

    Hi, very nice work on this blog. I also applied this LSTM to native Indian languages and got good results and scores. Great tutorial!

    My question is this: I want to do a kind of federated learning here. The model created from this dataset will be kept as the general model. Suppose I have another dataset (similar, but small), and I train a new model on it using the same code. Now I want to merge the weights of this new model with the one previously generated.

    How can I achieve this? Any suggestions would be greatly appreciated.

    • Jason Brownlee September 27, 2018 at 5:55 am #

      Nice work!

      You could keep both models and use them in an ensemble.
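
      A rough sketch of the simplest kind of ensemble, averaging the two models' output distributions for one time step. It assumes, and this is a strong assumption, that both models were trained with the same target vocabulary so that position i means the same word in both vectors. The probability values are made up for illustration.

```python
def average_predictions(prob_a, prob_b):
    # Element-wise mean of two probability vectors over the same vocabulary.
    return [(a + b) / 2.0 for a, b in zip(prob_a, prob_b)]

def argmax(values):
    # Index of the largest value.
    return max(range(len(values)), key=values.__getitem__)

general = [0.1, 0.6, 0.3]    # made-up output of the general model
specific = [0.2, 0.1, 0.7]   # made-up output of the small-dataset model
combined = average_predictions(general, specific)
print(argmax(combined))
```

      Averaging predictions avoids merging weights directly, which is harder to justify when the two models were trained independently.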

    • kamal deep garg October 1, 2018 at 12:52 pm #

      Hi Bhimasen

      i am also doing work on Indian languages.

      getting stuck in preprocessing of Punjabi

  70. Michał September 26, 2018 at 5:38 pm #

    Hi Jason
    Great tutorial. It works fine with German -> English, but when I use my own dictionary the predicted output is empty (“[]”).
    My dictionary is quite specific; it is sentence to sentence, like:
    “when raining then use umbrella6” -> “trigger raining check umbrella6”
    I have about 1000 lines (maybe too few) of similar sentences, and they contain these strange “umbrella6” strings (so string+ID).
    I was expecting that the results might not make any sense, but an empty prediction is strange. Shouldn’t there be something?

  71. Ash September 28, 2018 at 7:27 am #

    Maybe I missed it, but what happens if there is a new/unseen word in the input text? Or rather, what is expected in the output?

    • Jason Brownlee September 28, 2018 at 2:58 pm #

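      By default, the Keras Tokenizer skips words it has not seen (index 0 is reserved for padding); you can set its oov_token argument to map them to a dedicated index instead.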

  72. Cathal October 6, 2018 at 7:34 am #

    Hi Jason,

    Great tutorial, love your blog! I was just wondering how I can pass in my own input to be translated. How do I just pass in one sentence. Everything I have tried is not working!

    • Jason Brownlee October 6, 2018 at 11:42 am #

      If you have text to be translated, you can use google translate.

      If you want to use the model to make a prediction, you must encode new text using the same scheme used to prepare the training data then call model.predict().

  73. Tom Chan October 12, 2018 at 2:32 am #

    Hi Jason,

    Thanks for your detailed step by step process in walking everyone through. I have one help needed.

    What needs to be changed above for a Chinese-Portuguese machine translator?

    I target to do a (bi-directional) LSTM but cannot find existing word data file as the source.

    Hope you can point me the direction and thanks.

    B.Rgds,
    Tom

    • Jason Brownlee October 12, 2018 at 6:42 am #

      The model may need to be tuned for your new dataset.

  74. Ali October 15, 2018 at 3:38 am #

    When I run the evaluation I get the following result:
    UserWarning:
    The hypothesis contains 0 counts of 4-gram overlaps.
    Therefore the BLEU score evaluates to 0, independently of
    how many N-gram overlaps of lower order it contains.
    Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)
    BLEU-1: 0.077830
    BLEU-2: 0.000000
    BLEU-3: 0.000000
    BLEU-4: 0.000000

    How can I fix this?

    • Jason Brownlee October 15, 2018 at 7:33 am #

      Perhaps check the types of text generated by your model, your model may not have converged to a useful solution.

      • Bond October 23, 2018 at 12:18 am #

        How do we fix the issue? I tried re-running the model from the start again. It is showing the same result.

        /usr/local/lib/python3.5/dist-packages/nltk/translate/bleu_score.py:503: UserWarning:
        The hypothesis contains 0 counts of 4-gram overlaps.
        Therefore the BLEU score evaluates to 0, independently of
        how many N-gram overlaps of lower order it contains.
        Consider using lower n-gram order or use SmoothingFunction()
        warnings.warn(_msg)
        BLEU-1: 0.077346
        BLEU-2: 0.000000
        BLEU-3: 0.000000
        BLEU-4: 0.000000

        The same warning is there for 2-gram and 3-gram.

        • Jason Brownlee October 23, 2018 at 6:27 am #

          Perhaps try changing the configuration of the model?

  75. Bond October 22, 2018 at 5:14 pm #

    Hi, thanks for your contribution.

    Could you please clarify some of the doubts:

    1. In the CLEAN TEXT step, inside clean_pairs() function, line number 7 talks about making a translation table for removing punctuation.

    In the code, str.maketrans('', '', string.punctuation)
    gives error with str as an undefined attribute.

    And also what is “maketrans” function?

    2. Regarding the function “to_pairs”, this function converts the dataset in the following format:

    Original:
    Hi. Hallo!
    Hi. Grüß Gott!
    Run! Lauf!

    After:
    Hi.
    Hallo!
    Hi.
    Grüß Gott!
    Run!
    Lauf!

    i.e. put the corresponding translation in the next line by splitting the phrase pairs.

    Thanks.

    • Jason Brownlee October 23, 2018 at 6:21 am #

      You may be trying to use Python 2.7, I recommend using Python 3.5 or higher.
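
      For reference, a short Python 3 sketch of the two points raised above. str.maketrans is a static method of str in Python 3, which is why it is reported as missing under Python 2. The sample text is the one from the comment, assuming the two columns are separated by a tab.

```python
import string

# With three arguments, str.maketrans builds a table in which every character
# of the third argument is mapped to None, i.e. deleted by str.translate.
table = str.maketrans('', '', string.punctuation)
print('Hallo!'.translate(table))

# Splitting each line on the tab keeps a phrase and its translation together
# as one [english, german] row, rather than on alternating lines.
doc = 'Hi.\tHallo!\nRun!\tLauf!'
pairs = [line.split('\t') for line in doc.strip().split('\n')]
print(pairs)
```

      Printing the array of pairs wraps rows across lines, which can make it look as if the translation was moved to the next line.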

  76. satya October 25, 2018 at 5:41 pm #

    How does this implementation differ from the Keras implementation?

    https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

    Which one should I prefer?

    • Jason Brownlee October 26, 2018 at 5:32 am #

      Here we use an auto-encoder approach, whereas the Keras blog post uses an encoder-decoder that relies only on internal state.

      Use an approach that results in the best performance for your problem.

  77. Akshat Jain October 28, 2018 at 12:26 am #

    Hiii Jason,
    Thanks for this wonderful article. I have been trying to implement this and I got a doubt in

    prediction = model.predict(testX, verbose=1)[0]

    Why do we only take a single encoded source?

    • Jason Brownlee October 28, 2018 at 6:13 am #

      There is only one prediction/row, so we take it from the 2D array.

      • Akshat Jain October 29, 2018 at 8:56 pm #

        Sorry I don’t understand, the shape of prediction would be (1000, 5, 2309) but we only take the zeroth element from it. Why?

        • Jason Brownlee October 30, 2018 at 6:00 am #

          No, we are only translating one sentence of words at a time.

          To confirm, print the shape of the input and output of the predict function prior to only selecting the zero’th element.
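
          A library-free sketch of what the [0] does, using nested lists in place of the array returned by predict(). The index_word map and the probabilities are hypothetical; the point is only that a batch holding one sentence has a leading axis of size 1, and [0] removes it before the per-step argmax.

```python
def argmax(values):
    # Index of the largest value.
    return max(range(len(values)), key=values.__getitem__)

def decode(prediction, index_word):
    words = []
    for step in prediction:              # one probability vector per time step
        word = index_word.get(argmax(step))
        if word is None:                 # id 0 is padding: stop decoding
            break
        words.append(word)
    return ' '.join(words)

index_word = {1: 'i', 2: 'am', 3: 'here'}  # hypothetical id -> word map
# A batch holding ONE sentence: 1 x 4 time steps x 4 vocabulary entries.
batch = [[[0.1, 0.7, 0.1, 0.1],
          [0.1, 0.1, 0.7, 0.1],
          [0.1, 0.1, 0.1, 0.7],
          [0.7, 0.1, 0.1, 0.1]]]
prediction = batch[0]                      # drop the batch axis
print(decode(prediction, index_word))
```

          If the whole test set is passed to predict() at once, [0] returns only the first sentence's prediction, so the rows would need to be decoded in a loop instead.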

  78. Daniel Fernandez Boada November 20, 2018 at 6:40 pm #

    Hi Jason,

    Thank you for sharing this great article. Because of my null progress in learning German, after four years living in a German speaking country, I decided to create an application that I think could help me with it, and maybe to others too.

    As a first step I think your approach may fit well with my requirements. My question is: are all the code examples shown here free to reproduce, or is there any copyright?

    Thanks again,
    Dani.

  79. saas November 27, 2018 at 3:49 am #

    Hello,
    could you please help me?
    I’m doing the same work, neural translation from English to Arabic!
    I followed the steps provided, but I got an error.

    • Jason Brownlee November 27, 2018 at 6:38 am #

      Perhaps post your error to stackoverflow?

      • ssaa December 11, 2018 at 4:25 am #

        Hello sir,
        I got this result while running, but it does not appear properly:

        train
        src=[], target=[continue digging], predicted=[i is to]
        src=[], target=[tom laid the gun down on the floor], predicted=[i is to]
        src=[], target=[i have to find it], predicted=[i is to]
        src=[], target=[i believe in god], predicted=[i is to]
        src=[], target=[im a free man], predicted=[i is to]
        src=[], target=[can i use my credit card], predicted=[i is to]
        src=[], target=[she is about to leave], predicted=[i is to]
        src=[], target=[she raised her hands], predicted=[i is to]
        src=[], target=[my uncle died a year ago], predicted=[i is to]
        src=[], target=[im sitting alone in my house], predicted=[i is to]
        /anaconda3/lib/python3.6/site-packages/nltk/translate/bleu_score.py:490: UserWarning:
        Corpus/Sentence contains 0 counts of 2-gram overlaps.
        BLEU scores might be undesirable; use SmoothingFunction().
        warnings.warn(_msg)
        BLEU-1: 0.266528
        BLEU-2: 0.516264
        BLEU-3: 0.672548
        BLEU-4: 0.718515
        test
        src=[], target=[im working in a town near rome], predicted=[i is to]
        src=[], target=[she despised him], predicted=[i is to]
        src=[], target=[the clock is ticking], predicted=[i is to]
        src=[], target=[this river is one mile across], predicted=[i is to]
        src=[], target=[birds of a feather flock together], predicted=[i is to]
        src=[], target=[why did you turn down his offer], predicted=[i is to]
        src=[], target=[shes as clever as they make em], predicted=[i is to]
        src=[], target=[how can i help], predicted=[i is to]
        src=[], target=[our living room is sunny], predicted=[i is to]
        src=[], target=[can you speak french], predicted=[i is to]
        BLEU-1: 0.260667
        BLEU-2: 0.510555
        BLEU-3: 0.668076
        BLEU-4: 0.714531

        • Jason Brownlee December 11, 2018 at 7:51 am #

          Perhaps try fitting the model again?

          • ssaa December 12, 2018 at 1:02 am #

            My dataset is English-Arabic.
            When I load it and clean the data, I get this:
            [hi] => []
            [run] => []
            [help] => []
            [jump] => []
            [stop] => []
            [go on] => []
            [go on] => []
            [hello] => []
            [hurry] => []
            [hurry] => []
            [i see] => []
            [i won] => []
            [relax] => []
            [smile] => []
            [cheers] => []
            [got it] => []
            [he ran] => []
            [i know] => []
            [i know] => []
            [i know] => []
            [im] => []
            [im ok] => []
            [listen] => []
            [no way] => []
            [really] => []
            [thanks] => []
            [why me] => []
            [awesome] => []
            [call me] => []
            [call me] => []
            [come in] => []
            [come in] => []
            [come on] => []
            [come on] => []
            [come on] => []
            [get out] => []
            [get out] => []
            [get out] => []
            [go away] => []
            [go away] => []
            [go away] => []
            [goodbye] => []
            [he came] => []
            [he runs] => []
            [help me] => []
            [help me] => []
            [im sad] => []
            [me too] => []
            [shut up] => []
            [shut up] => []
            [shut up] => []
            [shut up] => []
            [stop it] => []
            [take it] => []
            [tom won] => []
            [tom won] => []
            [wake up] => []
            [welcome] => []
            [welcome] => []
            [welcome] => []
            [welcome] => []
            [who won] => []
            [who won] => []
            [why not] => []
            [why not] => []
            [have fun] => []
            [hurry up] => []
            [i forgot] => []
            [i got it] => []
            [i got it] => []
            [i got it] => []
            [i use it] => []
            [ill pay] => []
            [im busy] => []
            [im busy] => []
            [im cold] => []
            [im free] => []
            [im here] => []
            [im home] => []
            [im poor] => []
            [im rich] => []
            [it hurts] => []
            [its hot] => []
            [its new] => []
            [lets go] => []
            [lets go] => []
            [lets go] => []
            [lets go] => []
            [lets go] => []
            [look out] => []
            [look out] => []
            [look out] => []
            [speak up] => []
            [stand up] => []
            [terrific] => []
            [terrific] => []
            [tom died] => []
            [tom died] => []
            [tom left] => []
            [tom lied] => []

          • Jason Brownlee December 12, 2018 at 5:55 am #

            Perhaps your model requires further tuning?
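
            Empty targets like the ones listed above can occur when the cleaning step discards every non-ASCII character, which removes Arabic text entirely. Below is a minimal sketch, assuming the cleaning normalizes to ASCII the way `clean_ascii` does. Both function names are hypothetical, and `clean_unicode` is only one possible alternative:

```python
import string
import unicodedata


def clean_ascii(line):
    # Assumed cleaning: decompose characters, then drop everything that is not ASCII.
    line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    table = str.maketrans('', '', string.punctuation)
    words = [word.lower().translate(table) for word in line.split()]
    # Keep only purely alphabetic tokens.
    return ' '.join(word for word in words if word.isalpha())


def clean_unicode(line):
    # Hypothetical alternative: keep letters from any script.
    drop = ('P', 'S', 'C')  # Unicode punctuation, symbol and control categories
    line = unicodedata.normalize('NFKC', line)
    words = []
    for word in line.split():
        kept = ''.join(ch for ch in word if not unicodedata.category(ch).startswith(drop))
        # Keep the token if at least one letter survives.
        if any(ch.isalpha() for ch in kept):
            words.append(kept.lower())
    return ' '.join(words)


# Arabic for "hello" followed by a full stop, written with escapes for portability.
sample = '\u0645\u0631\u062d\u0628\u0627.'
print(repr(clean_ascii(sample)))    # -> ''
print(repr(clean_unicode(sample)))
```

            If the cleaning is changed this way, the tokenizer and vocabulary sizes would need to be rebuilt from the newly cleaned pairs before training.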

          • henok meskele December 27, 2018 at 12:53 am #

            I need Universal Networking Language (UNL) based algorithms and an explanation of how they work. I also want to integrate other algorithms with the UNL framework's enco and deco functions.

          • Jason Brownlee December 27, 2018 at 5:45 am #

            I don’t have material on that topic, sorry.

  80. Nikos November 29, 2018 at 9:07 am #

    Excellent work! Thank you, Jason!

  81. Naresh December 3, 2018 at 3:11 am #

    Can I use this model to train Chinese-to-English translation? Since Chinese is quite different from other languages, what precautions do I need to take?

  82. Sourabh December 5, 2018 at 12:15 am #

    Hello Sir, Thank you very much for this wonderful guide!!!
    I just have one question: can we build a model that translates both ways, i.e. Language1 to Language2 and also Language2 to Language1?

  83. sree harsha December 11, 2018 at 12:39 am #

    Hi Jason, can you please clarify: in this model, are we giving the word embeddings as the hidden state input to the encoder LSTM?

    Thanks in advance!

    • Jason Brownlee December 11, 2018 at 7:45 am #

      The embedding is provided as input to the LSTM, not as its hidden state.

      • sree harsha December 12, 2018 at 7:11 am #

        Thank you for your reply 🙂 Is any direct input given to the second LSTM, or does it receive only the hidden state from the first one?