Gentle Introduction to Statistical Language Modeling and Neural Language Models

Language modeling is central to many important natural language processing tasks.

Recently, neural-network-based language models have demonstrated better performance than classical methods both standalone and as part of more challenging natural language processing tasks.

In this post, you will discover language modeling for natural language processing.

After reading this post, you will know:

  • Why language modeling is critical to addressing tasks in natural language processing.
  • What a language model is and some examples of where they are used.
  • How neural networks can be used for language modeling.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Updated Jun/2019: Added links to step-by-step language model tutorials.
Gentle Introduction to Statistical Language Modeling and Neural Language Models

Gentle Introduction to Statistical Language Modeling and Neural Language Models
Photo by Chris Sorge, some rights reserved.


This post is divided into 3 parts; they are:

  1. Problem of Modeling Language
  2. Statistical Language Modeling
  3. Neural Language Models

Need help with Deep Learning for Text Data?

Take my free 7-day email crash course now (with code).

Click to sign-up and also get a free PDF Ebook version of the course.

1. Problem of Modeling Language

Formal languages, like programming languages, can be fully specified.

All the reserved words can be defined and the valid ways that they can be used can be precisely defined.

We cannot do this with natural language. Natural languages are not designed; they emerge, and therefore there is no formal specification.

There may be formal rules for parts of the language, and heuristics, but natural language that does not confirm is often used. Natural languages involve vast numbers of terms that can be used in ways that introduce all kinds of ambiguities, yet can still be understood by other humans.

Further, languages change, word usages change: it is a moving target.

Nevertheless, linguists try to specify the language with formal grammars and structures. It can be done, but it is very difficult and the results can be fragile.

An alternative approach to specifying the model of the language is to learn it from examples.

2. Statistical Language Modeling

Statistical Language Modeling, or Language Modeling and LM for short, is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.

Language modeling is the task of assigning a probability to sentences in a language. […] Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words

— Page 105, Neural Network Methods in Natural Language Processing, 2017.

A language model learns the probability of word occurrence based on examples of text. Simpler models may look at a context of a short sequence of words, whereas larger models may work at the level of sentences or paragraphs. Most commonly, language models operate at the level of words.

The notion of a language model is inherently probabilistic. A language model is a function that puts a probability measure over strings drawn from some vocabulary.

— Page 238, An Introduction to Information Retrieval, 2008.

A language model can be developed and used standalone, such as to generate new sequences of text that appear to have come from the corpus.

Language modeling is a root problem for a large range of natural language processing tasks. More practically, language models are used on the front-end or back-end of a more sophisticated model for a task that requires language understanding.

… language modeling is a crucial component in real-world applications such as machine-translation and automatic speech recognition, […] For these reasons, language modeling plays a central role in natural-language processing, AI, and machine-learning research.

— Page 105, Neural Network Methods in Natural Language Processing, 2017.

A good example is speech recognition, where audio data is used as an input to the model and the output requires a language model that interprets the input signal and recognizes each new word within the context of the words already recognized.

Speech recognition is principally concerned with the problem of transcribing the speech signal as a sequence of words. […] From this point of view, speech is assumed to be a generated by a language model which provides estimates of Pr(w) for all word strings w independently of the observed signal […] THe goal of speech recognition is to find the most likely word sequence given the observed acoustic signal.

— Pages 205-206, The Oxford Handbook of Computational Linguistics, 2005.

Similarly, language models are used to generate text in many similar natural language processing tasks, for example:

  • Optical Character Recognition
  • Handwriting Recognition.
  • Machine Translation.
  • Spelling Correction.
  • Image Captioning.
  • Text Summarization
  • And much more.

Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction

A Bit of Progress in Language Modeling, 2001.

Developing better language models often results in models that perform better on their intended natural language processing task. This is the motivation for developing better and more accurate language models.

[language models] have played a key role in traditional NLP tasks such as speech recognition, machine translation, or text summarization. Often (although not always), training better language models improves the underlying metrics of the downstream task (such as word error rate for speech recognition, or BLEU score for translation), which makes the task of training better LMs valuable by itself.

Exploring the Limits of Language Modeling, 2016.

3. Neural Language Models

Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach.

The use of neural networks in language modeling is often called Neural Language Modeling, or NLM for short.

Neural network approaches are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation.

A key reason for the leaps in improved performance may be the method’s ability to generalize.

Nonlinear neural network models solve some of the shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters, they alleviate the need for manually designing backoff orders, and they support generalization across different contexts.

— Page 109, Neural Network Methods in Natural Language Processing, 2017.

Specifically, a word embedding is adopted that uses a real-valued vector to represent each word in a project vector space. This learned representation of words based on their usage allows words with a similar meaning to have a similar representation.

Neural Language Models (NLM) address the n-gram data sparsity issue through parameterization of words as vectors (word embeddings) and using them as inputs to a neural network. The parameters are learned as part of the training process. Word embeddings obtained through NLMs exhibit the property whereby semantically close words are likewise close in the induced vector space.

Character-Aware Neural Language Model, 2015.

This generalization is something that the representation used in classical statistical language models can not easily achieve.

“True generalization” is difficult to obtain in a discrete word indice space, since there is no obvious relation between the word indices.

Connectionist language modeling for large vocabulary continuous speech recognition, 2002.

Further, the distributed representation approach allows the embedding representation to scale better with the size of the vocabulary. Classical methods that have one discrete representation per word fight the curse of dimensionality with larger and larger vocabularies of words that result in longer and more sparse representations.

The neural network approach to language modeling can be described using the three following model properties, taken from “A Neural Probabilistic Language Model“, 2003.

  1. Associate each word in the vocabulary with a distributed word feature vector.
  2. Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
  3. Learn simultaneously the word feature vector and the parameters of the probability function.

This represents a relatively simple model where both the representation and probabilistic model are learned together directly from raw text data.

Recently, the neural based approaches have started to and then consistently started to outperform the classical statistical approaches.

We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except their high computational (training) complexity.

Recurrent neural network based language model, 2010.

Initially, feed-forward neural network models were used to introduce the approach.

More recently, recurrent neural networks and then networks with a long-term memory like the Long Short-Term Memory network, or LSTM, allow the models to learn the relevant context over much longer input sequences than the simpler feed-forward networks.

[an RNN language model] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns itself from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, recurrent neural network (which can be considered as a deep architecture) can perform clustering of similar histories. This allows for instance efficient representation of patterns with variable length.

Extensions of recurrent neural network language model, 2011.

Recently, researchers have been seeking the limits of these language models. In the paper “Exploring the Limits of Language Modeling“, evaluating language models over large datasets, such as the corpus of one million words, the authors find that LSTM-based neural language models out-perform the classical methods.

… we have shown that RNN LMs can be trained on large amounts of data, and outperform competing models including carefully tuned N-grams.

Exploring the Limits of Language Modeling, 2016.

Further, they propose some heuristics for developing high-performing neural language models in general:

  • Size matters. The best models were the largest models, specifically number of memory units.
  • Regularization matters. Use of regularization like dropout on input connections improves results.
  • CNNs vs Embeddings. Character-level Convolutional Neural Network (CNN) models can be used on the front-end instead of word embeddings, achieving similar and sometimes better results.
  • Ensembles matter. Combining the prediction from multiple models can offer large improvements in model performance.

Language Model Tutorials

This section lists some step-by-step tutorials for developing deep learning neural network language models.

Further Reading

This section provides more resources on the topic if you are looking go deeper.





In this post, you discovered language modeling for natural language processing tasks.

Specifically, you learned:

  • That natural language is not formally specified and requires the use of statistical models to learn from examples.
  • That statistical language models are central to many challenging natural language processing tasks.
  • That state-of-the-art results are achieved using neural language models, specifically those with word embeddings and recurrent neural network algorithms.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Deep Learning models for Text Data Today!

Deep Learning for Natural Language Processing

Develop Your Own Text models in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Deep Learning for Natural Language Processing

It provides self-study tutorials on topics like:
Bag-of-Words, Word Embedding, Language Models, Caption Generation, Text Translation and much more...

Finally Bring Deep Learning to your Natural Language Processing Projects

Skip the Academics. Just Results.

See What's Inside

10 Responses to Gentle Introduction to Statistical Language Modeling and Neural Language Models

  1. Avatar
    nigus July 9, 2018 at 5:24 pm #

    Hello Dear Dr. Jason, I have been followed your tutorial, and it is so interesting.
    now, I have the following questions on the topic of OCR.
    1. could you give me a simple example how to implement CNN and LSTM for text image recognition( e.g if the image is ” playing foot ball” and the equivalent text is ‘playing foot ball’ the how to give the image and the text for training?) please?

  2. Avatar
    Bhaarat September 16, 2018 at 11:56 am #

    Nice article, references helped a lot, however, I was hoping to read all about the LM at one place switching between papers and reading them, makes me lose the grip on the topic. I know, it’s not the article’s fault but I would be extremely happy if you have explained this topic in your own words as you usually do. Anyways, thanks for putting up this.

  3. Avatar
    satish October 5, 2018 at 4:06 am #

    Hi Jason,

    Thanks for this beautiful post. Is the NLM still an active area of research? or did we reach some saturation?

    • Avatar
      Jason Brownlee October 5, 2018 at 5:39 am #


      I believe so, check on

  4. Avatar
    Teshome Bekele August 18, 2019 at 9:08 pm #

    Hello Jason Brownlee

    I am Teshome From Ethiopia, I am a beginner for word embedding so how to start from scratch?

  5. Avatar
    Angela April 19, 2020 at 10:20 am #

    Hi Jason,

    Thanks for your blog post. This is so informative! Love your blog in general.

    I don’t quite understand #3 in this three-step approach:

    1. Associate each word in the vocabulary with a distributed word feature vector.

    2. Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.

    3. Learn simultaneously the word feature vector and the parameters of the probability function.

    Why does the word feature vector need to be trained if they are pre-trained word embeddings? Is it because they still need to be trained for the final task? In addition, what are the parameters of the probability function? What is the probability function?

Leave a Reply