Language modeling is central to many important natural language processing tasks.
Recently, neural-network-based language models have outperformed classical methods, both as standalone models and as components of more challenging natural language processing tasks.
In this post, you will discover language modeling for natural language processing.
After reading this post, you will know:
- Why language modeling is critical to addressing tasks in natural language processing.
- What a language model is and some examples of where they are used.
- How neural networks can be used for language modeling.
Let’s get started.
This post is divided into 3 parts; they are:
- Problem of Modeling Language
- Statistical Language Modeling
- Neural Language Models
1. Problem of Modeling Language
Formal languages, like programming languages, can be fully specified.
All the reserved words can be defined, and the valid ways in which they can be used can be precisely specified.
We cannot do this with natural language. Natural languages are not designed; they emerge, and therefore there is no formal specification.
There may be formal rules for parts of the language, and heuristics, but natural language that does not conform is often used. Natural languages involve vast numbers of terms that can be used in ways that introduce all kinds of ambiguities, yet can still be understood by other humans.
Further, languages change and word usages change: it is a moving target.
Nevertheless, linguists try to specify the language with formal grammars and structures. It can be done, but it is very difficult and the results can be fragile.
An alternative approach to specifying the model of the language is to learn it from examples.
2. Statistical Language Modeling
Statistical Language Modeling, or Language Modeling (LM) for short, is the development of probabilistic models that are able to predict the next word in a sequence given the words that precede it.
Language modeling is the task of assigning a probability to sentences in a language. […] Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words
— Page 105, Neural Network Methods in Natural Language Processing, 2017.
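The two views in the quote above, scoring a whole sentence and predicting the next word, are connected by the chain rule of probability. The notation below is an assumption for illustration (w_i is the i-th word, n is the sentence length, and k is the number of preceding words a simpler model keeps); it is not taken from the book.

```latex
% Chain rule: the probability of a sentence is the product of next-word probabilities.
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

% A model with a short context approximates each factor using only the last k words.
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})
```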
A language model learns the probability of word occurrence based on examples of text. Simpler models may look at a context of a short sequence of words, whereas larger models may work at the level of sentences or paragraphs. Most commonly, language models operate at the level of words.
The notion of a language model is inherently probabilistic. A language model is a function that puts a probability measure over strings drawn from some vocabulary.
— Page 238, An Introduction to Information Retrieval, 2008.
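To make "learning the probability of word occurrence from examples" concrete, here is a minimal sketch of a bigram model estimated by counting. The toy corpus, the function names, and the `<s>` and `</s>` boundary markers are all assumptions made for illustration, and the model uses plain relative frequencies with no smoothing, so any unseen word pair gets probability zero.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]


def train_bigram(sentences):
    """Estimate P(next word | previous word) by counting adjacent word pairs."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence (zero if a pair is unseen)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


bigram_model = train_bigram(corpus)
print(bigram_model["the"])
print(sentence_probability(bigram_model, "the cat sat on the rug"))
print(sentence_probability(bigram_model, "rug the on sat cat"))
```

The word-order scramble in the last line scores zero because it starts with a word pair that never occurs in the corpus, which is the sense in which the model puts a probability measure over strings.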
A language model can be developed and used standalone, such as to generate new sequences of text that appear to have come from the corpus.
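As a rough sketch of standalone generation, the snippet below samples new word sequences from bigram counts. The corpus, the function names, the boundary markers, and the `max_len` cap are assumptions for illustration; a real model would be trained on far more text.

```python
import random
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]


def count_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=20):
    """Sample one word at a time, in proportion to how often it followed the previous word."""
    word, output = "<s>", []
    for _ in range(max_len):
        followers = counts[word]
        words = list(followers)
        weights = [followers[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        output.append(word)
    return output


bigram_counts = count_bigrams(corpus)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for _ in range(3):
    print(" ".join(generate(bigram_counts, rng)))
```

Every generated sequence is built only from word pairs seen in the corpus, which is why the output tends to look like it came from the same source.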
Language modeling is a root problem for a large range of natural language processing tasks. More practically, language models are used on the front-end or back-end of a more sophisticated model for a task that requires language understanding.
… language modeling is a crucial component in real-world applications such as machine-translation and automatic speech recognition, […] For these reasons, language modeling plays a central role in natural-language processing, AI, and machine-learning research.
— Page 105, Neural Network Methods in Natural Language Processing, 2017.
A good example is speech recognition, where audio data is used as an input to the model and the output requires a language model that interprets the input signal and recognizes each new word within the context of the words already recognized.
Speech recognition is principally concerned with the problem of transcribing the speech signal as a sequence of words. […] From this point of view, speech is assumed to be generated by a language model which provides estimates of Pr(w) for all word strings w independently of the observed signal […] The goal of speech recognition is to find the most likely word sequence given the observed acoustic signal.
— Pages 205-206, The Oxford Handbook of Computational Linguistics, 2005.
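One common way to read the quote above is through Bayes' rule: the recognizer picks the word sequence w that maximizes Pr(audio | w) multiplied by Pr(w), where the second factor comes from the language model. The sketch below illustrates that idea in log space. The candidate transcriptions, their scores, and the `lm_weight` parameter are all made up for illustration and do not come from any real system.

```python
# Hypothetical candidate transcriptions for a single utterance.
# "acoustic" stands for log Pr(audio | words) and "lm" for log Pr(words).
# All numbers are invented for illustration.
candidates = {
    "recognize speech": {"acoustic": -12.1, "lm": -7.5},
    "wreck a nice beach": {"acoustic": -11.8, "lm": -14.2},
}


def best_transcription(cands, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def combined(item):
        _, scores = item
        return scores["acoustic"] + lm_weight * scores["lm"]
    return max(cands.items(), key=combined)[0]


print(best_transcription(candidates, lm_weight=0.0))  # acoustic evidence only
print(best_transcription(candidates, lm_weight=1.0))  # acoustic evidence plus language model
```

With these made-up scores the acoustic evidence alone slightly prefers the implausible phrase, and adding the language model score tips the decision toward the more likely word sequence.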
Similarly, language models are used to generate or score text in many other natural language processing tasks, for example:
- Optical Character Recognition.
- Handwriting Recognition.
- Machine Translation.
- Spelling Correction.
- Image Captioning.
- Text Summarization.
- And much more.
Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction.
Developing better language models often results in models that perform better on their intended natural language processing task. This is the motivation for developing better and more accurate language models.
[language models] have played a key role in traditional NLP tasks such as speech recognition, machine translation, or text summarization. Often (although not always), training better language models improves the underlying metrics of the downstream task (such as word error rate for speech recognition, or BLEU score for translation), which makes the task of training better LMs valuable by itself.
3. Neural Language Models
Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach.
The use of neural networks in language modeling is often called Neural Language Modeling, or NLM for short.
Neural network approaches are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation.
A key reason for the leap in performance may be the method’s ability to generalize.
Nonlinear neural network models solve some of the shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters, they alleviate the need for manually designing backoff orders, and they support generalization across different contexts.
— Page 109, Neural Network Methods in Natural Language Processing, 2017.
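The point about context size in the quote above can be checked with some simple arithmetic. The sketch below compares an upper bound on the size of an n-gram probability table with the parameter count of a simple feed-forward language model. The function names and the layer layout (an embedding table, one hidden layer over the concatenated context, and an output layer) are assumptions for illustration, and real n-gram tables store only the n-grams actually observed.

```python
def ngram_table_size(vocab_size, context_size):
    """Upper bound: one probability for every (context, next word) combination."""
    return vocab_size ** (context_size + 1)


def feedforward_lm_params(vocab_size, context_size, embed_dim, hidden_dim):
    """Parameter count of an assumed simple feed-forward language model."""
    embeddings = vocab_size * embed_dim                          # one vector per word
    hidden = context_size * embed_dim * hidden_dim + hidden_dim  # weights plus biases
    output = hidden_dim * vocab_size + vocab_size                # weights plus biases
    return embeddings + hidden + output


for k in (1, 2, 3, 4):
    print(k, ngram_table_size(10_000, k), feedforward_lm_params(10_000, k, 50, 100))
```

Each extra context word multiplies the n-gram bound by the vocabulary size, whereas it adds only `embed_dim * hidden_dim` weights to the network.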
Specifically, a word embedding is adopted that uses a real-valued vector to represent each word in a projected vector space. This learned representation of words based on their usage allows words with a similar meaning to have a similar representation.
Neural Language Models (NLM) address the n-gram data sparsity issue through parameterization of words as vectors (word embeddings) and using them as inputs to a neural network. The parameters are learned as part of the training process. Word embeddings obtained through NLMs exhibit the property whereby semantically close words are likewise close in the induced vector space.
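The idea that semantically close words are close in the vector space is usually measured with cosine similarity. The sketch below uses three hand-picked vectors, invented purely for illustration; real embeddings are learned during training and typically have tens to hundreds of dimensions.

```python
import math

# Hand-picked 3-dimensional vectors, invented for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity(embeddings["cat"], embeddings["dog"]))
print(cosine_similarity(embeddings["cat"], embeddings["car"]))
```

With these toy vectors, "cat" and "dog" score much higher than "cat" and "car", which is the kind of structure a model can exploit to generalize from one word to a related one.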
This generalization is something that the representation used in classical statistical language models cannot easily achieve.
“True generalization” is difficult to obtain in a discrete word indice space, since there is no obvious relation between the word indices.
Further, the distributed representation approach allows the embedding representation to scale better with the size of the vocabulary. Classical methods that have one discrete representation per word fight the curse of dimensionality with larger and larger vocabularies of words that result in longer and more sparse representations.
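To see why a discrete representation offers no notion of similarity and grows with the vocabulary, consider a one-hot encoding. This is a minimal sketch; the five-word vocabulary and the helper names are assumptions for illustration.

```python
# Tiny vocabulary, invented for illustration only.
vocab = ["the", "cat", "dog", "sat", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """One discrete slot per vocabulary word: vector length equals vocabulary size."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(one_hot("cat"))                       # [0, 1, 0, 0, 0]
print(dot(one_hot("cat"), one_hot("dog")))  # 0
print(dot(one_hot("cat"), one_hot("mat")))  # 0
```

Every pair of distinct words has a dot product of zero, so "cat" is no closer to "dog" than to "mat", and each new vocabulary word adds another dimension. A dense embedding, by contrast, keeps a fixed number of dimensions however large the vocabulary grows.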
The neural network approach to language modeling can be described using the following three model properties, taken from “A Neural Probabilistic Language Model”, 2003.
- Associate each word in the vocabulary with a distributed word feature vector.
- Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
- Learn simultaneously the word feature vector and the parameters of the probability function.
This represents a relatively simple model where both the representation and probabilistic model are learned together directly from raw text data.
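The three properties above can be sketched as the forward pass of a small feed-forward network. This is a simplified illustration, not the exact architecture from the paper: the vocabulary, the layer sizes, and all names are assumptions, biases are omitted, and the weights are random rather than trained. Training (the third property) would adjust the embedding table and both weight matrices together, and is left out here.

```python
import math
import random

# Tiny vocabulary and layer sizes, chosen arbitrarily for illustration.
vocab = ["<s>", "the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}
VOCAB_SIZE, EMBED_DIM, CONTEXT_SIZE, HIDDEN_DIM = len(vocab), 4, 2, 8

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Property 1: every word in the vocabulary has a distributed feature vector.
embedding_table = random_matrix(VOCAB_SIZE, EMBED_DIM)
# Property 2: the probability function is expressed over those feature vectors.
hidden_weights = random_matrix(HIDDEN_DIM, CONTEXT_SIZE * EMBED_DIM)
output_weights = random_matrix(VOCAB_SIZE, HIDDEN_DIM)


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context_words):
    """Return P(next word | context) for every word in the vocabulary."""
    features = []
    for word in context_words:  # look up and concatenate the context embeddings
        features.extend(embedding_table[word_to_index[word]])
    hidden = [math.tanh(h) for h in matvec(hidden_weights, features)]
    return softmax(matvec(output_weights, hidden))


probs = next_word_distribution(["the", "cat"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Because the weights are untrained, the printed distribution is close to uniform; the point is only to show how the word vectors and the probability function fit together in one model.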
Recently, neural-based approaches have started to outperform the classical statistical approaches, at first occasionally and then consistently.
We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except their high computational (training) complexity.
Initially, feed-forward neural network models were used to introduce the approach.
More recently, recurrent neural networks, and then networks with a long-term memory such as the Long Short-Term Memory (LSTM) network, have allowed models to learn the relevant context over much longer input sequences than the simpler feed-forward networks.
[an RNN language model] provides further generalization: instead of considering just several preceding words, neurons with input from recurrent connections are assumed to represent short term memory. The model learns itself from the data how to represent memory. While shallow feedforward neural networks (those with just one hidden layer) can only cluster similar words, recurrent neural network (which can be considered as a deep architecture) can perform clustering of similar histories. This allows for instance efficient representation of patterns with variable length.
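The recurrent idea in the quote above can be sketched with a single simple recurrent layer: the hidden state is fed back in at every step, so a history of any length is summarized in a vector of fixed size. This is a plain recurrent step, not an LSTM, and the sizes, names, and random weights and inputs are assumptions for illustration.

```python
import math
import random

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 3, 4  # arbitrary sizes for illustration


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


input_weights = random_matrix(HIDDEN_DIM, EMBED_DIM)
recurrent_weights = random_matrix(HIDDEN_DIM, HIDDEN_DIM)


def rnn_step(hidden, word_vector):
    """Combine the current word with the previous hidden state (the 'short term memory')."""
    from_input = matvec(input_weights, word_vector)
    from_memory = matvec(recurrent_weights, hidden)
    return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]


def encode_history(word_vectors):
    """Fold a sequence of word vectors of any length into one fixed-size state."""
    hidden = [0.0] * HIDDEN_DIM
    for vector in word_vectors:
        hidden = rnn_step(hidden, vector)
    return hidden


# Random stand-ins for word embeddings: one history of 2 words, one of 7 words.
short_history = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(2)]
long_history = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(7)]

short_state = encode_history(short_history)
long_state = encode_history(long_history)
print(len(short_state), len(long_state))  # 4 4
```

Both histories end up as a state of the same size, which is what lets a recurrent model handle patterns of variable length, whereas a feed-forward model must fix the number of context words in advance.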
Recently, researchers have been seeking the limits of these language models. In the paper “Exploring the Limits of Language Modeling”, which evaluates language models on large datasets such as the One Billion Word Benchmark, the authors find that LSTM-based neural language models outperform the classical methods.
… we have shown that RNN LMs can be trained on large amounts of data, and outperform competing models including carefully tuned N-grams.
Further, they propose some heuristics for developing high-performing neural language models in general:
- Size matters. The best models were the largest models, specifically in terms of the number of memory units.
- Regularization matters. Use of regularization like dropout on input connections improves results.
- CNNs vs Embeddings. Character-level Convolutional Neural Network (CNN) models can be used on the front-end instead of word embeddings, achieving similar and sometimes better results.
- Ensembles matter. Combining the prediction from multiple models can offer large improvements in model performance.
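As an illustration of the last heuristic, one simple way to combine models is to average their predicted next-word distributions. This is a sketch of plain uniform averaging under made-up predictions; the paper's exact combination scheme may differ, and the model names and probabilities below are invented.

```python
def average_distributions(distributions):
    """Uniformly average several next-word probability distributions."""
    words = set().union(*distributions)  # every word predicted by any model
    n = len(distributions)
    return {word: sum(d.get(word, 0.0) for d in distributions) / n for word in words}


# Made-up predictions from three hypothetical models for the same context.
model_a = {"mat": 0.6, "rug": 0.3, "dog": 0.1}
model_b = {"mat": 0.2, "rug": 0.7, "dog": 0.1}
model_c = {"mat": 0.5, "rug": 0.4, "dog": 0.1}

ensemble = average_distributions([model_a, model_b, model_c])
for word, p in sorted(ensemble.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.3f}")
```

Because each input is a valid distribution, the average is one too, and individual models' over-confident mistakes are damped by the others.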
This section provides more resources on the topic if you are looking to go deeper.
- Chapter 9 Language Modeling, Neural Network Methods in Natural Language Processing, 2017.
- Chapter 22, Natural Language Processing, Artificial Intelligence A Modern Approach, 2009.
- Chapter 12, Language models for information retrieval, An Introduction to Information Retrieval, 2008.
- A Neural Probabilistic Language Model, NIPS, 2001.
- A Neural Probabilistic Language Model, JMLR, 2003.
- Connectionist language modeling for large vocabulary continuous speech recognition, 2002.
- Recurrent neural network based language model, 2010.
- Extensions of recurrent neural network language model, 2011.
- Character-Aware Neural Language Models, 2015.
- LSTM Neural Networks for Language Modeling, 2012.
- Exploring the Limits of Language Modeling, 2016.
In this post, you discovered language modeling for natural language processing tasks.
Specifically, you learned:
- That natural language is not formally specified and requires the use of statistical models to learn from examples.
- That statistical language models are central to many challenging natural language processing tasks.
- That state-of-the-art results are achieved using neural language models, specifically those with word embeddings and recurrent neural network algorithms.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.