Auto-Completion Style Text Generation with GPT-2 Model

Generating gibberish text is a simple programming exercise for beginners, but completing a sentence meaningfully takes a lot more work. The landscape of auto-completion technology has transformed dramatically with the introduction of neural approaches. With Hugging Face’s transformers library, implementing text completion takes only a few lines of code. In this tutorial, you will implement several examples and explore how modern systems differ from traditional ones and why these differences matter.


Let’s get started.


Overview

This post is in four parts; they are:

  • Traditional vs Neural Approaches
  • Auto-Complete Architecture
  • Basic Auto-Complete Implementation
  • Caching and Batched Input

Traditional vs Neural Approaches

When you type a word such as “machine” into Google’s search bar, you may see additional words suggested, such as “learning”, to make up “machine learning”. This is auto-complete technology. The suggestion may not be what you expect, but it is usually coherent.

Traditional auto-complete systems have relied on relatively simple statistical methods. N-gram models predict the next word by looking at a fixed window of previous words and comparing them to collected samples. This method struggles with longer contexts and novel combinations. Dictionary-based approaches can only suggest words they’ve seen before, limiting their ability to handle new terminology. Frequency analysis provides suggestions based on common patterns but often misses the nuanced context of the current text.
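
To make the limitation concrete, here is a minimal sketch of a bigram (2-gram) suggester. The function names and the toy corpus are made up for illustration. Notice that it looks only at the last word typed and returns nothing for a word it has never seen.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, how often every other word follows it."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def suggest(follow_counts, text, k=3):
    """Suggest up to k next words using only the last word of the text."""
    words = text.lower().split()
    if not words or words[-1] not in follow_counts:
        return []  # empty input or unseen word: nothing to suggest
    return [word for word, _ in follow_counts[words[-1]].most_common(k)]


# Toy corpus, purely illustrative
corpus = [
    "machine learning is fun",
    "machine learning is powerful",
    "machine translation is hard",
]
model = train_bigram(corpus)
print(suggest(model, "I like machine"))  # ['learning', 'translation']
print(suggest(model, "quantum"))         # []
```

Everything before the last word ("I like") is ignored, which is exactly the fixed-window weakness described above.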

Neural auto-complete systems, particularly those based on GPT-2, represent a fundamental shift in capability. These systems model context instead of just matching words. They consider the full scope of the previous text rather than just a few words. They capture semantic relationships, enabling them to suggest completions that match not just the grammar but also the meaning of the text. Their generative capability allows them to produce entire phrases or sentences that maintain coherence with the existing content.

Auto-Complete Architecture

A modern neural auto-complete system combines three main components: a language model, a tokenizer, and a generation controller.

The language model is the core of the system. It processes input text and maintains an internal state that tracks the ongoing text generation process. The tokenizer acts as a bridge between human-readable text and the model’s numerical representations. The generation controller orchestrates the process, applying strategies to filter and rank potential completions. It balances response time against suggestion quality so that users receive helpful completions without noticeable delay.

Developing an effective neural auto-complete system involves overcoming several critical challenges. Latency is a primary concern: users expect suggestions within milliseconds, yet each suggestion requires computationally expensive neural network operations.

Quality control is another challenge. The system is expected to generate relevant suggestions, thus requiring advanced filtering mechanisms to prevent inappropriate completions and ensure suggestions align with the user’s domain and writing style.
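
As a rough illustration of the filtering and ranking step, the sketch below takes candidate completions with scores, drops empty strings, duplicates, and anything containing a blocked word, then keeps the best few. The function name, the (text, score) format, and the blocklist are all hypothetical; a real system would likely use a more capable content filter.

```python
def rank_completions(candidates, blocklist, top_k=3):
    """Filter (text, score) pairs and return the top_k texts, best score first."""
    seen = set()
    kept = []
    for text, score in candidates:
        normalized = text.strip().lower()
        if not normalized or normalized in seen:
            continue  # drop empty strings and case-insensitive duplicates
        if any(word in blocklist for word in normalized.split()):
            continue  # drop completions containing a blocked word
        seen.add(normalized)
        kept.append((text.strip(), score))
    kept.sort(key=lambda pair: pair[1], reverse=True)  # higher score is better
    return [text for text, _ in kept[:top_k]]


# Hypothetical candidates with log-probability-like scores
candidates = [
    ("learning models", -1.2),
    ("Learning models", -1.5),   # duplicate of the first, ignoring case
    ("learning badword", -0.3),  # contains a blocked word
    ("translation systems", -2.0),
    ("", -0.1),                  # empty completion
]
print(rank_completions(candidates, blocklist={"badword"}))
# ['learning models', 'translation systems']
```

Note that this sketch keeps the first occurrence of a duplicate rather than the best-scoring one, which is a simplification.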

Resource management is crucial when scaling the system to support multiple users. The substantial memory demands of neural models and the computational intensity of text generation must be carefully balanced against system resources and response time requirements.

Basic Auto-Complete Implementation

Let’s put aside the considerations of a larger system and focus on a simple auto-complete function for partial text. It is easy to implement using the pre-trained models from the transformers library:
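
The listing below is a minimal sketch of such a class, written to match the description in this section. The max_new_tokens argument and its default of 20 are assumptions added for illustration; "gpt2" is the name of the smallest public GPT-2 checkpoint on the Hugging Face hub.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class AutoComplete:
    def __init__(self, model_name="gpt2"):
        # Load the tokenizer and the pre-trained language model
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.model.eval()  # inference only, no training

    def get_completion(self, text, max_new_tokens=20):
        # Convert text to token IDs (and an attention mask) as PyTorch tensors
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():  # no gradients needed for generation
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,  # assumed length limit
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7,
            )
        # Convert the generated token IDs back to text
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```

To try it, create the object once and call it with a partial sentence, for example AutoComplete().get_completion("Machine learning is"). The first call downloads the model weights, and because sampling is enabled, the output differs from run to run.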

Let’s see what the above code does. You defined the class AutoComplete, which loads GPT2Tokenizer as the text tokenizer and GPT2LMHeadModel as a pre-trained GPT-2 model capable of text generation. The model is set to evaluation mode since you are using it for inference, not training it.

Text generation happens in the function get_completion(). The input text is tokenized before being passed to the model. You invoke the model within the torch.no_grad() context to skip gradient calculation, which saves time and memory. The model is called with temperature=0.7 for balanced creativity. The model output is then converted back to text using the tokenizer. The other parameters in self.model.generate() are:

  • num_return_sequences=1 to generate only one completion. The model can possibly generate multiple outputs for the same input.
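  • pad_token_id=self.tokenizer.eos_token_id reuses the end-of-sequence token for padding, since GPT-2 does not define a padding token of its own. This also avoids a warning during generation.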
  • do_sample=True enables sampling instead of deterministic text generation. You need this for creative generation.
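
To see what the temperature setting does when sampling is enabled, the sketch below applies a temperature-scaled softmax to a few made-up logits in plain Python. It is only an illustration of the standard formula, not the library's internal code: dividing the logits by a temperature below 1 sharpens the distribution, so the most likely token is picked more often.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens
logits = [2.0, 1.0, 0.0]
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With temperature=0.7, the probability of the top token is higher than with temperature=1.0, which is why the output stays varied but less erratic.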

Caching and Batched Input

The code above works as a simple program, but it needs some polishing before you can run it as a service.

First, let’s implement a caching system to improve performance for real-time applications:
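
One way to sketch this is with functools.lru_cache from the standard library. To keep the example runnable without downloading a model, the class below wraps any completion function you pass in; the complete_fn argument, the cache_size default, and the fake_complete stand-in are assumptions for illustration. In practice you would pass the get_completion method of an AutoComplete object.

```python
from functools import lru_cache


class CachedAutoComplete:
    def __init__(self, complete_fn, cache_size=1000):
        # Wrap the generation function in an LRU cache owned by this instance
        self._cached = lru_cache(maxsize=cache_size)(complete_fn)

    def get_completion(self, text):
        # Identical input text is served from the cache after the first call
        return self._cached(text)

    def cache_info(self):
        # Expose hit/miss statistics for monitoring
        return self._cached.cache_info()


# Stand-in for the model call, so the sketch runs without GPT-2
calls = []

def fake_complete(text):
    calls.append(text)
    return text + " ..."


cached = CachedAutoComplete(fake_complete, cache_size=2)
cached.get_completion("machine")
cached.get_completion("machine")  # served from the cache
print(len(calls))                 # 1
print(cached.cache_info().hits)   # 1
```

The cache key is the exact input string, so "machine" and "machine " are cached separately.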

The cached version builds upon the previous class by wrapping the generation function in an LRU cache. Python’s functools.lru_cache handles the caching automatically. Simply use CachedAutoComplete instead of AutoComplete, and everything will work the same, except that the cache instantly returns results for previously processed inputs. Because sampling is enabled, a cached input always returns the completion it produced the first time.

Now, let’s optimize the system further for better real-time performance. One challenge in creating a service is handling multiple users simultaneously, making it beneficial to process multiple inputs as a batch. However, this increases memory usage. You can offset the extra memory by storing the model weights as 16-bit floats.

Converting a model to 16-bit floats is as simple as self.model = self.model.half() in the constructor. Most CPUs do not handle 16-bit float arithmetic efficiently, so you should do this only if you can run the model on a GPU. Note that the function generate_batch() is mostly the same as the previous get_completion() function, but you need to decode each sequence in the batched output and collect the results in a list.
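
A quick back-of-the-envelope calculation shows the saving for the weights alone. The parameter count below is the approximate size of the smallest GPT-2 model (about 124 million parameters); activations and other runtime buffers are ignored, so treat the numbers as a rough lower bound.

```python
def model_memory_mb(num_params, bytes_per_param):
    """Memory needed to store the weights, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


GPT2_SMALL_PARAMS = 124_000_000  # approximate parameter count

fp32_mb = model_memory_mb(GPT2_SMALL_PARAMS, 4)  # 32-bit floats: 4 bytes each
fp16_mb = model_memory_mb(GPT2_SMALL_PARAMS, 2)  # 16-bit floats: 2 bytes each
print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
# float32: 496 MB, float16: 248 MB
```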

Below is the complete code, including how to use the batched generation:
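
The following is a sketch of what such a class could look like, not a verbatim listing. The class name OptimizedAutoComplete, the max_new_tokens argument, and the choice to fall back to 32-bit floats on a CPU are assumptions. Left padding is used because GPT-2 continues from the last token of each row, so the padding should go in front of the shorter prompts.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class OptimizedAutoComplete:
    def __init__(self, model_name="gpt2"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        # GPT-2 has no padding token; reuse EOS and pad on the left for generation
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "left"
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        if self.device == "cuda":
            self.model = self.model.half()  # 16-bit floats only on a GPU
        self.model.to(self.device)
        self.model.eval()

    def generate_batch(self, texts, max_new_tokens=20):
        # Tokenize all prompts together, padding them to the same length
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True)
        inputs = inputs.to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,  # input_ids and attention_mask
                max_new_tokens=max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7,
            )
        # Decode every sequence in the batch into a list of strings
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

To use it, create the object once and pass a list of partial texts, for example OptimizedAutoComplete().generate_batch(["Machine learning is", "Deep neural networks can"]). The result is a list with one completed string per input, in the same order.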

Summary

In this tutorial, you discovered how to build an intelligent auto-complete system using GPT-2. Specifically, you learned:

  • The theory behind neural auto-completion systems
  • How to implement basic auto-completion
  • How to add caching for better performance
  • How to make context-aware suggestions
  • How to optimize for real-time use

The code examples are a starting point that you can extend to build auto-complete applications.

 

 
