Generating gibberish text is a simple programming exercise for beginners. Completing a sentence meaningfully, however, takes far more work. Auto-completion technology has changed dramatically with the introduction of neural approaches. With Hugging Face’s transformers library, implementing text completion takes only a few lines of code. In this tutorial, you will implement several examples and explore how modern systems differ from traditional ones and why these differences matter.
Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.
Let’s get started.

Auto-Completion Style Text Generation with GPT-2 Model
Photo by Jayphen Simpson. Some rights reserved.
Overview
This post is in four parts; they are:
- Traditional vs Neural Approaches
- Auto-Complete Architecture
- Basic Auto-Complete Implementation
- Caching and Batched Input
Traditional vs Neural Approaches
When you type a word such as “machine” into Google’s search bar, you may find additional words suggested, such as “learning,” to make up “machine learning”. This is auto-complete technology. The suggestion may not be what you expect, but it is usually coherent.
Traditional auto-complete systems have relied on relatively simple statistical methods. N-gram models predict the next word by looking at a fixed window of previous words and comparing them to collected samples. This method struggles with longer contexts and novel combinations. Dictionary-based approaches can only suggest words they’ve seen before, limiting their ability to handle new terminology. Frequency analysis provides suggestions based on common patterns but often misses the nuanced context of the current text.
Neural auto-complete systems, particularly those based on models such as GPT-2, represent a fundamental shift in capability. These systems model context instead of matching words. They consider the full scope of the previous text rather than just a few words. They capture semantic relationships, enabling them to suggest completions that match not just the grammar but also the meaning of the text. Their generative capability allows them to produce entire phrases or sentences that stay coherent with the existing content.
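To make the n-gram idea concrete, here is a minimal sketch of a bigram suggester built with the standard library only. The function names `train_bigram` and `suggest` and the tiny corpus are made up for illustration; they are not from any library. The sketch looks only at the last word typed, which is exactly why this family of methods struggles with longer contexts and unseen words.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows each word in a list of sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def suggest(following, text, k=3):
    """Suggest up to k next words based only on the last word typed."""
    words = text.lower().split()
    if not words:
        return []
    counts = following.get(words[-1], Counter())
    return [word for word, _ in counts.most_common(k)]

# Illustrative corpus (an assumption, not real training data)
corpus = [
    "machine learning is fun",
    "machine learning is hard",
    "machine translation is useful",
    "deep learning is popular",
]
model = train_bigram(corpus)
print(suggest(model, "I like machine"))   # ['learning', 'translation']
print(suggest(model, "I like quantum"))   # [] because "quantum" was never seen
```

Note that everything before the last word is ignored, and a word missing from the corpus yields no suggestion at all.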
Auto-Complete Architecture
A modern neural auto-complete system consists of several components that work together.
The language model is the core of the system. It processes the input text and keeps an internal state that tracks the text generated so far. The tokenizer converts between human-readable text and the model’s numerical representation. The generation controller drives the process, applying decoding strategies to filter and rank candidate completions. It balances response time against suggestion quality so that users receive helpful completions without noticeable delay.
Developing an effective neural auto-complete system involves overcoming several critical challenges. Latency is a primary concern: users expect a turn-around time measured in milliseconds, yet the system must still carry out the heavy computation of a neural network.
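Before optimizing for latency, it helps to measure it. The following is a small sketch of a timing helper using the standard library. Both `timed` and `fake_completion` are hypothetical names; `fake_completion` merely sleeps to stand in for a model call, so the numbers it produces say nothing about real GPT-2 inference time.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

def fake_completion(text):
    """Stand-in for a model call; sleeps to simulate inference time."""
    time.sleep(0.02)
    return text + " intelligence"

result, ms = timed(fake_completion, "The future of artificial")
print(f"{result!r} took {ms:.1f} ms")
```

The same helper could wrap a real completion function to check whether it fits the latency budget of an interactive application.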
Quality control is another challenge. The system is expected to generate relevant suggestions, thus requiring advanced filtering mechanisms to prevent inappropriate completions and ensure suggestions align with the user’s domain and writing style.
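As a rough illustration of such filtering, the sketch below post-processes a list of candidate completions: it trims each one to a few words, drops empty or blocked candidates, and removes duplicates. The function name `filter_completions`, the blocklist, and the candidate strings are all assumptions made for this example; a real system would likely need far more careful checks.

```python
def filter_completions(candidates, blocklist, max_words=5):
    """Drop empty or blocked candidates, trim the rest, and de-duplicate in order."""
    seen = set()
    kept = []
    for cand in candidates:
        words = cand.strip().split()[:max_words]
        if not words:
            continue  # nothing left after stripping whitespace
        if any(w.lower().strip(".,!?") in blocklist for w in words):
            continue  # contains a blocked word
        trimmed = " ".join(words)
        if trimmed not in seen:
            seen.add(trimmed)
            kept.append(trimmed)
    return kept

# Made-up candidates, as if several completions were sampled for one prompt
candidates = [" intelligence is bright", " intelligence is bright", "  ", " stupidity wins"]
print(filter_completions(candidates, blocklist={"stupidity"}))
# ['intelligence is bright']
```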
Resource management is crucial when scaling the system to support multiple users. The substantial memory demands of neural models and the computational intensity of text generation must be carefully balanced against system resources and response time requirements.
Basic Auto-Complete Implementation
Let’s put aside the considerations of a larger system and focus on a simple auto-complete function that works on partial text. It is easy to implement using the pre-trained models from the transformers library:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

class AutoComplete:
    def __init__(self, model_name='gpt2'):
        """Initialize the auto-complete system."""
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

    def get_completion(self, text, max_length=50):
        """Generate completion for the input text."""
        # Encode the input text
        inputs = self.tokenizer(text, add_special_tokens=False, return_tensors="pt")
        input_ids = inputs["input_ids"].to(self.device)
        attn_masks = inputs["attention_mask"].to(self.device)

        # Generate completion
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids,
                attention_mask=attn_masks,
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode and extract completion
        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        completion = full_text[len(text):]

        return completion

# using autocomplete to see what we get
auto_complete = AutoComplete()
text = "The future of artificial"
completion = auto_complete.get_completion(text)
print(f"Input: {text}")
print(f"Completion: {completion}")
```
Let’s see what the above code does. You defined the class AutoComplete, which loads GPT2Tokenizer as the text tokenizer and GPT2LMHeadModel as a pre-trained GPT-2 model capable of text generation. The model is set to evaluation mode since you are using the model for inference, not training it.
Text generation happens in the function get_completion(). The input text is tokenized before being passed to the model. You invoke the model inside the torch.no_grad() context to skip gradient calculation, which saves time and memory. The model is called with temperature=0.7 for balanced creativity, and max_length=50 caps the total length, prompt included, at 50 tokens. The model output needs to be converted back to text using the tokenizer. The other parameters in self.model.generate() are:
- num_return_sequences=1 generates only one completion. The model can generate multiple outputs for the same input.
- pad_token_id=self.tokenizer.eos_token_id tells the model to use the end-of-sequence token for padding, since GPT-2 does not define a padding token of its own.
- do_sample=True enables sampling instead of deterministic text generation. You need this for creative generation, and temperature has an effect only when sampling is enabled.
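To see what the temperature setting does during sampling, consider the following sketch in plain Python. It divides a set of raw scores (logits) by the temperature before applying softmax, which is the usual way temperature is applied. The function name `softmax_with_temperature` and the logit values are made up for illustration and are not taken from GPT-2.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more probability mass moves to the highest-scoring token, so sampling becomes more conservative. A temperature near 1.0 keeps the distribution closer to the model’s raw prediction and produces more varied text.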
Caching and Batched Input
The code above works as a simple program, but it needs some polishing before you can run it as a service.
First, let’s implement a caching system to improve performance for real-time applications:
```python
from functools import lru_cache

class CachedAutoComplete(AutoComplete):
    def __init__(self, cache_size=1000, **kwargs):
        """Initialize with caching support."""
        super().__init__(**kwargs)
        self.get_completion = lru_cache(maxsize=cache_size)(
            self.get_completion
        )
```
This builds upon the previous class by wrapping the generation function with an LRU cache. Python’s functools module handles the caching automatically. Simply use CachedAutoComplete instead of AutoComplete, and everything will work the same, except that the cache will instantly return results for previously processed inputs.
Now, let’s optimize the system further for better real-time performance. One challenge in creating a service is handling multiple users simultaneously, so it is beneficial to process multiple inputs as a batch. However, this increases memory usage. You can offset part of the additional memory by storing the model in 16-bit floats:
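The behavior of an LRU cache can be observed without loading a model. The sketch below wraps a stand-in function, here called `slow_completion` (a hypothetical name), and uses a deliberately tiny cache size of 2 so that eviction is visible.

```python
from functools import lru_cache

calls = []

def slow_completion(text):
    """Stand-in for model inference; records each real call."""
    calls.append(text)
    return text + " ..."

cached = lru_cache(maxsize=2)(slow_completion)

cached("machine")
cached("machine")   # served from the cache; slow_completion is not called
cached("deep")
cached("neural")    # cache is full, so the least recently used key is evicted
cached("machine")   # evicted earlier, so it is computed again

print(calls)                # ['machine', 'deep', 'neural', 'machine']
print(cached.cache_info())  # CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)
```

One consequence worth keeping in mind: because get_completion() samples with do_sample=True, caching it freezes whichever sample was produced first for a given input, and every later request for the same input receives that same text.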
```python
class OptimizedAutoComplete(CachedAutoComplete):
    def __init__(self, **kwargs):
        """Initialize with optimizations."""
        super().__init__(**kwargs)
        # GPT-2 has no padding token, so reuse the end-of-sequence token
        self.tokenizer.pad_token = self.tokenizer.eos_token
        # Pad on the left so that generation continues right after each prompt
        self.tokenizer.padding_side = "left"

        if self.device == "cuda":
            self.model = self.model.half()  # Use FP16 on GPU

        # make sure the model stays in evaluation mode
        self.model.eval()

    def preprocess_batch(self, texts):
        """Efficiently process multiple texts."""
        # Tokenize all texts at once
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return inputs.to(self.device)

    def generate_batch(self, texts, max_length=50):
        """Generate completions for multiple texts."""
        # Preprocess batch
        inputs = self.preprocess_batch(texts)

        # Generate completions
        with torch.no_grad():
            outputs = self.model.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode completions
        completions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract new text
        results = []
        for text, completion in zip(texts, completions):
            results.append(completion[len(text):])

        return results
```
Converting a model to 16-bit floats is as simple as self.model = self.model.half() in the constructor. Half-precision arithmetic is poorly supported or slow on most CPUs. Hence, you should do that only if you can run the model on a GPU. Note that the function generate_batch() is mostly the same as the previous get_completion() function, but you need to process the batched output and put the results into a list.
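The memory saving from half precision can be estimated with simple arithmetic. The sketch below assumes roughly 124 million parameters for the default "gpt2" checkpoint and counts the weights only; activations, the key-value cache, and framework overhead are not included, so treat the numbers as a lower bound. The helper name `weight_memory_mb` is made up for this example.

```python
def weight_memory_mb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

GPT2_SMALL_PARAMS = 124_000_000  # approximate size of the default "gpt2" checkpoint

fp32 = weight_memory_mb(GPT2_SMALL_PARAMS, 4)  # 32-bit float = 4 bytes
fp16 = weight_memory_mb(GPT2_SMALL_PARAMS, 2)  # 16-bit float = 2 bytes
print(f"FP32: {fp32:.0f} MB, FP16: {fp16:.0f} MB")  # FP32: 496 MB, FP16: 248 MB
```

Halving the bytes per parameter halves the weight memory, which leaves more room for larger batches.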
Below is the complete code, including how to use the batched generation:
```python
from functools import lru_cache
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

class AutoComplete:
    def __init__(self, model_name="gpt2"):
        """Initialize the auto-complete system."""
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side="left")
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

    def get_completion(self, text, max_length=50):
        """Generate completion for the input text."""
        print("**** Completion:", text)
        # Encode the input text
        inputs = self.tokenizer(text, add_special_tokens=False, return_tensors="pt")
        input_ids = inputs["input_ids"].to(self.device)
        attn_masks = inputs["attention_mask"].to(self.device)

        # Generate completion
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids,
                attention_mask=attn_masks,
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode and extract completion
        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        completion = full_text[len(text):]

        return completion


class CachedAutoComplete(AutoComplete):
    def __init__(self, cache_size=1000, **kwargs):
        """Initialize with caching support."""
        super().__init__(**kwargs)
        self.get_completion = lru_cache(maxsize=cache_size)(
            self.get_completion
        )


class OptimizedAutoComplete(CachedAutoComplete):
    def __init__(self, **kwargs):
        """Initialize with optimizations."""
        super().__init__(**kwargs)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        if self.device == "cuda":
            self.model = self.model.half()  # Use FP16 on GPU

        # make sure the model stays in evaluation mode
        self.model.eval()

    def preprocess_batch(self, texts):
        """Efficiently process multiple texts."""
        # Tokenize all texts at once
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return inputs.to(self.device)

    def generate_batch(self, texts, max_length=50):
        """Generate completions for multiple texts."""
        # Preprocess batch
        inputs = self.preprocess_batch(texts)

        # Generate completions
        with torch.no_grad():
            outputs = self.model.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode completions
        completions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract new text
        results = []
        for text, completion in zip(texts, completions):
            results.append(completion[len(text):])

        return results


# Example: Optimized batch completion
optimized_complete = OptimizedAutoComplete()
texts = [
    "Machine learning is",
    "Deep neural networks can",
    "The training process involves"
]
completions = optimized_complete.generate_batch(texts)
for text, completion in zip(texts, completions):
    print(f"\nInput: {text}")
    print(f"Completion: {completion}")
```
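The complete code creates the tokenizer with padding_side="left". The sketch below shows in plain Python what that choice means for a batch of prompts with different lengths. The helper `pad_batch` and the token IDs are made up for illustration; the real tokenizer does this work internally.

```python
def pad_batch(token_lists, pad_id, side="left"):
    """Pad lists of token IDs to equal length and build attention masks."""
    width = max(len(tokens) for tokens in token_lists)
    ids, masks = [], []
    for tokens in token_lists:
        gap = width - len(tokens)
        if side == "left":
            ids.append([pad_id] * gap + tokens)
            masks.append([0] * gap + [1] * len(tokens))
        else:
            ids.append(tokens + [pad_id] * gap)
            masks.append([1] * len(tokens) + [0] * gap)
    return ids, masks

batch = [[11, 12, 13], [21]]  # made-up token IDs for two prompts
ids, masks = pad_batch(batch, pad_id=0, side="left")
print(ids)    # [[11, 12, 13], [0, 0, 21]]
print(masks)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, every prompt ends at the last position of its row, so a decoder-only model such as GPT-2 appends new tokens directly after the real text. With right padding, the new tokens would follow padding tokens instead. The attention mask marks which positions hold real tokens.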
Summary
In this tutorial, you discovered how to build an intelligent auto-complete system using GPT-2. Specifically, you learned:
- The theory behind neural auto-completion systems
- How to implement basic auto-completion
- How to add caching for better performance
- How to make context-aware suggestions
- How to optimize for real-time use
The code examples are a starting point that you can extend to build auto-complete applications.






