Auto-Completion Style Text Generation with GPT-2 Model

By Muhammad Asad Iqbal Khan on May 15, 2025 in Hugging Face Transformers

Generating gibberish text is a simple programming exercise for beginners. But completing a sentence meaningfully would require a lot of work. The landscape of auto-completion technology has transformed dramatically with the introduction of neural approaches. With Hugging Face’s transformers library, implementing text completion is only a few lines of code. In this comprehensive tutorial, you will implement several examples and explore how modern systems differ from traditional ones and why these differences matter.

Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.

Let’s get started.

Photo by Jayphen Simpson. Some rights reserved.

Overview

This post is in four parts; they are:

Traditional vs Neural Approaches
Auto-Complete Architecture
Basic Auto-Complete Implementation
Caching and Batched Input

Traditional vs Neural Approaches

When you type a word such as “machine” into Google’s search bar, you may see additional words suggested, such as “learning,” to form “machine learning”. This is auto-complete technology. The suggestion may not be what you expected, but it is usually coherent.

Traditional auto-complete systems have relied on relatively simple statistical methods. N-gram models predict the next word by looking at a fixed window of previous words and comparing them to collected samples. This method struggles with longer contexts and novel combinations. Dictionary-based approaches can only suggest words they’ve seen before, limiting their ability to handle new terminology. Frequency analysis provides suggestions based on common patterns but often misses the nuanced context of the current text.
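To make the n-gram idea concrete, here is a minimal sketch of a bigram suggester, which looks at a window of just one previous word. The toy corpus and the names build_bigram_model and suggest are made up for illustration; a real system would be built from far more text.

```python
from collections import Counter, defaultdict


def build_bigram_model(corpus):
    """Count which word follows which in a list of sentences."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            model[prev_word][next_word] += 1
    return model


def suggest(model, text, k=3):
    """Suggest up to k next words based only on the last word of text."""
    words = text.lower().split()
    if not words:
        return []
    last_word = words[-1]
    if last_word not in model:
        return []  # the model has never seen this word
    return [word for word, _ in model[last_word].most_common(k)]


# Hypothetical toy corpus, for illustration only
corpus = [
    "machine learning is fun",
    "machine learning is powerful",
    "machine translation is hard",
    "deep learning is powerful",
]
model = build_bigram_model(corpus)
print(suggest(model, "I like machine"))  # ['learning', 'translation']
print(suggest(model, "quantum"))         # []
```

The second call shows the limitation described above: a word that never appeared in the collected samples produces no suggestion at all, and everything before the last word is ignored.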

Neural auto-complete systems, particularly those based on GPT-2, represent a fundamental shift in capability. These systems understand context instead of matching words. They consider the full scope of the previous text rather than just a few words. They grasp semantic relationships, enabling them to suggest completions that match not just the grammar but also the meaning of the text. The generative capability allows them to produce entire phrases or sentences that maintain coherence with the existing content.

Auto-Complete Architecture

A modern neural auto-complete system integrates several sophisticated components that work together seamlessly.

The language model serves as the cognitive engine. It processes input text and maintains an internal state to capture the nuances of the ongoing text generation process. The tokenization component acts as a bridge between human-readable text and the model’s numerical representations. The generation controller orchestrates the process, employing advanced strategies to filter and rank potential completions. It carefully balances response time and suggestion quality, ensuring users receive helpful completions without noticeable delay.

Developing an effective neural auto-complete system involves overcoming several critical challenges. Latency is a primary concern: users expect suggestions within milliseconds, yet the system must handle the computational cost of neural network inference.
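One way to keep an eye on latency is to time the completion function over several calls and look at the median and the worst case. The sketch below uses only the standard library. The names measure_latency and fake_complete are hypothetical, and fake_complete merely sleeps to imitate a model call, so you would pass in your own completion function instead.

```python
import statistics
import time


def measure_latency(func, inputs, repeats=5):
    """Time func on each input and return latency statistics in milliseconds."""
    timings = []
    for text in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            func(text)
            timings.append((time.perf_counter() - start) * 1000)
    return {
        "calls": len(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


def fake_complete(text):
    """Stand-in for a model call (assumption: replace with a real one)."""
    time.sleep(0.01)  # pretend the model takes about 10 ms
    return text + " ..."


stats = measure_latency(fake_complete, ["machine", "deep learning"])
print(stats)
```

Reporting the maximum alongside the median is a deliberate choice here: an auto-complete feature feels slow when even a small share of keystrokes stall.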

Quality control is another challenge. The system is expected to generate relevant suggestions, thus requiring advanced filtering mechanisms to prevent inappropriate completions and ensure suggestions align with the user’s domain and writing style.
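As a rough illustration of such filtering, the sketch below post-processes a list of candidate completions: it keeps only the first line, rejects candidates containing blocked words, caps the length, and removes duplicates. The function name filter_completions, the blocked word, and the candidate strings are all invented for this example; a production filter would likely be far more elaborate.

```python
import re


def filter_completions(candidates, blocked_words, max_words=12):
    """Drop unwanted candidates and tidy up the rest."""
    blocked = {word.lower() for word in blocked_words}
    seen = set()
    results = []
    for candidate in candidates:
        # Keep only the first line so a suggestion never spans paragraphs
        text = candidate.strip().split("\n")[0]
        words = text.split()
        if not words:
            continue
        # Reject candidates containing any blocked word
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & blocked:
            continue
        # Cap the length and skip duplicates
        text = " ".join(words[:max_words])
        if text.lower() in seen:
            continue
        seen.add(text.lower())
        results.append(text)
    return results


# Made-up candidates, as if a model had returned them
candidates = [
    " intelligence is bright.\nIn other news",
    " intelligence is bright.",
    " intelligence is a stupid idea",
    "   ",
]
print(filter_completions(candidates, blocked_words=["stupid"]))
# ['intelligence is bright.']
```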

Resource management is crucial when scaling the system to support multiple users. The substantial memory demands of neural models and the computational intensity of text generation must be carefully balanced against system resources and response time requirements.

Basic Auto-Complete Implementation

Let’s put aside the considerations of a larger system and focus on a simple auto-complete function for partial text. It is easy to implement using the pre-trained models from the transformers library:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

class AutoComplete:
    def __init__(self, model_name='gpt2'):
        """Initialize the auto-complete system."""
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

    def get_completion(self, text, max_length=50):
        """Generate completion for the input text."""
        # Encode the input text
        inputs = self.tokenizer(text, add_special_tokens=False, return_tensors="pt")
        input_ids = inputs["input_ids"].to(self.device)
        attn_masks = inputs["attention_mask"].to(self.device)

        # Generate completion
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids,
                attention_mask=attn_masks,
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode and extract completion
        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        completion = full_text[len(text):]

        return completion

# using autocomplete to see what we get
auto_complete = AutoComplete()
text = "The future of artificial"
completion = auto_complete.get_completion(text)
print(f"Input: {text}")
print(f"Completion: {completion}")

Let’s see what the above code does. You defined the class AutoComplete, which loads GPT2Tokenizer as the text tokenizer and GPT2LMHeadModel as a pre-trained GPT-2 model capable of text generation. The model is set to evaluation mode since you are using it for inference, not training it.

Text generation happens in the function get_completion(). The input text is tokenized before being passed to the model. You invoke the model within the torch.no_grad() context to skip gradient calculation, which saves time and memory. The model is called with temperature=0.7 for balanced creativity, and max_length=50 caps the total number of tokens, prompt included. The model output is converted back to text using the tokenizer, and the prompt is sliced off so that only the newly generated part is returned. The other parameters in self.model.generate() are:

num_return_sequences=1 generates only one completion. The model can generate multiple outputs for the same input.
pad_token_id=self.tokenizer.eos_token_id tells generate() which token to use for padding. GPT-2 does not define a padding token, so the end-of-sequence token is reused.
do_sample=True enables sampling instead of deterministic text generation. You need this for creative generation.
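To see what the temperature setting does when sampling is enabled, the sketch below applies a temperature-scaled softmax to a handful of made-up scores. It is plain Python rather than the library's internal code, and the tokens and logits are hypothetical values chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens
tokens = ["intelligence", "flowers", "light"]
logits = [4.0, 2.0, 1.0]

for temperature in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{tok}={p:.3f}" for tok, p in zip(tokens, probs))
    print(f"temperature={temperature}: {summary}")
```

Lower temperatures concentrate the probability on the top-scoring token, while higher ones flatten the distribution. With do_sample=True, the next token is drawn from a distribution like this, so a value such as 0.7 favors likely words while still leaving room for variety.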

Caching and Batched Input

The code above works as a simple program, but you need some polishing to run it as a service.

First, let’s implement a caching system to improve performance for real-time applications:

from functools import lru_cache

class CachedAutoComplete(AutoComplete):
    def __init__(self, cache_size=1000, **kwargs):
        """Initialize with caching support."""
        super().__init__(**kwargs)
        self.get_completion = lru_cache(maxsize=cache_size)(
            self.get_completion
        )

This builds upon the previous class by wrapping the generation function in an LRU cache. Python’s functools module handles the caching automatically. Simply use CachedAutoComplete instead of AutoComplete, and everything works the same, except that the cache instantly returns the stored result for any input it has already processed. Because generation uses sampling, this also means a repeated prompt gets the same completion back until it is evicted from the cache.
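You can observe the cache's behavior without loading a model. The sketch below wraps a stand-in function the same way the constructor wraps get_completion. The function slow_complete and the counter are hypothetical, while cache_info() is provided by functools.lru_cache itself.

```python
from functools import lru_cache

call_count = 0


def slow_complete(text, max_length=50):
    """Stand-in for AutoComplete.get_completion (hypothetical)."""
    global call_count
    call_count += 1
    return f"{text} [completion #{call_count}]"


cached_complete = lru_cache(maxsize=2)(slow_complete)

first = cached_complete("The future of artificial")
second = cached_complete("The future of artificial")  # served from the cache
print(first == second)               # True: the function ran only once
print(call_count)                    # 1
print(cached_complete.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=2, currsize=1)
```

Note that lru_cache uses the call arguments as the cache key, so they must be hashable. A string prompt works, but a list of prompts passed as a single argument would raise a TypeError.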

Now, let’s optimize the system further for better real-time performance. One challenge in creating a service is handling multiple users simultaneously, making it beneficial to process multiple inputs as a batch. However, this increases memory usage. You can mitigate the additional workload by reducing the model size using 16-bit floats:

class OptimizedAutoComplete(CachedAutoComplete):
    def __init__(self, **kwargs):
        """Initialize with optimizations."""
        super().__init__(**kwargs)
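        # GPT-2 has no padding token, so reuse EOS; pad on the left so that
        # generated tokens directly follow each prompt in a batch
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "left"

        if self.device == "cuda":
            self.model = self.model.half()  # Use FP16 on GPU

        # Keep the model in evaluation mode
        self.model.eval()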

    def preprocess_batch(self, texts):
        """Efficiently process multiple texts."""
        # Tokenize all texts at once
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return inputs.to(self.device)

    def generate_batch(self, texts, max_length=50):
        """Generate completions for multiple texts."""
        # Preprocess batch
        inputs = self.preprocess_batch(texts)

        # Generate completions
        with torch.no_grad():
            outputs = self.model.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode completions
        completions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract new text
        results = []
        for text, completion in zip(texts, completions):
            results.append(completion[len(text):])

        return results

Converting a model to 16-bit floats is as simple as calling self.model = self.model.half() in the constructor. Most CPUs have limited support for 16-bit floating-point arithmetic. Hence, you should do that only if you can run the model on a GPU. Note that the function generate_batch() is mostly the same as the earlier get_completion() function, but you need to decode the batched output and collect the results in a list.
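One detail of batching deserves a closer look: prompts of different lengths must be padded to a common length. For a decoder-only model such as GPT-2, the padding is usually placed on the left so that newly generated tokens directly follow each prompt. The sketch below illustrates the idea in plain Python with made-up token IDs. The function pad_batch is hypothetical and only mimics what the tokenizer does when called with padding=True.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)  # 0 = ignore this position
        mask_seq = [1] * len(seq)      # 1 = real token
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_seq)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_seq + mask_pad)
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths
batch = [[11, 12, 13, 14], [21, 22]]
PAD = 0  # hypothetical pad ID; the article reuses GPT-2's EOS token

left_ids, left_mask = pad_batch(batch, PAD, side="left")
right_ids, right_mask = pad_batch(batch, PAD, side="right")
print(left_ids)    # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(left_mask)   # [[1, 1, 1, 1], [0, 0, 1, 1]]
print(right_ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
```

With right padding, the second prompt would end in pad tokens and the model would have to continue from those instead of from the real text. This is why the complete code passes padding_side="left" when loading the tokenizer.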

Below is the complete code, including how to use the batched generation:

from functools import lru_cache
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch


class AutoComplete:
    def __init__(self, model_name="gpt2"):
        """Initialize the auto-complete system."""
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side="left")
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

    def get_completion(self, text, max_length=50):
        """Generate completion for the input text."""
        print("**** Completion:", text)
        # Encode the input text
        inputs = self.tokenizer(text, add_special_tokens=False, return_tensors="pt")
        input_ids = inputs["input_ids"].to(self.device)
        attn_masks = inputs["attention_mask"].to(self.device)

        # Generate completion
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids,
                attention_mask=attn_masks,
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode and extract completion
        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        completion = full_text[len(text):]

        return completion


class CachedAutoComplete(AutoComplete):
    def __init__(self, cache_size=1000, **kwargs):
        """Initialize with caching support."""
        super().__init__(**kwargs)
        self.get_completion = lru_cache(maxsize=cache_size)(
            self.get_completion
        )


class OptimizedAutoComplete(CachedAutoComplete):
    def __init__(self, **kwargs):
        """Initialize with optimizations."""
        super().__init__(**kwargs)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        if self.device == "cuda":
            self.model = self.model.half()  # Use FP16 on GPU

        # Keep the model in evaluation mode
        self.model.eval()

    def preprocess_batch(self, texts):
        """Efficiently process multiple texts."""
        # Tokenize all texts at once
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return inputs.to(self.device)

    def generate_batch(self, texts, max_length=50):
        """Generate completions for multiple texts."""
        # Preprocess batch
        inputs = self.preprocess_batch(texts)

        # Generate completions
        with torch.no_grad():
            outputs = self.model.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=max_length,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7
            )

        # Decode completions
        completions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract new text
        results = []
        for text, completion in zip(texts, completions):
            results.append(completion[len(text):])

        return results

# Example: Optimized batch completion
optimized_complete = OptimizedAutoComplete()
texts = [
    "Machine learning is",
    "Deep neural networks can",
    "The training process involves"
]
completions = optimized_complete.generate_batch(texts)
for text, completion in zip(texts, completions):
    print(f"\nInput: {text}")
    print(f"Completion: {completion}")

Summary

In this tutorial, you discovered how to build an intelligent auto-complete system using GPT-2. Specifically, you learned:

The theory behind neural auto-completion systems
How to implement basic auto-completion
How to add caching for better performance
How to make context-aware suggestions
How to optimize for real-time use

The code examples are a starting point that you can build on for your own auto-complete applications.
