Applications with Context Vectors

Context vectors are a powerful tool for advanced NLP tasks. They allow you to capture the contextual meaning of words, such as identifying the correct sense of a word in a sentence when it has multiple meanings. In this post, we will explore some example applications of context vectors. Specifically:

  • You will learn how to extract contextual keywords from a document
  • You will learn how to generate a summary of a document using context vectors

Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.

Let’s get started.

Photo by Erik Karits. Some rights reserved.

Overview

This post is divided into two parts; they are:

  • Contextual Keyword Extraction
  • Contextual Text Summarization

Contextual Keyword Extraction

Contextual keyword extraction is a technique for identifying the most important words in a document based on their contextual relevance. Imagine that you have a document and want to highlight the most representative words. One way to do this is by finding the words that are most semantically similar to the document. This technique is useful for a wide range of NLP tasks, such as information retrieval, document clustering, and text summarization.

Let’s implement a simple contextual keyword extraction system by comparing each word in the document to the document as a whole:

In this example, the BERT model is used to generate context vectors for each word in the document. The document vector is computed as the average of all token vectors. Alternatively, you could obtain a document vector from the [CLS] token after feeding the entire document into the model at once. That is not done here because the document may be too long for the model to process in a single pass. Instead, the document is split into sentences, and each sentence is processed separately.

With the vectors for each word and the document, you compute the cosine similarity between each word and the document. The function extract_contextual_keywords() returns the top N words with the highest similarity scores. These results are then printed.

Cosine similarity measures how close two vectors are to each other. In this case, if a word vector is close to the document vector, it is assumed to be a good representative of the document. This works because the word vectors are context-aware, as generated by the transformer model. Unlike traditional keyword extraction methods that rely on frequency (such as TF-IDF) or predefined rules (such as RAKE), this approach leverages the semantic understanding captured by the transformer model.
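To illustrate the metric itself, here is a standalone NumPy sketch of cosine similarity, independent of the transformer code:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # close to 1 means same direction, close to -1 means opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))   # parallel vectors: close to 1.0
print(cosine_similarity(a, -a))      # opposite vectors: close to -1.0
```

Note that cosine similarity ignores vector magnitude: a vector and a scaled copy of it score the same, which is why it is preferred over Euclidean distance for comparing embeddings.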

When you run this code, it prints the top N words together with their cosine similarity scores.

To improve the result, you may consider implementing stop word removal to exclude common words such as “to” in the output.
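A minimal way to do this is to filter the extractor's output against a stop word set. The hand-rolled set below is illustrative; in practice you might use a curated list such as NLTK's stopwords corpus.

```python
# A small illustrative stop word set; real lists are much longer
STOP_WORDS = {"to", "the", "a", "an", "of", "is", "it", "and", "in", "on"}

def filter_stop_words(keywords):
    # keywords: list of (word, score) pairs from the extractor
    return [
        (w, s) for w, s in keywords
        if w.lower() not in STOP_WORDS and w.isalpha()
    ]

print(filter_stop_words([("learning", 0.91), ("to", 0.88), ("data", 0.85)]))
# -> [('learning', 0.91), ('data', 0.85)]
```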

Contextual Text Summarization

Summarizing a document can be done in different ways. One of the most common approaches is to select the most representative sentences from the document—a method known as extractive summarization.

One way to perform extractive summarization is by generating a vector for each sentence and a vector for the entire document. The sentences most similar to the document are then selected. With context vectors, it is straightforward to implement this approach. Let’s do this:

When you run this code, it prints the selected sentences as the summary.

In this example, the function get_sentence_embedding() is used to generate an embedding for an entire sentence by using the [CLS] token embedding from the last layer of the transformer. The [CLS] token is a special token prepended to the sentence, and the transformer is trained to produce an embedding that represents the entire input.

In the function extractive_summarize(), you generate sentence embeddings for each sentence in the document and compute the document embedding as the average of all sentence embeddings. Then, you calculate the cosine similarity between the document embedding and each sentence embedding, selecting the top N sentences with the highest similarity scores.

The summary is formed by joining these top N sentences in their original order within the document. This assumes that the most semantically similar sentences are the most representative of the document.

Summary

In this post, you saw how context vectors can be used in various applications. In particular, you learned:

  • How to generate context vectors for a document, sentence, or word
  • How to perform contextual keyword extraction to find important keywords in a document
  • How to perform extractive summarization

These applications demonstrate the power and versatility of context vectors for advanced NLP tasks. By understanding and leveraging these vectors, you can build sophisticated NLP systems that capture rich semantic relationships in text.
