Building a Graph RAG System: A Step-by-Step Approach

By Kanwal Mehreen on December 2, 2024 in Language Models 5

Building a Graph RAG System: A Step-by-Step Approach
Image by Author | Ideogram.ai

Graph RAG, Graph RAG, Graph RAG! This term has become the talk of the town, and you might have come across it as well. But what exactly is Graph RAG, and what has made it so popular? In this article, we’ll explore the concept behind Graph RAG, why it’s needed, and, as a bonus, we’ll discuss how to implement it using LlamaIndex. Let’s get started!

First, let’s address the shift from large language models (LLMs) to Retrieval-Augmented Generation (RAG) systems. LLMs rely on static knowledge, which means they only use the data they were trained on. This limitation often makes them prone to hallucinations—generating incorrect or fabricated information. To handle this, RAG systems were developed. Unlike LLMs, RAG retrieves data in real-time from external knowledge bases, using this fresh context to generate more accurate and relevant responses. These traditional RAG systems work by using text embeddings to retrieve specific information. While powerful, they come with limitations. If you’ve worked on RAG-related projects, you’ll probably relate to this: the quality of the system’s response heavily depends on the clarity and specificity of the query. But an even bigger challenge emerged — the inability to reason effectively across multiple documents.

Now, What does that mean? Let’s take an example. Imagine you’re asking the system:

“Who were the key contributors to the discovery of DNA’s double-helix structure, and what role did Rosalind Franklin play?”

In a traditional RAG setup, the system might retrieve the following pieces of information:

Document 1: “James Watson and Francis Crick proposed the double-helix structure in 1953.”
Document 2: “Rosalind Franklin’s X-ray diffraction images were critical in identifying DNA’s helical structure.”
Document 3: “Maurice Wilkins shared Franklin’s images with Watson and Crick, which contributed to their discovery.”

The problem? Traditional RAG systems treat these documents as independent units. They don’t connect the dots effectively, leading to fragmented responses like:

“Watson and Crick proposed the structure, and Franklin’s work was important.”

This response lacks depth and misses key relationships between contributors. Enter Graph RAG! By organizing the retrieved data as a graph, Graph RAG represents each document or fact as a node, and the relationships between them as edges.

Here’s how Graph RAG would handle the same query:

Nodes: Represent facts (e.g., “Watson and Crick proposed the structure,” “Franklin contributed critical X-ray images”).
Edges: Represent relationships (e.g., “Franklin’s images → shared by Wilkins → influenced Watson and Crick”).

By reasoning across these interconnected nodes, Graph RAG can produce a complete and insightful response like:

“The discovery of DNA’s double-helix structure in 1953 was primarily led by James Watson and Francis Crick. However, this breakthrough heavily relied on Rosalind Franklin’s X-ray diffraction images, which were shared with them by Maurice Wilkins.”

This ability to combine information from multiple sources and answer broader, more complex questions is what makes Graph RAG so popular.

The Graph RAG Pipeline

We’ll now explore the Graph RAG pipeline, as presented in the paper “From Local to Global: A Graph RAG Approach to Query-Focused Summarization” by Microsoft Research.

Graph RAG Approach: Microsoft Research

Step 1: Source Documents → Text Chunks

LLMs can handle only a limited amount of text at a time. To maintain accuracy and ensure that nothing important is missed, we will first break down large documents into smaller, manageable “chunks” of text for processing.

Step 2: Text Chunks → Element Instances

From each chunk of source text, we will prompt the LLMs to identify graph nodes and edges. For example, from a news article, the LLMs might detect that “NASA launched a spacecraft” and link “NASA” (entity: node) to “spacecraft” (entity: node) through “launched” (relationship: edge).

Step 3: Element Instances → Element Summaries

After identifying the elements, the next step is to summarize them into concise, meaningful descriptions using LLMs. This process makes the data easier to understand. For example, for the node “NASA,” the summary could be: “NASA is a space agency responsible for space exploration missions.” For the edge connecting “NASA” and “spacecraft,” the summary might be: “NASA launched the spacecraft in 2023.” These summaries ensure the graph is both rich in detail and easy to interpret.

Step 4: Element Summaries → Graph Communities

The graph created in the previous steps is often too large to analyze directly. To simplify it, the graph is divided into communities using specialized algorithms like Leiden. These communities help identify clusters of closely related information. For example, one community might focus on “Space Exploration,” grouping nodes such as “NASA,” “Spacecraft,” and “Mars Rover.” Another might focus on “Environmental Science,” grouping nodes like “Climate Change,” “Carbon Emissions,” and “Sea Levels.” This step makes it easier to identify themes and connections within the dataset.

Step 5: Graph Communities → Community Summaries

LLMs prioritize important details and fit them into a manageable size. Therefore, each community is summarized to give an overview of the information it contains. For example: A community about “space exploration” might summarize key missions, discoveries, and organizations like NASA or SpaceX. These summaries are useful for answering general questions or exploring broad topics within the dataset.

Step 6: Community Summaries → Community Answers → Global Answer

Finally, the community summaries are used to answer user queries. Here’s how:

Query the Data: A user asks, “What are the main impacts of climate change?”
Community Analysis: The AI reviews summaries from relevant communities.
Generate Partial Answers: Each community provides partial answers, such as:
- “Rising sea levels threaten coastal cities.”
- “Disrupted agriculture due to unpredictable weather.”
Combine into a Global Answer: These partial answers are combined into one comprehensive response:

“Climate change impacts include rising sea levels, disrupted agriculture, and an increased frequency of natural disasters.”

This process ensures the final answer is detailed, accurate, and easy to understand.

Step-by-Step Implementation of GraphRAG with LlamaIndex

You can build your custom Python implementation or use frameworks like LangChain or LlamaIndex. For this article, we will use the LlamaIndex baseline code provided on their website; however, I will explain it in a beginner-friendly manner. Additionally, I encountered a parsing problem with the original code, which I will explain later along with how I solved it.

Step 1: Install Dependencies

Install the required libraries for the pipeline:

pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

1	pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

graspologic: Used for graph algorithms like Hierarchical Leiden for community detection.

Step 2: Load and Preprocess Data

Load sample news data, which will be chunked into smaller parts for easier processing. For demonstration, we limit it to 50 samples. Each row (title and text) is converted into a Document object.

import pandas as pd
from llama_index.core import Document

# Load sample dataset
news = pd.read_csv("https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv")[:50]

# Convert data into LlamaIndex Document objects
documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for _, row in news.iterrows()
]

import pandas as pd

from llama_index.core import Document

# Load sample dataset

news = pd.read_csv("https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv")[:50]

# Convert data into LlamaIndex Document objects

documents = [

Document(text=f"{row['title']}: {row['text']}")

for _, row in news.iterrows()

]

Step 3: Split Text into Nodes

Use SentenceSplitter to break down documents into manageable chunks.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(

chunk_size=1024,

chunk_overlap=20,

)

nodes = splitter.get_nodes_from_documents(documents)

chunk_overlap=20: Ensures chunks overlap slightly to avoid missing information at the boundaries

Step 4: Configure the LLM, Prompt, and GraphRAG Extractor

Set up the LLM (e.g., GPT-4). This LLM will later analyze the chunks to extract entities and relationships.

from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
llm = OpenAI(model="gpt-4")

from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

llm = OpenAI(model="gpt-4")

The GraphRAGExtractor uses the above LLM, a prompt template to guide the extraction process, and a parsing function to process the LLM’s output into structured data. Text chunks (called nodes) are fed into the extractor. For each chunk, the extractor sends the text to the LLM along with the prompt, which instructs the LLM to identify entities, their types, and their relationships. The response is parsed by a function (parse_fn), which extracts the entities and relationships. These are then converted into EntityNode objects (for entities) and Relation objects (for relationships), with descriptions stored as metadata. The extracted entities and relationships are saved into the text chunk’s metadata, ready for use in building knowledge graphs or performing queries.

Note: The issue in the original implementation was that the parse_fn failed to extract entities and relationships from the LLM-generated response, resulting in empty outputs for parsed entities and relationships. This occurred due to overly complex and rigid regular expressions that did not align well with the LLM response’s actual structure, particularly regarding inconsistent formatting and line breaks in the output. To address this, I have simplified the parse_fn by replacing the original regex patterns with straightforward patterns designed to match the key-value structure of the LLM response more reliably. The updated part looks like this:

entity_pattern = r'entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+?)\s*'
relationship_pattern = r'source_entity:\s*(.+?)\s*target_entity:\s*(.+?)\s*relation:\s*(.+?)\s*relationship_description:\s*(.+?)\s*'

def parse_fn(response_str: str) -> Any:
    entities = re.findall(entity_pattern, response_str)
    relationships = re.findall(relationship_pattern, response_str)
    return entities, relationships

entity_pattern = r'entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+?)\s*'

relationship_pattern = r'source_entity:\s*(.+?)\s*target_entity:\s*(.+?)\s*relation:\s*(.+?)\s*relationship_description:\s*(.+?)\s*'

def parse_fn(response_str: str) -> Any:

entities = re.findall(entity_pattern, response_str)

relationships = re.findall(relationship_pattern, response_str)

return entities, relationships

The prompt template and GraphRAGExtractor class are kept as is, as follows:

import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field
class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, rel, obj, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

import asyncio

import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict

from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs

from llama_index.core.indices.property_graph.utils import (

default_parse_triplets_fn,

)

from llama_index.core.graph_stores.types import (

EntityNode,

KG_NODES_KEY,

KG_RELATIONS_KEY,

Relation,

)

from llama_index.core.llms.llm import LLM

from llama_index.core.prompts import PromptTemplate

from llama_index.core.prompts.default_prompts import (

DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,

)

from llama_index.core.schema import TransformComponent, BaseNode

from llama_index.core.bridge.pydantic import BaseModel, Field

class GraphRAGExtractor(TransformComponent):

"""Extract triples from a graph.

Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

Args:

llm (LLM):

The language model to use.

extract_prompt (Union[str, PromptTemplate]):

The prompt to use for extracting triples.

parse_fn (callable):

A function to parse the output of the language model.

num_workers (int):

The number of workers to use for parallel processing.

max_paths_per_chunk (int):

The maximum number of paths to extract per chunk.

"""

llm: LLM

extract_prompt: PromptTemplate

parse_fn: Callable

num_workers: int

max_paths_per_chunk: int

def __init__(

self,

llm: Optional[LLM] = None,

extract_prompt: Optional[Union[str, PromptTemplate]] = None,

parse_fn: Callable = default_parse_triplets_fn,

max_paths_per_chunk: int = 10,

num_workers: int = 4,

) -> None:

"""Init params."""

from llama_index.core import Settings

if isinstance(extract_prompt, str):

extract_prompt = PromptTemplate(extract_prompt)

super().__init__(

llm=llm or Settings.llm,

extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,

parse_fn=parse_fn,

num_workers=num_workers,

max_paths_per_chunk=max_paths_per_chunk,

)

@classmethod

def class_name(cls) -> str:

return "GraphExtractor"

def __call__(

self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any

) -> List[BaseNode]:

"""Extract triples from nodes."""

return asyncio.run(

self.acall(nodes, show_progress=show_progress, **kwargs)

)

async def _aextract(self, node: BaseNode) -> BaseNode:

"""Extract triples from a node."""

assert hasattr(node, "text")

text = node.get_content(metadata_mode="llm")

try:

llm_response = await self.llm.apredict(

self.extract_prompt,

text=text,

max_knowledge_triplets=self.max_paths_per_chunk,

)

entities, entities_relationship = self.parse_fn(llm_response)

except ValueError:

entities = []

entities_relationship = []

existing_nodes = node.metadata.pop(KG_NODES_KEY, [])

existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])

metadata = node.metadata.copy()

for entity, entity_type, description in entities:

metadata[

"entity_description"

] = description # Not used in the current implementation. But will be useful in future work.

entity_node = EntityNode(

name=entity, label=entity_type, properties=metadata

)

existing_nodes.append(entity_node)

metadata = node.metadata.copy()

for triple in entities_relationship:

subj, rel, obj, description = triple

subj_node = EntityNode(name=subj, properties=metadata)

obj_node = EntityNode(name=obj, properties=metadata)

metadata["relationship_description"] = description

rel_node = Relation(

label=rel,

source_id=subj_node.id,

target_id=obj_node.id,

properties=metadata,

)

existing_nodes.extend([subj_node, obj_node])

existing_relations.append(rel_node)

node.metadata[KG_NODES_KEY] = existing_nodes

node.metadata[KG_RELATIONS_KEY] = existing_relations

return node

async def acall(

self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any

) -> List[BaseNode]:

"""Extract triples from nodes async."""

jobs = []

for node in nodes:

jobs.append(self._aextract(node))

return await run_jobs(

jobs,

workers=self.num_workers,

show_progress=show_progress,

desc="Extracting paths from text",

)

<strong>KG_TRIPLET_EXTRACT_TMPL</strong> = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity")

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

Format each relationship as ("relationship")

3. When finished, output.

-Real Data-
######################
text: {text}
######################
output:"""

<strong>KG_TRIPLET_EXTRACT_TMPL</strong> = """

-Goal-

Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.

Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-

1. Identify all entities. For each identified entity, extract the following information:

- entity_name: Name of the entity, capitalized

- entity_type: Type of the entity

- entity_description: Comprehensive description of the entity's attributes and activities

Format each entity as ("entity")

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.

For each pair of related entities, extract the following information:

- source_entity: name of the source entity, as identified in step 1

- target_entity: name of the target entity, as identified in step 1

- relation: relationship between source_entity and target_entity

- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

Format each relationship as ("relationship")

3. When finished, output.

-Real Data-

######################

text: {text}

######################

output:"""

kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

kg_extractor = GraphRAGExtractor(

llm=llm,

extract_prompt=KG_TRIPLET_EXTRACT_TMPL,

max_paths_per_chunk=2,

parse_fn=parse_fn,

)

Step 5: Build the Graph Index

The PropertyGraphIndex extracts entities and relationships from text using kg_extractor and stores them as nodes and edges in the GraphRAGStore.

import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage
class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

import re

from llama_index.core.graph_stores import SimplePropertyGraphStore

import networkx as nx

from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage

class GraphRAGStore(SimplePropertyGraphStore):

community_summary = {}

max_cluster_size = 5

def generate_community_summary(self, text):

"""Generate summary for a given text using an LLM."""

messages = [

ChatMessage(

role="system",

content=(

"You are provided with a set of relationships from a knowledge graph, each represented as "

"entity1->entity2->relation->relationship_description. Your task is to create a summary of these "

"relationships. The summary should include the names of the entities involved and a concise synthesis "

"of the relationship descriptions. The goal is to capture the most critical and relevant details that "

"highlight the nature and significance of each relationship. Ensure that the summary is coherent and "

"integrates the information in a way that emphasizes the key aspects of the relationships."

ChatMessage(role="user", content=text),

]

response = OpenAI().chat(messages)

clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()

return clean_response

def build_communities(self):

"""Builds communities from the graph and summarizes them."""

nx_graph = self._create_nx_graph()

community_hierarchical_clusters = hierarchical_leiden(

nx_graph, max_cluster_size=self.max_cluster_size

)

community_info = self._collect_community_info(

nx_graph, community_hierarchical_clusters

)

self._summarize_communities(community_info)

def _create_nx_graph(self):

"""Converts internal graph representation to NetworkX graph."""

nx_graph = nx.Graph()

for node in self.graph.nodes.values():

nx_graph.add_node(str(node))

for relation in self.graph.relations.values():

nx_graph.add_edge(

relation.source_id,

relation.target_id,

relationship=relation.label,

description=relation.properties["relationship_description"],

)

return nx_graph

def _collect_community_info(self, nx_graph, clusters):

"""Collect detailed information for each node based on their community."""

community_mapping = {item.node: item.cluster for item in clusters}

community_info = {}

for item in clusters:

cluster_id = item.cluster

node = item.node

if cluster_id not in community_info:

community_info[cluster_id] = []

for neighbor in nx_graph.neighbors(node):

if community_mapping[neighbor] == cluster_id:

edge_data = nx_graph.get_edge_data(node, neighbor)

if edge_data:

detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"

community_info[cluster_id].append(detail)

return community_info

def _summarize_communities(self, community_info):

"""Generate and store summaries for each community."""

for community_id, details in community_info.items():

details_text = (

"\n".join(details) + "."

) # Ensure it ends with a period

self.community_summary[

community_id

] = self.generate_community_summary(details_text)

def get_community_summaries(self):

"""Returns the community summaries, building them if not already done."""

if not self.community_summary:

self.build_communities()

return self.community_summary

from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)

from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(

nodes=nodes,

property_graph_store=GraphRAGStore(),

kg_extractors=[kg_extractor],

show_progress=True,

)

Output:

Extracting paths from text: 100%|██████████| 50/50 [02:51<00:00,  3.43s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]
Generating embeddings: 100%|██████████| 4/4 [00:01<00:00,  2.27it/s]

Extracting paths from text: 100%|██████████| 50/50 [02:51<00:00, 3.43s/it]

Generating embeddings: 100%|██████████| 1/1 [00:01<00:00, 1.53s/it]

Generating embeddings: 100%|██████████| 4/4 [00:01<00:00, 2.27it/s]

Step 6: Detect Communities and Summarize

Use graspologic’s Hierarchical Leiden algorithm to detect communities and generate summaries. Communities are groups of nodes (entities) that are densely connected internally but sparsely connected to other groups. This algorithm maximizes a metric called modularity, which measures the quality of dividing a graph into communities.

index.property_graph_store.build_communities()

1	index.property_graph_store.build_communities()

Warning: Isolated nodes (nodes with no relationships) are ignored by the Leiden algorithm. This is expected when some nodes do not form meaningful connections, resulting in a warning. So, don’t panic if you encounter this.

Step 7: Query the Graph

Initialize the GraphRAGQueryEngine to query the processed data. When a query is submitted, the engine retrieves relevant community summaries from the GraphRAGStore. For each summary, it uses the LLM to generate a specific answer contextualized to the query via the generate_answer_from_summary method. These partial answers are then synthesized into a coherent final response using the aggregate_answers method, where the LLM combines multiple perspectives into a concise output.

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM
class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response

from llama_index.core.query_engine import CustomQueryEngine

from llama_index.core.llms import LLM

class GraphRAGQueryEngine(CustomQueryEngine):

graph_store: GraphRAGStore

llm: LLM

def custom_query(self, query_str: str) -> str:

"""Process all community summaries to generate answers to a specific query."""

community_summaries = self.graph_store.get_community_summaries()

community_answers = [

self.generate_answer_from_summary(community_summary, query_str)

for _, community_summary in community_summaries.items()

]

final_answer = self.aggregate_answers(community_answers)

return final_answer

def generate_answer_from_summary(self, community_summary, query):

"""Generate an answer from a community summary based on a given query using LLM."""

prompt = (

f"Given the community summary: {community_summary}, "

f"how would you answer the following query? Query: {query}"

)

messages = [

ChatMessage(role="system", content=prompt),

ChatMessage(

role="user",

content="I need an answer based on the above information.",

]

response = self.llm.chat(messages)

cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()

return cleaned_response

def aggregate_answers(self, community_answers):

"""Aggregate individual community answers into a final, coherent response."""

# intermediate_text = " ".join(community_answers)

prompt = "Combine the following intermediate answers into a final, concise response."

messages = [

ChatMessage(role="system", content=prompt),

ChatMessage(

role="user",

content=f"Intermediate answers: {community_answers}",

]

final_response = self.llm.chat(messages)

cleaned_final_response = re.sub(

r"^assistant:\s*", "", str(final_response)

).strip()

return cleaned_final_response

query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)
response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))

query_engine = GraphRAGQueryEngine(

graph_store=index.property_graph_store, llm=llm

)

response = query_engine.query("What are news related to financial sector?")

display(Markdown(f"{response.response}"))

Output:

The majority of the provided summaries and information do not contain any news related to the financial sector. However, there are a few exceptions. Matt Pincus, through his company MUSIC, has made investments in Soundtrack Your Brand, indicating a financial commitment to support the company's growth. Nirmal Bang has given a Buy Rating to Tata Chemicals Ltd. (TTCH), suggesting a positive investment recommendation. Coinbase Global Inc. is involved in a legal conflict with the U.S. Securities and Exchange Commission (SEC) and is also engaged in a financial transaction involving the issuance of 0.50% Convertible Senior Notes. Deutsche Bank has recommended buying shares of Allegiant Travel and SkyWest, indicating promising opportunities in the aviation sector. Lastly, Coinbase Global, Inc. has repurchased 0.50% Convertible Senior Notes due 2026, indicating strategic financial management.

The majority of the provided summaries and information do not contain any news related to the financial sector. However, there are a few exceptions. Matt Pincus, through his company MUSIC, has made investments in Soundtrack Your Brand, indicating a financial commitment to support the company's growth. Nirmal Bang has given a Buy Rating to Tata Chemicals Ltd. (TTCH), suggesting a positive investment recommendation. Coinbase Global Inc. is involved in a legal conflict with the U.S. Securities and Exchange Commission (SEC) and is also engaged in a financial transaction involving the issuance of 0.50% Convertible Senior Notes. Deutsche Bank has recommended buying shares of Allegiant Travel and SkyWest, indicating promising opportunities in the aviation sector. Lastly, Coinbase Global, Inc. has repurchased 0.50% Convertible Senior Notes due 2026, indicating strategic financial management.

Wrapping Up

That’s all! I hope you enjoyed reading this article. There’s no doubt that Graph RAG enables you to answer both specific factual and complex abstract questions by understanding the relationships and structures within your data. However, it’s still in its early stages and has limitations, particularly in terms of token utilization, which is significantly higher than traditional RAG. Nevertheless, it’s an important development, and I personally look forward to seeing what’s next. If you have any questions or suggestions, feel free to share them in the comments section below.

5 Responses to Building a Graph RAG System: A Step-by-Step Approach

CH.Tseng December 4, 2024 at 7:52 pm #

Thank you very much., but I tried your program. It gives an error on this line:
community_hierarchical_clusters = hierarchical_leiden(

The error message is as follows:

in build_communities
community_hierarchical_clusters = hierarchical_leiden(
File “”, line 293, in hierarchical_leiden
File “/home/chtseng/envs/GraphRAG/lib/python3.10/site-packages/graspologic/partition/leiden.py”, line 588, in hierarchical_leiden
hierarchical_clusters_native = gn.hierarchical_leiden(
leiden.EmptyNetworkError: EmptyNetworkError

What’s the reason?

Reply
- James Carmichael December 5, 2024 at 8:27 am #
  
  Hi CS.Tseng…The error leiden.EmptyNetworkError indicates that the Leiden algorithm is being applied to an empty or invalid network. This usually happens when the input graph or network passed to the hierarchical_leiden function does not contain any edges or nodes, or if the preprocessing resulted in an invalid state. Here’s how you can troubleshoot and resolve the issue:
  
  ### Troubleshooting Steps:
  
  1. **Verify Input Graph:**
  – Ensure the input graph is not empty. Check the number of nodes and edges before passing it to the hierarchical_leiden function.
  python print("Number of nodes:", graph.number_of_nodes()) print("Number of edges:", graph.number_of_edges())
  
  2. **Check for Node and Edge Weights:**
  – If your graph is weighted, verify that all edges have valid weights and that there are no missing or None weights. If weights are required but not present, the function may treat the graph as invalid.
  
  3. **Preprocessing Pipeline:**
  – If there is a preprocessing step (e.g., filtering, subsetting), ensure it is not removing all edges or nodes inadvertently.
  python # Example check for node and edge lists print("Nodes:", list(graph.nodes())) print("Edges:", list(graph.edges(data=True)))
  
  4. **Self-Loops and Isolated Nodes:**
  – Some implementations may fail if there are only isolated nodes or self-loops. Filter out such nodes if necessary:
  python graph.remove_edges_from(nx.selfloop_edges(graph)) graph = nx.Graph(graph) # Recreate the graph without isolated nodes
  
  5. **Hierarchical Leiden Parameters:**
  – If you are using specific parameters, ensure they are correctly set. For example, confirm the modularity resolution is appropriate for your data.
  
  6. **Inspect the Graph Representation:**
  – Ensure the graph is represented in a compatible format. If you’re using networkx, ensure the conversion to the expected format (e.g., a native Graspologic graph object) is correct.
  
  ### Example of Input Validation:
  python import networkx as nx from graspologic.partition import hierarchical_leiden
  # Example graph creation graph = nx.karate_club_graph() # Check graph properties if graph.number_of_nodes() == 0 or graph.number_of_edges() == 0: raise ValueError("The input graph is empty. Please provide a valid graph.")
  # Run hierarchical Leiden try: community_hierarchical_clusters = hierarchical_leiden(graph) print("Communities:", community_hierarchical_clusters) except leiden.EmptyNetworkError as e: print("EmptyNetworkError:", e)
  
  ### Common Causes and Fixes:
  1. **Empty Graph:**
  – Ensure the graph is populated with valid edges and nodes.
  
  2. **Data Cleaning:**
  – Fix issues like duplicate edges, missing edge weights, or isolated nodes.
  
  3. **Graph Conversion:**
  – Ensure the graph is converted to the appropriate format before passing it to the function.
  
  Reply
Martin January 7, 2025 at 5:51 pm #

Hi, how to persist the KG and Summary? Currently the code doesn’t store anything

Reply
nongnongzi March 20, 2025 at 3:39 pm #

Hi, thank you for sharing this work!
I have a question about the property_graph_store passed to the index.
Cause in the following code:

index = PropertyGraphIndex(
nodes=nodes,
property_graph_store=GraphRAGStore(),
kg_extractors=[kg_extractor],
show_progress=True,
)

There is nothing passed to the GraphRAGStore(), meaning that there are no nodes and relationships stored in the GraphRAGStore. And In PropertyGraphIndex, there is no nodes and relations inserted into GraphRAGStore neither.
So how to build the summary and do query in the following step?

Thanks.

Reply
- James Carmichael March 21, 2025 at 5:41 pm #
  
  Great question — you’re digging into a subtle but important point about how the PropertyGraphIndex and GraphRAGStore work together.
  
  ### Let’s break it down step-by-step:
  
  #### 1. **GraphRAGStore()**
  When you initialize GraphRAGStore() without passing any data, it creates an *empty* property graph store. That’s true — there are no nodes or relationships at that point.
  
  python graph_store = GraphRAGStore() # Empty at this point
  
  #### 2. **PropertyGraphIndex**
  Now, when you do:
  
  python index = PropertyGraphIndex( nodes=nodes, property_graph_store=graph_store, kg_extractors=[kg_extractor], show_progress=True, )
  
  You’re providing nodes, which are likely unstructured text chunks, and a kg_extractor — a knowledge graph extractor that can process these chunks into triplets (subject, predicate, object).
  
  👉 **What happens inside PropertyGraphIndex?**
  
  The key here is that even though GraphRAGStore is initially empty, **the PropertyGraphIndex takes nodes, applies the kg_extractor, and populates the graph store internally**.
  
  – kg_extractor extracts structured data (triplets) from nodes.
  – Those triplets are then **inserted into the GraphRAGStore** inside PropertyGraphIndex.
  
  So yes, GraphRAGStore() is empty to start, but it gets filled during index construction.
  
  #### 3. **Querying / Summarization**
  After this step, the PropertyGraphIndex has built a graph with the help of the extractor and the graph store. From there, you can:
  
  – Query the index
  – Retrieve relevant triplets from GraphRAGStore
  – Build summaries based on traversals over the property graph
  
  #### TL;DR Answer:
  ✅ GraphRAGStore() starts empty
  ✅ PropertyGraphIndex uses nodes and kg_extractors to extract triplets
  ✅ These triplets are **added** to the GraphRAGStore
  ✅ So later queries/summaries work as expected
  
  —
  
  If you’re curious, you can inspect the graph_store after building the index to see the triplets:
  
  python print(graph_store.get_all_nodes()) print(graph_store.get_all_relationships())
  
  Reply

Navigation

Building a Graph RAG System: A Step-by-Step Approach

The Graph RAG Pipeline

Step 1: Source Documents → Text Chunks

Step 2: Text Chunks → Element Instances

Step 3: Element Instances → Element Summaries

Step 4: Element Summaries → Graph Communities

Step 5: Graph Communities → Community Summaries

Step 6: Community Summaries → Community Answers → Global Answer

Step-by-Step Implementation of GraphRAG with LlamaIndex

Step 1: Install Dependencies

Step 2: Load and Preprocess Data

Step 3: Split Text into Nodes

Step 4: Configure the LLM, Prompt, and GraphRAG Extractor

Step 5: Build the Graph Index

Step 6: Detect Communities and Summarize

Step 7: Query the Graph

Wrapping Up

More On This Topic

5 Responses to Building a Graph RAG System: A Step-by-Step Approach

Leave a Reply Click here to cancel reply.