Building a RAG Pipeline with llama.cpp in Python

By Iván Palomares Carrascosa on April 19, 2025 in Language Models 6

Building a RAG Pipeline with llama.cpp in Python
Image by Editor | Midjourney

Using llama.cpp enables efficient and accessible inference of large language models (LLMs) on local devices, particularly when running on CPUs. This article takes this capability to a full retrieval augmented generation (RAG) level, providing a practical, example-based guide to building a RAG pipeline with this framework using Python.

Step-by-Step Process

First, we install the necessary packages:

pip install llama-cpp-python
pip install langchain langchain-community sentence-transformers chromadb
pip install pypdf requests pydantic tqdm

pip install llama-cpp-python

pip install langchain langchain-community sentence-transformers chromadb

pip install pypdf requests pydantic tqdm

Bear in mind the initial components setup will take a few minutes to complete if none were installed before in your running environment.

After installing llama.cpp, Langchain, and other components such as pypdf for handling PDF documents in document corpus, it’s time to import all we need.

import os
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import LlamaCpp
import requests
from tqdm import tqdm
import time

import os

from langchain.embeddings import HuggingFaceEmbeddings

from langchain.vectorstores import Chroma

from langchain.document_loaders import PyPDFLoader, TextLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

from langchain.llms import LlamaCpp

import requests

from tqdm import tqdm

import time

Time to get started with the real process. The first thing we need is locally downloading an LLM. Even though in a real scenario you may want a bigger LLM, to make our example relatively lightweight, we will load a relatively smaller LLM (I know, this just sounded contradictory!), namely the Llama 2 7B quantized model, available from Hugging Face:

model_path = "llama-2-7b-chat.Q4_K_M.gguf"

if not os.path.exists(model_path):
    print(f"Downloading {model_path}...")
    # You may want to replace the model URL by another of your choice
    model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
    response = requests.get(model_url, stream=True)
    total_size = int(response.headers.get('content-length', 0))
    
    with open(model_path, 'wb') as f:
        for data in tqdm(response.iter_content(chunk_size=1024), total=total_size//1024):
            f.write(data)
    print("Download complete!")

model_path = "llama-2-7b-chat.Q4_K_M.gguf"

if not os.path.exists(model_path):

print(f"Downloading {model_path}...")

# You may want to replace the model URL by another of your choice

model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"

response = requests.get(model_url, stream=True)

total_size = int(response.headers.get('content-length', 0))

with open(model_path, 'wb') as f:

for data in tqdm(response.iter_content(chunk_size=1024), total=total_size//1024):

f.write(data)

print("Download complete!")

Intuitively, we now need to set up another major component in any RAG system: the document base. In this example, we will create a mechanism to read documents in multiple formats, including .doc and .txt, and for simplicity we will provide a default sample text document built on the fly, adding it to our newly created documents directory, docs. To try it yourself with an extra level of fun, make sure you load actual documents of your own.

os.makedirs("docs", exist_ok=True)

# Sample text for demonstration purposes
with open("docs/sample.txt", "w") as f:
    f.write("""
    Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then 
    using that information to generate more accurate and informed responses.
    
    RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context
    for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.
    
    The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models
    on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible
    with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services.
    """)

documents = []
for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join("docs", file))
        documents.extend(loader.load())
    elif file.endswith(".txt"):
        loader = TextLoader(os.path.join("docs", file))
        documents.extend(loader.load())

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_documents(documents)

os.makedirs("docs", exist_ok=True)

# Sample text for demonstration purposes

with open("docs/sample.txt", "w") as f:

f.write("""

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches

for natural language processing tasks. It involves retrieving relevant information from a knowledge base and then

using that information to generate more accurate and informed responses.

RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context

for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.

The llama.cpp library is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models

on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible

with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services.

""")

documents = []

for file in os.listdir("docs"):

if file.endswith(".pdf"):

loader = PyPDFLoader(os.path.join("docs", file))

documents.extend(loader.load())

elif file.endswith(".txt"):

loader = TextLoader(os.path.join("docs", file))

documents.extend(loader.load())

# Split documents into chunks

text_splitter = RecursiveCharacterTextSplitter(

chunk_size=1000,

chunk_overlap=200,

length_function=len

)

chunks = text_splitter.split_documents(documents)

Notice that after processing the documents, we split them into chunks, which is a common practice in RAG systems for enhancing retrieval accuracy and ensuring the LLM effectively processes manageable inputs within its context window.

Both LLMs and RAG systems need to handle numerical representations of text rather than raw text, therefore, we next build a vector store that contains embeddings of our text documents. Chroma is a lightweight, open-source vector database for efficiently storing and querying embeddings.

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(

documents=chunks,

embedding=embeddings,

persist_directory="./chroma_db"

)

Now llama.cpp enters the scene for initializing our previously downloaded LLM. To do this, a LlamaCpp object is instantiated with the model path and other settings like model temperature, maximum context length, and so on.

llm = LlamaCpp(
    model_path=model_path,
    temperature=0.7,
    max_tokens=2000,
    n_ctx=4096,
    verbose=False
)

llm = LlamaCpp(

model_path=model_path,

temperature=0.7,

max_tokens=2000,

n_ctx=4096,

verbose=False

)

We are getting closer to the inference show, and just a few actors remain to appear on stage. One is the RAG prompt template, which is an elegant way to define how the retrieved context and user query are combined into a single, well-structured input for the LLM during inference.

template = """
Answer the question based on the following context:

{context}

Question: {question}
Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

template = """

Answer the question based on the following context:

{context}

Question: {question}

Answer:

"""

prompt = PromptTemplate(

template=template,

input_variables=["context", "question"]

)

Finally, we put everything together to create our RAG pipeline based on llama.cpp.

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

rag_pipeline = RetrievalQA.from_chain_type(

llm=llm,

chain_type="stuff",

retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),

return_source_documents=True,

chain_type_kwargs={"prompt": prompt}

)

Let’s review the building blocks of the RAG pipeline we just created for a better understanding:

llm: the LLM downloaded and then initialized using llama.cpp
chain_type: a method to specify how the retrieved documents in an RAG system are put together and sent to the LLM, with "stuff" meaning that all retrieved context is injected in the prompt.
retriever: initialized upon the vector store and configured to get the three most relevant document chunks.
return_source_documents=True: used to obtain information about which document chunks were used to answer the user’s question.
chain_type_kwargs={"prompt": prompt}: enables the use of our recently defined custom template to format the retrieval-augmented input into a presentable format for the LLM.

To finalize and see everything in action, we define and utilize a pipeline-driving function, ask_question(), that runs the RAG pipeline to answer the user’s questions.

def ask_question(question):
    start_time = time.time()
    result = rag_pipeline({"query": question})
    end_time = time.time()
    
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print("\nSource documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Document {i+1}:")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"Content: {doc.page_content[:150]}...\n")

def ask_question(question):

start_time = time.time()

result = rag_pipeline({"query": question})

end_time = time.time()

print(f"Question: {question}")

print(f"Answer: {result['result']}")

print(f"Time taken: {end_time - start_time:.2f} seconds")

print("\nSource documents:")

for i, doc in enumerate(result["source_documents"]):

print(f"Document {i+1}:")

print(f"Source: {doc.metadata.get('source', 'Unknown')}")

print(f"Content: {doc.page_content[:150]}...\n")

Now let’s try out our pipeline with some specific questions.

ask_question("What is RAG and how does it work?")
ask_question("What is llama.cpp?")
ask_question("How does LocalAI relate to cloud AI services?")

ask_question("What is RAG and how does it work?")

ask_question("What is llama.cpp?")

ask_question("How does LocalAI relate to cloud AI services?")

Result:

Question: What is RAG and how does it work?
Answer: RAG is a combination of retrieval-based and generation-based approaches for natural language processing tasks. It involves retrieving relevant information from a knowledge base and using that information to generate more accurate and informed responses. RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.
Time taken: 195.05 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: What is llama.cpp?
Answer: llama.cpp is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models on consumer hardware without requiring high-end GPUs.
Time taken: 35.61 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Document 2:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: How does LocalAI relate to cloud AI services?
Answer: LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services. This means that LocalAI allows developers to use their own AI models, trained on their own data, without having to rely on cloud-based services.
Time taken: 182.07 seconds

Source documents:
Document 1:
Source: docs/sample.txt
Content: on consumer hardware without requiring high-end GPUs.
    
    LocalAI is a framework that enables running AI models locally without relying on cloud ...

Document 2:
Source: docs/sample.txt
Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches
    for natural language processing ...

Question: What is RAG and how does it work?

Answer: RAG is a combination of retrieval-based and generation-based approaches for natural language processing tasks. It involves retrieving relevant information from a knowledge base and using that information to generate more accurate and informed responses. RAG models first retrieve documents that are relevant to a given query, then use these documents as additional context for language generation. This approach helps to ground the model's responses in factual information and reduces hallucinations.

Time taken: 195.05 seconds

Source documents:

Document 1:

Source: docs/sample.txt

Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches

for natural language processing ...

Document 2:

Source: docs/sample.txt

Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: What is llama.cpp?

Answer: llama.cpp is a C/C++ implementation of Meta's LLaMA model, optimized for CPU usage. It allows running LLaMA models on consumer hardware without requiring high-end GPUs.

Time taken: 35.61 seconds

Source documents:

Document 1:

Source: docs/sample.txt

Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches

for natural language processing ...

Document 2:

Source: docs/sample.txt

Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Question: How does LocalAI relate to cloud AI services?

Answer: LocalAI is a framework that enables running AI models locally without relying on cloud services. It provides APIs compatible with OpenAI's interfaces, allowing developers to use their own models with the same code they would use for OpenAI services. This means that LocalAI allows developers to use their own AI models, trained on their own data, without having to rely on cloud-based services.

Time taken: 182.07 seconds

Source documents:

Document 1:

Source: docs/sample.txt

Content: on consumer hardware without requiring high-end GPUs.

LocalAI is a framework that enables running AI models locally without relying on cloud ...

Document 2:

Source: docs/sample.txt

Content: Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches

for natural language processing ...

Wrapping Up

This article demonstrated how to set up and utilize a local RAG pipeline efficiently using llama.cpp, a popular framework for running inference on existing LLM locally in a lightweight and portable fashion. You should now be able to apply these newly-learned skills in your own projects.

6 Responses to Building a RAG Pipeline with llama.cpp in Python

Todd April 19, 2025 at 10:25 pm #

This is an interesting article because it can help you set up a basic RAG pipeline with inference. The real trick with RAG is how to architect it and tune it to give you relevant insights. I’m my experience writing a RAG pipeline, the pipeline is just the foundation. How you chunk, how much overlap you have, using techniques like multi-query or HyDE. All of these are critical to get good responses. Depending on your dataset and how you’re using it, it can be quite tricky.

- James Carmichael April 20, 2025 at 4:35 am #
  
  Thank you for your feedback Todd! We greatly appreciate your feedback!
  
Gold April 20, 2025 at 1:19 pm #

Awesome read I’ll be replicating these soon and play around with it

- James Carmichael April 21, 2025 at 7:44 am #
  
  Thank you for your feedback Gold! Keep us posted on your progress!
  
Johnnyblaze April 22, 2025 at 4:12 am #

Awesome concept 😎

- James Carmichael April 22, 2025 at 5:39 am #
  
  Thank you for your feedback Johnnyblaze!

Navigation

Building a RAG Pipeline with llama.cpp in Python

Step-by-Step Process

Wrapping Up

More On This Topic

6 Responses to Building a RAG Pipeline with llama.cpp in Python

Leave a Reply Click here to cancel reply.