Mastering RAG with Semi-Structured Data: A Hands-On Guide
Have you ever tried running Retrieval-Augmented Generation (RAG) on complex documents like PDFs, Word files, or financial reports? If so, you know that not all documents consist of simple text. Research papers, financial statements, and product manuals mix text with tables and other structured elements, posing a significant challenge for standard RAG systems.
Whether you’re working with academic papers or technical manuals, extracting useful information can be daunting. This guide will introduce a solutions-oriented approach, employing intelligent unstructured data parsing and a multi-vector retriever within the LangChain RAG framework.
Why RAG for Semi-Structured Data?
Traditional RAG pipelines often struggle with mixed-content documents. A basic text splitter might accidentally slice through a table, disregarding valuable data. Moreover, embedding the raw text from a large table often yields noisy vectors that are ineffective for semantic search. This results in language models missing critical context necessary for accurately answering user inquiries.
To combat these issues, we need a smarter system that intelligently separates text from tables, applying different strategies for storing and retrieving each type of data. This ensures our language model receives accurate, complete information to provide precise answers.
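To make the failure mode concrete, here is a toy sketch in plain Python (not LangChain's actual splitter; the document text and chunk size are invented for illustration). Fixed-size chunking cuts straight through a small markdown table, so no single chunk contains the whole table:

```python
doc = (
    "LLaMA2 was trained on a large corpus.\n\n"
    "| Model | Params | Tokens |\n"
    "|-------|--------|--------|\n"
    "| 7B    | 7B     | 2.0T   |\n"
    "| 70B   | 70B    | 2.0T   |\n\n"
    "The table above reports training scale."
)

def naive_split(text, chunk_size):
    """Split text into fixed-size chunks, ignoring document structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(doc, 80)
# The header row and the last data row land in different chunks, so no
# chunk contains the complete table:
print(any("| Model" in c and "| 70B" in c for c in chunks))  # → False
```

A retriever working over these chunks can surface the header without the data rows, or vice versa, which is exactly the missing-context problem described above.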
The Solution: Intelligent Data Parsing and Retrieval
Our method focuses on two primary components aimed at preserving the context and structure of the original documents:
1. Intelligent Data Parsing
Utilizing the Unstructured library, we employ its partition_pdf function to perform layout analysis and differentiate between text blocks and tables. This avoids the pitfalls of blindly splitting text while preserving the structural integrity of our data.
2. The Multi-Vector Retriever
The core of our advanced RAG technique lies in the multi-vector retriever, which allows us to maintain multiple representations of the data. For retrieval, we generate concise summaries of our text chunks and tables, enhancing their usability for embedding and similarity search. When it comes to answer generation, the full raw content is passed to the language model, ensuring the model accesses complete context.
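Before building the real retriever below, here is a minimal pure-Python sketch of the multi-vector pattern, with plain dictionaries standing in for the vector store and doc store (all names and sample strings are invented for illustration). Several derived representations share one ID that points back to the raw element:

```python
import uuid

# Dictionaries stand in for the vector store and doc store. Two derived
# representations (a summary and a hypothetical question) share one doc_id
# that points at the raw table.
raw_table = "| Model | Tokens |\n| 7B | 2.0T |\n| 70B | 2.0T |"
doc_id = str(uuid.uuid4())

docstore = {doc_id: raw_table}  # raw content, handed to the LLM at answer time
index = [                       # what would actually be embedded and searched
    ("Training token counts for each LLaMA2 model size", doc_id),
    ("How many tokens was LLaMA2 trained on?", doc_id),
]

# "Retrieval": match the query against the derived texts, then return the
# linked raw element rather than the matched text itself.
hit_id = next(d_id for text, d_id in index if "token" in text.lower())
context_for_llm = docstore[hit_id]
```

The key design choice is that what you search over (compact, embedding-friendly text) is decoupled from what you hand to the model (the complete raw content).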
Workflow Overview
Here’s how the entire process breaks down:
- Data Loading and Parsing: Utilize Unstructured to load and parse the PDF into distinguishable elements like tables and text.
- Creating Summaries: Generate concise summaries for more efficient retrieval.
- Building the Multi-Vector Retriever: Link summaries to raw data while storing them separately.
- Running the RAG Chain: Establish a seamless pipeline to query the language model using the retrieved data.
Step-by-Step Pipeline Construction
Let’s walk through building this system, step by step, using the LLaMA2 research paper as an illustrative example.
Step 1: Setting Up the Environment
Begin by installing the necessary Python packages for our environment:
!pip install langchain langchain-chroma "unstructured[all-docs]" pydantic lxml langchainhub langchain_openai -q
If you’re on Linux (or in Google Colab), install the additional system tools needed by Unstructured:
!apt-get install -y tesseract-ocr
!apt-get install -y poppler-utils
On macOS, the equivalents are available via Homebrew (brew install tesseract poppler).
Step 2: Data Loading and Parsing
Next, we’ll leverage the Unstructured library to process our PDF.
from unstructured.documents.elements import CompositeElement, Table
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="LLaMA2.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,    # keep table structure in element metadata
    chunking_strategy="by_title",  # chunk along section headings
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

# Categorize the parsed elements into tables and text chunks
table_elements, text_elements = [], []
for element in raw_pdf_elements:
    if isinstance(element, Table):
        table_elements.append(element)
    elif isinstance(element, CompositeElement):
        text_elements.append(element)
Step 3: Creating Summaries
Now, create concise summaries of each extracted element for more effective retrieval using LangChain’s capabilities.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Summarize: {element}")
model = ChatOpenAI(temperature=0)
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

table_summaries = summarize_chain.batch([t.text for t in table_elements])
text_summaries = summarize_chain.batch([t.text for t in text_elements])
Step 4: Building the Multi-Vector Retriever
We’ll establish a multi-vector retriever to store our summaries and raw content.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Vector store (for summaries) and doc store (for raw content)
vectorstore = Chroma(
    collection_name="summaries", embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add summaries to the vector store, linked by ID to the raw elements
def add_elements(summaries, elements):
    doc_ids = [str(uuid.uuid4()) for _ in elements]
    summary_docs = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, [e.text for e in elements])))

add_elements(table_summaries, table_elements)
add_elements(text_summaries, text_elements)
Step 5: Executing the RAG Chain
Finalizing our pipeline involves constructing the RAG chain for querying.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt_template = """Answer based on the following context:
{context}
Question: {question}
"""

chain = (
    {
        # The retriever returns the raw elements stored in the doc store;
        # join them into a single context string for the prompt.
        "context": retriever | (lambda docs: "\n\n".join(str(d) for d in docs)),
        "question": RunnablePassthrough(),
    }
    | ChatPromptTemplate.from_template(prompt_template)
    | model
    | StrOutputParser()
)

response = chain.invoke("What is the number of training tokens for LLaMA2?")
Conclusion
Navigating documents with mixed content is increasingly common in real-world applications. A simplistic RAG pipeline often falls short. By integrating intelligent data parsing with a multi-vector retriever, we achieve a more robust system that treats the unique structure of documents as an asset. This method ensures the language model has the complete context, yielding accurate and reliable answers.
For hands-on practice, access the code via the Colab notebook or GitHub repository linked below.
FAQs
Q1: Can this method adapt to other file formats?
A: Yes, the Unstructured library supports various file types; simply switch to the appropriate parsing function, like partition_docx.
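As a sketch of how you might route files to the right parser, here is a small hypothetical dispatch helper (our own convenience wrapper, not a library API). The partitioner names are real Unstructured functions; in practice you would map extensions to the imported functions themselves rather than to name strings:

```python
from pathlib import Path

# Hypothetical extension-to-partitioner map. partition_pdf, partition_docx,
# partition_pptx, and partition_html all exist in Unstructured; the mapping
# itself is our own illustration.
PARTITIONERS = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".html": "partition_html",
}

def partitioner_for(path):
    """Return the name of the Unstructured partitioner for a file."""
    suffix = Path(path).suffix.lower()
    try:
        return PARTITIONERS[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for {suffix!r}")

print(partitioner_for("quarterly_report.docx"))  # → partition_docx
```

Because every partitioner returns the same element types (Table, CompositeElement, and so on), the rest of the pipeline is unchanged regardless of input format.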
Q2: Are summaries the only option for the multi-vector retriever?
A: Besides summaries, you can generate hypothetical questions or embed smaller chunks of text.
Q3: Why not embed entire tables as raw text?
A: Large tables might create noisy embeddings. Summarizing captures the essence and enables effective semantic search.
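A crude way to see this effect without calling an embedding model is to compare word-overlap (Jaccard) similarity, a rough stand-in for semantic similarity; the sample strings below are invented for illustration. The numeric cells in a raw table share almost no vocabulary with a natural-language query, while a summary does:

```python
# Jaccard word overlap as a crude proxy for embedding similarity. Real
# systems use dense embeddings, but the dilution effect is analogous:
# numeric table cells contribute noise, not query-relevant signal.
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

raw_table = "7B 184320 2.0T 3.3e22 | 13B 368640 2.0T | 70B 1720320 2.0T"
summary = "Training tokens and compute for each LLaMA2 model size"
query = "How many training tokens were used for the LLaMA2 models?"

print(jaccard(query, raw_table) < jaccard(query, summary))  # → True
```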
About the Author
Harsh Mishra is an AI/ML Engineer who enjoys conversing with Large Language Models as much as optimizing his coffee intake. Passionate about GenAI and NLP, he’s dedicated to making machines smarter—at least until they outsmart him! 🚀☕