A Developer’s Guide to RAG with Semi-Structured Data


Have you ever tried running Retrieval-Augmented Generation (RAG) on complex documents like PDFs, Word files, or financial reports? If so, you know that not all documents consist of simple text. Research papers, financial statements, and product manuals mix prose with tables and other structured elements, posing a significant challenge for standard RAG systems.

Whether you’re working with academic papers or technical manuals, extracting useful information can be daunting. This guide will introduce a solutions-oriented approach, employing intelligent unstructured data parsing and a multi-vector retriever within the LangChain RAG framework.

Why RAG for Semi-Structured Data?

Traditional RAG pipelines often struggle with mixed-content documents. A basic text splitter might accidentally slice through a table, disregarding valuable data. Moreover, embedding the raw text from a large table often yields noisy vectors that are ineffective for semantic search. This results in language models missing critical context necessary for accurately answering user inquiries.
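To make the failure mode concrete, here is a toy, structure-unaware splitter (purely illustrative, not any particular library's implementation) applied to a small table with made-up numbers. A fixed-size cut separates the header from later rows:

```python
def naive_split(text, chunk_size=40):
    """Fixed-size splitter with no awareness of document structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A small table, flattened to plain text
table = (
    "Model | Tokens\n"
    "7B    | 2.0T\n"
    "13B   | 2.0T\n"
    "70B   | 2.0T"
)

chunks = naive_split(table)
# The header row and the 70B row land in different chunks, so neither
# chunk alone carries the complete table
for chunk in chunks:
    print(repr(chunk))
```

The exact cut point depends on the chunk size, but the underlying problem is general: any splitter that ignores layout will eventually slice through a table.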

To combat these issues, we need a smarter system that intelligently separates text from tables, applying different strategies for storing and retrieving each type of data. This ensures our language model receives accurate, complete information to provide precise answers.

The Solution: Intelligent Data Parsing and Retrieval

Our method focuses on two primary components aimed at preserving the context and structure of the original documents:

1. Intelligent Data Parsing

Using the Unstructured library's partition_pdf function, we perform layout analysis to separate text from tables. This avoids the pitfalls of blindly splitting text and preserves the structural integrity of our data.

2. The Multi-Vector Retriever

The core of our advanced RAG technique lies in the multi-vector retriever, which allows us to maintain multiple representations of the data. For retrieval, we generate concise summaries of our text chunks and tables, enhancing their usability for embedding and similarity search. When it comes to answer generation, the full raw content is passed to the language model, ensuring the model accesses complete context.
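The idea can be sketched with a toy in-memory version (plain Python, standing in for real embeddings and LangChain's retriever): we search over summaries, but a shared id maps each hit back to its full raw content.

```python
import uuid

docstore = {}        # doc_id -> full raw content (what the LLM sees)
summary_index = []   # (doc_id, summary) pairs (what we search over)

def add(raw, summary):
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = raw
    summary_index.append((doc_id, summary))

def retrieve(query):
    # Stand-in for vector similarity: naive word overlap on summaries
    def score(summary):
        return len(set(query.lower().split()) & set(summary.lower().split()))
    doc_id, _ = max(summary_index, key=lambda pair: score(pair[1]))
    return docstore[doc_id]

add("Model | Tokens\n7B | 2.0T\n70B | 2.0T",
    "Table of training tokens per model size")
add("LLaMA2-Chat is aligned with RLHF.",
    "Note on the fine-tuning procedure")

print(retrieve("how many training tokens"))
```

The query matches the concise summary, but the model receives the raw table, which is exactly the split the multi-vector retriever formalizes.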

Workflow Overview

Here’s how the entire process breaks down:

  1. Data Loading and Parsing: Utilize Unstructured to load and parse the PDF into distinguishable elements like tables and text.
  2. Creating Summaries: Generate concise summaries for more efficient retrieval.
  3. Building the Multi-Vector Retriever: Link summaries to raw data while storing them separately.
  4. Running the RAG Chain: Establish a seamless pipeline to query the language model using the retrieved data.

Step-by-Step Pipeline Construction

Let’s walk through building this system, step by step, using the LLaMA2 research paper as an illustrative example.

Step 1: Setting Up the Environment

Begin by installing the necessary Python packages for our environment:

!pip install langchain langchain-chroma "unstructured[all-docs]" pydantic lxml langchainhub langchain_openai -q

If you’re running in Google Colab or on another Debian-based Linux system, install the system tools Unstructured needs for PDF parsing and OCR (on macOS, `brew install tesseract poppler` is the equivalent):

!apt-get install -y tesseract-ocr
!apt-get install -y poppler-utils

Step 2: Data Loading and Parsing

Next, we’ll leverage the Unstructured library to process our PDF.

from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import CompositeElement, Table

raw_pdf_elements = partition_pdf(
    filename="LLaMA2.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

# Categorizing the elements
table_elements, text_elements = [], []
for element in raw_pdf_elements:
    if isinstance(element, Table):
        table_elements.append(element)
    elif isinstance(element, CompositeElement):
        text_elements.append(element)

Step 3: Creating Summaries

Now, create concise summaries of each extracted element for more effective retrieval using LangChain’s capabilities.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Concisely summarize the following:\n\n{element}")
model = ChatOpenAI(temperature=0)
summarize_chain = prompt | model | StrOutputParser()

# Summarize each element's text (batch runs the calls concurrently)
table_summaries = summarize_chain.batch([{"element": str(el)} for el in table_elements])
text_summaries = summarize_chain.batch([{"element": str(el)} for el in text_elements])

Step 4: Building the Multi-Vector Retriever

We’ll establish a multi-vector retriever to store our summaries and raw content.

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Vector store (summaries) and doc store (raw content)
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Summaries go to the vector store, raw elements to the doc store,
# linked by a shared doc_id
def add_elements(summaries, elements):
    doc_ids = [str(uuid.uuid4()) for _ in elements]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, [str(el) for el in elements])))

add_elements(table_summaries, table_elements)
add_elements(text_summaries, text_elements)

Step 5: Executing the RAG Chain

Finalizing our pipeline involves constructing the RAG chain for querying.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt_template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template(prompt_template)
    | model
    | StrOutputParser()
)

response = chain.invoke("What is the number of training tokens for LLaMA2?")
print(response)

Conclusion

Navigating documents with mixed content is increasingly common in real-world applications. A simplistic RAG pipeline often falls short. By integrating intelligent data parsing with a multi-vector retriever, we achieve a more robust system that treats the unique structure of documents as an asset. This method ensures the language model has the complete context, yielding accurate and reliable answers.

For hands-on practice, access the code via the Colab notebook or GitHub repository linked below.

FAQs

Q1: Can this method adapt to other file formats?
A: Yes, the Unstructured library supports various file types; simply switch to the appropriate parsing function, like partition_docx.

Q2: Are summaries the only option for the multi-vector retriever?
A: Besides summaries, you can generate hypothetical questions or embed smaller chunks of text.
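A toy sketch of the hypothetical-questions variant (plain Python, purely illustrative): in practice an LLM generates the questions for each chunk, and embedding similarity replaces the word-overlap scoring used here:

```python
docs = {
    "d1": "Table 1: LLaMA2 was pretrained on 2 trillion tokens of data.",
    "d2": "Section 3: LLaMA2-Chat is tuned with RLHF.",
}

# In practice an LLM generates these per chunk; hard-coded here
questions = [
    ("d1", "how many tokens was llama2 trained on"),
    ("d2", "how is llama2-chat fine-tuned"),
]

def retrieve_by_question(query):
    def overlap(question):
        return len(set(query.lower().split()) & set(question.split()))
    doc_id, _ = max(questions, key=lambda pair: overlap(pair[1]))
    return docs[doc_id]  # return the source document, not the question

print(retrieve_by_question("number of training tokens"))
```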

Q3: Why not embed entire tables as raw text?
A: Large tables might create noisy embeddings. Summarizing captures the essence and enables effective semantic search.


About the Author

Harsh Mishra is an AI/ML Engineer who enjoys conversing with Large Language Models as much as optimizing his coffee intake. Passionate about GenAI and NLP, he’s dedicated to making machines smarter—at least until they outsmart him! 🚀☕

Feel free to explore more: Build a RAG Pipeline using Llama Index
