Streamlining Document Processing: Introducing Multi-Document Discovery for Intelligent Document Processing (IDP)

Transforming Document Processing: Automating Schema Discovery with Multi-Document Discovery

Understanding the Challenge of Intelligent Document Processing (IDP)

Before you can extract information from documents with Intelligent Document Processing (IDP), you need a schema for each document class that specifies which data to extract. That is hard to do when you have thousands of documents and do not yet know which classes they contain. Defining schemas by hand becomes a bottleneck that requires substantial manual effort and delays downstream IDP work.

In this blog post, we introduce the multi-document discovery feature, which removes this bottleneck by automating both class discovery and schema generation.

The IDP Accelerator: Setting the Stage

The IDP Accelerator is an open-source, scalable, and serverless solution designed for automated document processing and information extraction. It provides a framework where you can define document types and fields through a configuration file. For a minimal configuration example, check out our IDP Accelerator GitHub repository.

However, writing this initial schema is difficult if you do not already know your document types. The Discovery Module in the IDP Accelerator can bootstrap a class configuration from a single example document, but that requires a representative sample of each class to begin with. The multi-document discovery feature removes the need for predefined document classes, so you can start directly from a collection of unlabeled documents.

Solution Overview: How It Works

The video below demonstrates the multi-document discovery feature within the IDP Accelerator Console.

The multi-document discovery feature turns an unclassified document collection into a set of document classes and schemas. It is integrated with the existing Discovery Module as a new "Multiple Document" capability. Orchestrated by AWS Step Functions and AWS Lambda, the process has five stages:

Document Processing: Pulling documents from an Amazon S3 bucket or a Zip file.
Embedding Generation: Converting documents into vector embeddings using available models from Amazon Bedrock.
Clustering: Grouping similar documents based on these embeddings.
Schema Generation: Automatically identifying the document type of each cluster and creating its schema with agentic capabilities.
Reflection and Review: Analyzing generated schemas for overlaps and inconsistencies before final review.

Technical Details: A Closer Look

Let’s explore each critical component of this innovative solution.

Embedding Generation

For each document, we generate an embedding that translates its visual features into a numerical vector. For multi-page documents, only the first page is used. Using visual embeddings rather than embeddings of OCR text lets us capture the layout and structural cues that differentiate document types, even when their textual content is similar. In this workflow, we use Cohere Embed v4 via Amazon Bedrock.

Document Clustering

To uncover the variety of document types in your collection, the multi-document discovery feature utilizes the silhouette score, a metric that assesses how well clusters are distinguished from one another and how compact their contents are. Using k-means clustering, it tests different values of k, ultimately selecting the number of document types based on the highest silhouette score.
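
As a rough illustration of the selection loop described above, the following Python sketch runs k-means for several values of k and keeps the k with the highest mean silhouette score. It is not the accelerator's implementation: it uses only the standard library, a deterministic farthest-first seeding chosen here for reproducibility, and a toy 2D dataset standing in for real embedding vectors. The function names (`farthest_first`, `kmeans`, `silhouette_score`, `choose_k`) are hypothetical.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as the worst score
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_values):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy stand-in for document embeddings: three well-separated groups.
embeddings = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
best_k, scores = choose_k(embeddings, range(2, 6))
print(best_k)  # 3
```

For real embedding vectors, a library such as scikit-learn, which provides `KMeans` and `silhouette_score`, would be the more usual choice than hand-written loops.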

Benchmarking Embeddings and Clustering

Our rigorous testing with the OCR-benchmark dataset validated that our embedding and clustering methodology can effectively deliver accurate document classifications even without labeled training data. Clustering achieved excellent scores, demonstrating the capability of high-quality multimodal embeddings.

Agentic Schema Generation

After identifying clusters, we invoke a Strands Agent to determine the document type and generate schemas for each cluster autonomously. This model-driven approach enables flexible reasoning as it samples documents strategically across the entire cluster.
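
The post does not spell out how the agent chooses which documents to look at. One plausible reading of "sampling across the entire cluster" is to rank documents by distance from the cluster centroid and pick evenly spaced ranks, so the sample includes both typical and atypical members. The sketch below shows that idea only; `sample_across_cluster` and its parameters are hypothetical and are not part of the IDP Accelerator or Strands.

```python
import math


def sample_across_cluster(doc_ids, embeddings, n_samples=3):
    """Pick documents spread from the most typical (closest to the centroid)
    to the least typical (farthest from it) within a single cluster."""
    centroid = [sum(dim) / len(embeddings) for dim in zip(*embeddings)]
    ranked = sorted(
        zip(doc_ids, embeddings), key=lambda pair: math.dist(pair[1], centroid)
    )
    if n_samples >= len(ranked):
        return [doc_id for doc_id, _ in ranked]
    if n_samples == 1:
        return [ranked[0][0]]
    # Evenly spaced positions in the distance ranking, from first to last.
    step = (len(ranked) - 1) / (n_samples - 1)
    return [ranked[round(i * step)][0] for i in range(n_samples)]


# Toy cluster: four similar documents and one outlier.
doc_ids = ["a", "b", "c", "d", "e"]
vectors = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
sampled = sample_across_cluster(doc_ids, vectors, n_samples=3)
print(sampled)  # ['d', 'b', 'e']
```

The selected documents would then be handed to the agent as examples for naming the document type and drafting the schema.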

Schema Analysis

Once schemas are generated, our analysis step evaluates the differentiation among outputs, identifying overlaps or inconsistencies and generating recommendations for improvements.
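
To make the idea of overlap concrete, here is a minimal sketch that compares the field names of each pair of schemas using Jaccard similarity and flags pairs above a threshold. This is a simple heuristic for illustration, not the analysis step the accelerator actually runs; the schema contents, the `flag_overlapping_schemas` name, and the 0.5 threshold are all assumptions.

```python
from itertools import combinations


def flag_overlapping_schemas(schemas, threshold=0.5):
    """Return (class_a, class_b, jaccard) for every pair of schemas whose
    field-name sets overlap at or above the threshold."""
    flagged = []
    for (name_a, fields_a), (name_b, fields_b) in combinations(schemas.items(), 2):
        a, b = set(fields_a), set(fields_b)
        union = a | b
        jaccard = len(a & b) / len(union) if union else 0.0
        if jaccard >= threshold:
            flagged.append((name_a, name_b, round(jaccard, 2)))
    return flagged


# Hypothetical generated schemas: document class -> field names.
schemas = {
    "invoice": ["invoice_number", "vendor", "total", "due_date"],
    "bill": ["invoice_number", "vendor", "total", "billing_period"],
    "w2": ["employee_name", "employer_ein", "wages"],
}
overlaps = flag_overlapping_schemas(schemas)
print(overlaps)  # [('invoice', 'bill', 0.6)]
```

A flagged pair such as `invoice` and `bill` is a candidate for merging into one class or for adding fields that distinguish the two.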

Running a Job on Your Documents

To try the multi-document discovery workflow on your own documents, follow these simple steps in the IDP Accelerator Console:

Create a Configuration:
- Navigate to the Configuration section and initialize a new configuration.
Run Multi-Document Discovery:
- Start the discovery process by selecting your document source, either from S3 or Zip upload.
Monitor Job Progress:
- Keep an eye on your discovery job’s execution status and review the quality report once completed.

Best Practices for Optimal Results

Before running the discovery job at scale, keep these best practices in mind:

Ensure each file contains a single document, because the current workflow uses only the first page of each PDF.
Thoroughly review the quality report for overlapping clusters or uneven distributions before finalizing your schemas.
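
If you have the cluster assignment for each document, a quick way to spot an uneven distribution is to compute each cluster's share of the collection and flag very small clusters. The sketch below assumes a plain list of cluster labels and a hypothetical 5% cutoff; it does not reflect the format of the quality report itself.

```python
from collections import Counter


def cluster_shares(labels):
    """Fraction of all documents that fall into each cluster."""
    counts = Counter(labels)
    return {cluster: count / len(labels) for cluster, count in counts.items()}


def flag_small_clusters(labels, min_share=0.05):
    """Clusters holding less than min_share of the documents."""
    return sorted(
        cluster for cluster, share in cluster_shares(labels).items()
        if share < min_share
    )


# Toy assignment: 100 documents spread over three clusters.
labels = [0] * 60 + [1] * 38 + [2] * 2
small = flag_small_clusters(labels)
print(small)  # [2]
```

A very small cluster may be a genuine rare document type or a handful of outliers, so it is worth opening a few of its documents before accepting its schema.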

Next Steps: What to Do After Discovery

Depending on the discovery results:

If your schemas are clean with a quality report showing minimal overlap, you’re set to run IDP at scale.
If overlapping clusters are flagged, refine the generated schemas based on recommendations.
For inconsistent schema quality, consider retesting on a more balanced subset of documents.

Conclusion

In this post, we showed how the multi-document discovery feature automates schema creation for IDP by turning unlabeled document collections into structured, review-ready schemas.

We’d love to hear your experiences with this feature! Share your thoughts, queries, or insights in the comments below and, if you encounter any issues, feel free to contribute on our GitHub repository.

About the Authors

Grace Lang – Deep Learning Architect focused on delivering generative AI solutions.

Bob Strahan – Principal Solutions Architect specializing in advanced technical solutions.

David Kaleko – Senior Applied Scientist leading research in generative AI implementation strategies.

Spencer Romo – Senior Data Scientist with expertise in intelligent document processing and AI implementation.

Explore, experiment, and elevate your document processing capabilities with the IDP Accelerator!
