Augmenting Data in Amazon OpenSearch with Third-Party ML Connectors
When working with Amazon OpenSearch, augmenting your data before ingesting it is a common requirement, especially when you want to enrich log files with geographic information based on IP addresses or identify the languages of customer comments. Typically, this enrichment is handled by external processes that complicate data pipelines and introduce failure points. OpenSearch addresses this with third-party machine learning (ML) connectors that streamline the process.
This blog post highlights two powerful ML connectors: Amazon Comprehend and Amazon Bedrock.
Using the Amazon Comprehend Connector for Language Detection
The first connector we’ll explore is the Amazon Comprehend connector. With it, OpenSearch can invoke the Comprehend DetectDominantLanguage API to determine the language of each ingested document.
Solution Overview
To illustrate the language detection capability, we will use Amazon OpenSearch alongside Amazon Comprehend. We’ve provided the necessary source code, an Amazon SageMaker notebook, and an AWS CloudFormation template in the sample-opensearch-ml-rest-api GitHub repository.
Prerequisites
Before running the full demo, ensure you have an AWS account that grants access to the necessary services.
Part 1: Setting Up the Amazon Comprehend ML Connector
Enabling Access to Amazon Comprehend
To allow OpenSearch to make calls to Amazon Comprehend, you need an IAM role with permissions to invoke the DetectDominantLanguage API. The CloudFormation template creates this role, aptly named --SageMaker-OpenSearch-demo-role. Follow these steps to link the role to your OpenSearch cluster:
- Open the OpenSearch Dashboard and sign in.
- Navigate to Security > Roles.
- Search for ml_full_access and select the Mapped Users link.
- Add the ARN for the IAM role you created, allowing OpenSearch to interface with the necessary AWS resources.
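If you prefer to script this step, the same mapping can be applied through the OpenSearch Security REST API instead of the dashboard. A minimal sketch follows; the domain endpoint and role ARN are placeholders to replace with your own values:

```python
# Placeholders -- substitute your own OpenSearch domain endpoint and IAM role ARN.
opensearch_endpoint = "https://my-domain.us-east-1.es.amazonaws.com"
iam_role_arn = "arn:aws:iam::123456789012:role/SageMaker-OpenSearch-demo-role"

# The security plugin maps backend roles (here, an IAM role ARN) onto an OpenSearch role.
mapping_payload = {"backend_roles": [iam_role_arn]}
url = f"{opensearch_endpoint}/_plugins/_security/api/rolesmapping/ml_full_access"
# PUT mapping_payload to url using admin credentials, for example with the requests library.
```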
Configuring the ML Connector
Next, set up the ML connector that links OpenSearch to Amazon Comprehend. Build SigV4 credentials from your IAM session, then create the connector as follows:

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, 'es',
                   session_token=credentials.token)

payload = {
    "name": "Comprehend lang identification",
    "description": "comprehend model",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {
        "roleArn": sageMakerOpenSearchRoleArn
    },
    "parameters": {
        "region": region,
        "service_name": "comprehend",
        "api_version": "20171127",  # Comprehend's wire API version; the target becomes Comprehend_20171127.DetectDominantLanguage
        "api_name": "DetectDominantLanguage",
        "api": "Comprehend_${parameters.api_version}.${parameters.api_name}",
        "response_filter": "$"
    },
    "actions": [...]
}

# url is the _plugins/_ml/connectors/_create endpoint on your OpenSearch domain
comprehend_connector_response = requests.post(url, auth=awsauth, json=payload)
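The connector ID needed for registration comes back in the body of the create call. A small sketch of pulling it out, using a hypothetical response body in place of the live call:

```python
import json

# Hypothetical response body from the connector _create endpoint.
create_response_body = '{"connector_id": "abc123XYZ"}'

# The _create endpoint returns the new connector's ID under "connector_id".
comprehend_connector = json.loads(create_response_body)["connector_id"]
print(comprehend_connector)
```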
Registering the Amazon Comprehend API Connector
Once the connector is set up, register it as a remote model with OpenSearch:

payload = {
    "name": "comprehend lang id API",
    "function_name": "remote",
    "description": "API to detect the language of text",
    "connector_id": comprehend_connector
}

# POST this payload to /_plugins/_ml/models/_register?deploy=true on your domain
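The register call returns the model ID used by the predict and ingest-pipeline steps below. A sketch of extracting it, with a hypothetical response body standing in for the live call:

```python
import json

# Hypothetical response body from /_plugins/_ml/models/_register?deploy=true.
register_response_body = '{"task_id": "task-1", "status": "CREATED", "model_id": "model-42"}'

# The model ID is what later _predict calls and the ingest pipeline reference.
comprehend_model_id = json.loads(register_response_body)["model_id"]
```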
Testing the Amazon Comprehend API
After registration, invoke the model's _predict endpoint to test the API:

payload = {
    "parameters": {
        "Text": "你知道厕所在哪里吗"  # Chinese: "Do you know where the bathroom is?"
    }
}

# POST to /_plugins/_ml/models/<model_id>/_predict
The expected output shows the language code zh with a high score:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": {
              "Languages": [
                {
                  "LanguageCode": "zh",
                  "Score": 1.0
                }
              ]
            }
          }
        }
      ],
      "status_code": 200
    }
  ]
}
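A small helper can pull the detected language out of this nested response. A sketch, assuming the response shape shown above:

```python
def extract_language(inference_response):
    """Return (language_code, score) from a Comprehend connector response."""
    output = inference_response["inference_results"][0]["output"][0]
    languages = output["dataAsMap"]["response"]["Languages"]
    # Comprehend may return several candidates; take the highest-scoring one.
    top = max(languages, key=lambda lang: lang["Score"])
    return top["LanguageCode"], top["Score"]

sample = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "response",
                    "dataAsMap": {
                        "response": {"Languages": [{"LanguageCode": "zh", "Score": 1.0}]}
                    },
                }
            ],
            "status_code": 200,
        }
    ]
}

print(extract_language(sample))  # ('zh', 1.0)
```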
Creating an Ingest Pipeline
Set up an OpenSearch ingest pipeline that utilizes the Amazon Comprehend API to annotate the language of your documents.
{
  "description": "ingest identify lang with the comprehend API",
  "processors": [
    {
      "ml_inference": {
        "model_id": comprehend_model_id,
        "input_map": [
          {
            "Text": "Text"
          }
        ],
        "output_map": [
          {
            "detected_language": "response.Languages[0].LanguageCode",
            "language_score": "response.Languages[0].Score"
          }
        ]
      }
    }
  ]
}
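To see what the processor's output_map does to a document, the dot-path extraction can be mimicked locally. This is a simplified sketch of the idea, not the processor's actual implementation (which supports a fuller JSONPath syntax):

```python
import re

def get_path(obj, path):
    """Resolve a dotted path like 'response.Languages[0].LanguageCode'."""
    for part in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        if part.startswith("["):
            obj = obj[int(part[1:-1])]  # list index, e.g. [0]
        else:
            obj = obj[part]  # dict key
    return obj

# A fake model response and the output_map from the pipeline above.
model_response = {"response": {"Languages": [{"LanguageCode": "de", "Score": 0.99}]}}
output_map = {
    "detected_language": "response.Languages[0].LanguageCode",
    "language_score": "response.Languages[0].Score",
}

doc = {"Text": "Wo ist die Toilette?"}
doc.update({field: get_path(model_response, path) for field, path in output_map.items()})
print(doc["detected_language"], doc["language_score"])  # de 0.99
```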
Part 2: Utilizing the Amazon Bedrock ML Connector for Semantic Search
Next, we will demonstrate how to enhance OpenSearch capabilities using the Amazon Bedrock connector to access the Amazon Titan Text Embeddings v2 model.
Overview of Amazon Bedrock
Amazon Bedrock provides an easy interface to various powerful AI foundation models, including those from Amazon and industry leaders. This allows you to customize models for your specific needs while adhering to security and responsible AI practices.
Steps to Set Up the Amazon Bedrock Connector
- Creating the OpenSearch ML connector: similar to the Comprehend connector, define the parameters and setup for Amazon Bedrock.
- Creating an index: configure the index to accommodate sentence vectors and related data.
- Setting up the ingest pipeline: streamline data processing using OpenSearch ingestion capabilities.
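As a sketch of the first step, a connector payload for Titan Text Embeddings v2 mirrors the Comprehend one. The Region, role ARN, and endpoint below are assumptions modeled on the public Bedrock connector blueprints, so adjust them for your account:

```python
bedrock_region = "us-east-1"  # assumption: use your own Region
sageMakerOpenSearchRoleArn = "arn:aws:iam::123456789012:role/SageMaker-OpenSearch-demo-role"  # placeholder ARN
model_id = "amazon.titan-embed-text-v2:0"

bedrock_connector_payload = {
    "name": "Bedrock Titan embeddings connector",
    "description": "Connector to the Amazon Titan Text Embeddings v2 model",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {"roleArn": sageMakerOpenSearchRoleArn},
    "parameters": {"region": bedrock_region, "service_name": "bedrock"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": f"https://bedrock-runtime.{bedrock_region}.amazonaws.com/model/{model_id}/invoke",
            "headers": {"content-type": "application/json"},
            # Titan embeddings expect an inputText field in the request body.
            "request_body": '{ "inputText": "${parameters.inputText}" }',
        }
    ],
}
```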
Example of Indexing Real Data
You can use Pandas to load sentences from language-specific JSON files and prepare them for indexing:
import json
import pandas as pd

def load_sentences(file_name):
    sentences = []
    with open(file_name, 'r', encoding='utf-8') as file:
        for line in file:
            try:
                data = json.loads(line)
                sentences.append({
                    'sentence': data['sentence'],
                    'sentence_english': data['sentence_english']
                })
            except json.JSONDecodeError:
                continue
    return pd.DataFrame(sentences)

german_df = load_sentences('german.json')
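From here, each DataFrame row can be turned into an OpenSearch _bulk request. A sketch of building the bulk body, with the embedding call stubbed out (in the full demo it would go through the Bedrock connector; Titan v2 returns 1024-dimensional vectors by default):

```python
import json
import pandas as pd

def build_bulk_body(df, index_name, embed_fn):
    """Build an OpenSearch _bulk request body from a sentences DataFrame.

    embed_fn maps a sentence to its vector; here it stands in for the
    Titan embeddings call made through the connector.
    """
    lines = []
    for _, row in df.iterrows():
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({
            "sentence": row["sentence"],
            "sentence_english": row["sentence_english"],
            "sentence_vector": embed_fn(row["sentence"]),
        }))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

german_df = pd.DataFrame([
    {"sentence": "Wo ist der Bahnhof?", "sentence_english": "Where is the train station?"},
])
# Stub embedding: a zero vector with Titan v2's default dimension of 1024.
body = build_bulk_body(german_df, "sentences", embed_fn=lambda s: [0.0] * 1024)
```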
Performing Semantic k-NN Searches
Once the data is indexed, you can run k-NN searches to find semantically similar sentences across multiple languages.
search_query = {
    "query": {
        "knn": {
            "sentence_vector": {
                "vector": query_vector,
                "k": 30
            }
        }
    }
}
This will allow you to retrieve relevant documents based on the context and language of your query.
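The hits that come back can be flattened into (score, sentence, English gloss) tuples for display. A sketch assuming the index fields used above, run against a hypothetical search response:

```python
def top_matches(search_response, n=3):
    """Extract the top-n (score, sentence, English gloss) tuples from a search response."""
    hits = search_response["hits"]["hits"]
    return [
        (hit["_score"], hit["_source"]["sentence"], hit["_source"]["sentence_english"])
        for hit in hits[:n]
    ]

# Hypothetical response shape; a real response carries more metadata per hit.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.92, "_source": {"sentence": "Wo ist der Bahnhof?",
                                         "sentence_english": "Where is the train station?"}},
        ]
    }
}
print(top_matches(sample_response))
```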
Conclusion
By leveraging the OpenSearch ML connectors for Amazon Comprehend and Amazon Bedrock, you can significantly enhance your data ingestion process, integrating powerful ML capabilities directly into your data pipeline. Using OpenSearch’s ML connectors not only simplifies your architecture and reduces operational costs, but also gives you the flexibility to handle complex ML use cases efficiently.
For more hands-on implementation details, visit the GitHub repository and explore the full demo. Happy analyzing!
About the Authors
John Trollinger – Principal Solutions Architect specializing in OpenSearch and Data Analytics.
Shwetha Radhakrishnan – Solutions Architect focused on Data Analytics & Machine Learning at AWS.