Augmenting Data in Amazon OpenSearch with Third-Party ML Connectors
When working with Amazon OpenSearch, augmenting your data before ingesting it is a common requirement, especially when you want to enrich log files with geographic information based on IP addresses or identify the languages of customer comments. Typically, this enrichment is handled by external processes that complicate data pipelines and introduce failure points. OpenSearch addresses this with third-party machine learning (ML) connectors that streamline the process.
This blog post highlights two powerful ML connectors: Amazon Comprehend and Amazon Bedrock.
Using the Amazon Comprehend Connector for Language Detection
The first connector we’ll explore is the Amazon Comprehend connector. With it, OpenSearch can invoke the Comprehend DetectDominantLanguage API to determine the language of each ingested document.
Solution Overview
To illustrate the language detection capability, we will use Amazon OpenSearch alongside Amazon Comprehend. We’ve provided the necessary source code, an Amazon SageMaker notebook, and an AWS CloudFormation template in the sample-opensearch-ml-rest-api GitHub repository.
Prerequisites
Before running the full demo, ensure you have an AWS account that grants access to the necessary services.
Part 1: Setting Up the Amazon Comprehend ML Connector
Enabling Access to Amazon Comprehend
To allow OpenSearch to make calls to Amazon Comprehend, you need an IAM role with permissions to invoke the DetectDominantLanguage API. The CloudFormation template creates this role, aptly named --SageMaker-OpenSearch-demo-role. Follow these steps to link the role to your OpenSearch cluster:
- Open the OpenSearch Dashboard and sign in.
- Navigate to Security > Roles.
- Search for ml_full_access and select the Mapped Users link.
- Add the ARN for the IAM role you created, allowing OpenSearch to interface with the necessary AWS resources.
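If you prefer to script this step, the same mapping can be applied through the OpenSearch Security REST API instead of the dashboard. A minimal sketch follows; the domain endpoint and role ARN are placeholders to replace with your own values:

```python
# Placeholders -- substitute your own OpenSearch domain endpoint and IAM role ARN.
opensearch_endpoint = "https://my-domain.us-east-1.es.amazonaws.com"
iam_role_arn = "arn:aws:iam::123456789012:role/SageMaker-OpenSearch-demo-role"

# The security plugin maps backend roles (here, an IAM role ARN) onto an OpenSearch role.
mapping_payload = {"backend_roles": [iam_role_arn]}
url = f"{opensearch_endpoint}/_plugins/_security/api/rolesmapping/ml_full_access"
# PUT mapping_payload to url using admin credentials, for example with the requests library.
```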
Configuring the ML Connector
Next, set up the ML connector that links OpenSearch to Amazon Comprehend. Build SigV4 credentials from your IAM session, then create the connector as follows:

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, 'es',
                   session_token=credentials.token)

payload = {
    "name": "Comprehend lang identification",
    "description": "comprehend model",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {
        "roleArn": sageMakerOpenSearchRoleArn
    },
    "parameters": {
        "region": region,
        "service_name": "comprehend",
        "api_version": "20171127",  # Comprehend's wire API version; the target becomes Comprehend_20171127.DetectDominantLanguage
        "api_name": "DetectDominantLanguage",
        "api": "Comprehend_${parameters.api_version}.${parameters.api_name}",
        "response_filter": "$"
    },
    "actions": [...]
}

# url is the _plugins/_ml/connectors/_create endpoint on your OpenSearch domain
comprehend_connector_response = requests.post(url, auth=awsauth, json=payload)
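The connector ID needed for registration comes back in the body of the create call. A small sketch of pulling it out, using a hypothetical response body in place of the live call:

```python
import json

# Hypothetical response body from the connector _create endpoint.
create_response_body = '{"connector_id": "abc123XYZ"}'

# The _create endpoint returns the new connector's ID under "connector_id".
comprehend_connector = json.loads(create_response_body)["connector_id"]
print(comprehend_connector)
```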
Registering the Amazon Comprehend API Connector
Once the connector is set up, register it as a remote model with OpenSearch:

payload = {
    "name": "comprehend lang id API",
    "function_name": "remote",
    "description": "API to detect the language of text",
    "connector_id": comprehend_connector
}

# POST this payload to /_plugins/_ml/models/_register?deploy=true on your domain
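The register call returns the model ID used by the predict and ingest-pipeline steps below. A sketch of extracting it, with a hypothetical response body standing in for the live call:

```python
import json

# Hypothetical response body from /_plugins/_ml/models/_register?deploy=true.
register_response_body = '{"task_id": "task-1", "status": "CREATED", "model_id": "model-42"}'

# The model ID is what later _predict calls and the ingest pipeline reference.
comprehend_model_id = json.loads(register_response_body)["model_id"]
```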
Testing the Amazon Comprehend API
After registration, invoke the model's _predict endpoint to test the API:

payload = {
    "parameters": {
        "Text": "你知道厕所在哪里吗"  # Chinese: "Do you know where the bathroom is?"
    }
}

# POST to /_plugins/_ml/models/<model_id>/_predict
The expected output shows the language code zh with a high score:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": {
              "Languages": [
                {
                  "LanguageCode": "zh",
                  "Score": 1.0
                }
              ]
            }
          }
        }
      ],
      "status_code": 200
    }
  ]
}
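A small helper can pull the detected language out of this nested response. A sketch, assuming the response shape shown above:

```python
def extract_language(inference_response):
    """Return (language_code, score) from a Comprehend connector response."""
    output = inference_response["inference_results"][0]["output"][0]
    languages = output["dataAsMap"]["response"]["Languages"]
    # Comprehend may return several candidates; take the highest-scoring one.
    top = max(languages, key=lambda lang: lang["Score"])
    return top["LanguageCode"], top["Score"]

sample = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "response",
                    "dataAsMap": {
                        "response": {"Languages": [{"LanguageCode": "zh", "Score": 1.0}]}
                    },
                }
            ],
            "status_code": 200,
        }
    ]
}

print(extract_language(sample))  # ('zh', 1.0)
```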
Creating an Ingest Pipeline
Set up an OpenSearch ingest pipeline that utilizes the Amazon Comprehend API to annotate the language of your documents.
{
  "description": "ingest identify lang with the comprehend API",
  "processors": [
    {
      "ml_inference": {
        "model_id": comprehend_model_id,
        "input_map": [
          {
            "Text": "Text"
          }
        ],
        "output_map": [
          {
            "detected_language": "response.Languages[0].LanguageCode",
            "language_score": "response.Languages[0].Score"
          }
        ]
      }
    }
  ]
}
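To see what the processor's output_map does to a document, the dot-path extraction can be mimicked locally. This is a simplified sketch of the idea, not the processor's actual implementation (which supports a fuller JSONPath syntax):

```python
import re

def get_path(obj, path):
    """Resolve a dotted path like 'response.Languages[0].LanguageCode'."""
    for part in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        if part.startswith("["):
            obj = obj[int(part[1:-1])]  # list index, e.g. [0]
        else:
            obj = obj[part]  # dict key
    return obj

# A fake model response and the output_map from the pipeline above.
model_response = {"response": {"Languages": [{"LanguageCode": "de", "Score": 0.99}]}}
output_map = {
    "detected_language": "response.Languages[0].LanguageCode",
    "language_score": "response.Languages[0].Score",
}

doc = {"Text": "Wo ist die Toilette?"}
doc.update({field: get_path(model_response, path) for field, path in output_map.items()})
print(doc["detected_language"], doc["language_score"])  # de 0.99
```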
Part 2: Utilizing the Amazon Bedrock ML Connector for Semantic Search
Next, we will demonstrate how to enhance OpenSearch capabilities using the Amazon Bedrock connector to access the Amazon Titan Text Embeddings v2 model.
Overview of Amazon Bedrock
Amazon Bedrock provides an easy interface to various powerful AI foundation models, including those from Amazon and industry leaders. This allows you to customize models for your specific needs while adhering to security and responsible AI practices.
Steps to Set Up the Amazon Bedrock Connector
- Creating the OpenSearch ML connector: similar to the Comprehend connector, define the parameters and setup for Amazon Bedrock.
- Creating an index: configure the index to accommodate sentence vectors and related data.
- Setting up the ingest pipeline: streamline data processing using OpenSearch ingestion capabilities.
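As a sketch of the first step, a connector payload for Titan Text Embeddings v2 mirrors the Comprehend one. The Region, role ARN, and endpoint below are assumptions modeled on the public Bedrock connector blueprints, so adjust them for your account:

```python
bedrock_region = "us-east-1"  # assumption: use your own Region
sageMakerOpenSearchRoleArn = "arn:aws:iam::123456789012:role/SageMaker-OpenSearch-demo-role"  # placeholder ARN
model_id = "amazon.titan-embed-text-v2:0"

bedrock_connector_payload = {
    "name": "Bedrock Titan embeddings connector",
    "description": "Connector to the Amazon Titan Text Embeddings v2 model",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {"roleArn": sageMakerOpenSearchRoleArn},
    "parameters": {"region": bedrock_region, "service_name": "bedrock"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": f"https://bedrock-runtime.{bedrock_region}.amazonaws.com/model/{model_id}/invoke",
            "headers": {"content-type": "application/json"},
            # Titan embeddings expect an inputText field in the request body.
            "request_body": '{ "inputText": "${parameters.inputText}" }',
        }
    ],
}
```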
Example of Indexing Real Data
You can use Pandas to load sentences from language-specific JSON files and prepare them for indexing:
import json
import pandas as pd

def load_sentences(file_name):
    sentences = []
    with open(file_name, 'r', encoding='utf-8') as file:
        for line in file:
            try:
                data = json.loads(line)
                sentences.append({
                    'sentence': data['sentence'],
                    'sentence_english': data['sentence_english']
                })
            except json.JSONDecodeError:
                continue
    return pd.DataFrame(sentences)

german_df = load_sentences('german.json')
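From here, each DataFrame row can be turned into an OpenSearch _bulk request. A sketch of building the bulk body, with the embedding call stubbed out (in the full demo it would go through the Bedrock connector; Titan v2 returns 1024-dimensional vectors by default):

```python
import json
import pandas as pd

def build_bulk_body(df, index_name, embed_fn):
    """Build an OpenSearch _bulk request body from a sentences DataFrame.

    embed_fn maps a sentence to its vector; here it stands in for the
    Titan embeddings call made through the connector.
    """
    lines = []
    for _, row in df.iterrows():
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({
            "sentence": row["sentence"],
            "sentence_english": row["sentence_english"],
            "sentence_vector": embed_fn(row["sentence"]),
        }))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

german_df = pd.DataFrame([
    {"sentence": "Wo ist der Bahnhof?", "sentence_english": "Where is the train station?"},
])
# Stub embedding: a zero vector with Titan v2's default dimension of 1024.
body = build_bulk_body(german_df, "sentences", embed_fn=lambda s: [0.0] * 1024)
```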
Performing Semantic k-NN Searches
Once the data is indexed, you can run k-NN searches to find semantically similar sentences across multiple languages.
search_query = {
    "query": {
        "knn": {
            "sentence_vector": {
                "vector": query_vector,
                "k": 30
            }
        }
    }
}
This will allow you to retrieve relevant documents based on the context and language of your query.
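The hits that come back can be flattened into (score, sentence, English gloss) tuples for display. A sketch assuming the index fields used above, run against a hypothetical search response:

```python
def top_matches(search_response, n=3):
    """Extract the top-n (score, sentence, English gloss) tuples from a search response."""
    hits = search_response["hits"]["hits"]
    return [
        (hit["_score"], hit["_source"]["sentence"], hit["_source"]["sentence_english"])
        for hit in hits[:n]
    ]

# Hypothetical response shape; a real response carries more metadata per hit.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.92, "_source": {"sentence": "Wo ist der Bahnhof?",
                                         "sentence_english": "Where is the train station?"}},
        ]
    }
}
print(top_matches(sample_response))
```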
Conclusion
By leveraging the OpenSearch ML connectors for Amazon Comprehend and Amazon Bedrock, you can significantly enhance your data ingestion process, integrating powerful ML capabilities directly into your data pipeline. Using OpenSearch’s ML connectors not only simplifies your architecture and reduces operational costs, but also gives you the flexibility to handle complex ML use cases efficiently.
For more hands-on implementation details, visit the GitHub repository and explore the full demo. Happy analyzing!
About the Authors
John Trollinger – Principal Solutions Architect specializing in OpenSearch and Data Analytics.
Shwetha Radhakrishnan – Solutions Architect focused on Data Analytics & Machine Learning at AWS.