Announcing the Day Zero Availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart
We’re excited to announce the day zero availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart. This state-of-the-art model integrates the understanding of video, audio, images, and text into a single, efficient architecture, empowering enterprise customers to build intelligent applications that can "see," "hear," and "reason" across modalities in a single inference pass.
In this post, we’ll guide you through the architecture and key features of Nemotron 3 Nano Omni, delve into the enterprise applications it enables, and provide insights on how to deploy and run inference using Amazon SageMaker JumpStart.
Overview of NVIDIA Nemotron 3 Nano Omni
NVIDIA Nemotron 3 Nano Omni is an open, multimodal large language model with 30 billion total parameters and 3 billion active parameters. Built on a Mamba2-Transformer hybrid Mixture of Experts (MoE) architecture, it integrates three main components:
- Nemotron 3 Nano LLM: Serves as the backbone for language processing.
- CRADIO v4-H: Functions as the vision encoder for understanding images and videos.
- Parakeet: Acts as the speech encoder for audio transcription and comprehension.
This unified architecture allows the model to process various input types—video, audio, images, and text—while generating text output. The model offers a 131K-token context length and supports chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. On SageMaker JumpStart it runs at FP8 precision, balancing accuracy and efficiency for enterprise workloads, and it is available under the NVIDIA Open Model Agreement for commercial use.
Transforming Enterprise Agent Workflows
Modern enterprise workflows are inherently multimodal. Agents must interpret screens, documents, audio, video, and text often within the same reasoning loop. Traditionally, organizations have relied on separate models for vision, speech, and language, leading to increased latency, complicated orchestration, fragmented context, and escalating costs.
Nemotron 3 Nano Omni addresses these challenges by serving as the multimodal perception and context sub-agent in a larger architectural ecosystem. It effectively equips agents with the ability to read screens, interpret documents, transcribe speech, and analyze video, all while maintaining a unified multimodal context across reasoning loops.
Input Capabilities
The model supports an array of input types:
| Input Type | Supported Formats | Constraints |
|---|---|---|
| Video | mp4 | Up to 2 minutes, 256 frames max |
| Audio | wav, mp3 | Up to 1 hour, 8kHz+ sampling rate |
| Image | JPEG, PNG (RGB) | Standard resolution |
| Text | String | Up to 131K context |
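As a quick pre-flight check before sending a request, the constraints in the table above can be encoded in a small validator. This is only a sketch: the function name and keyword arguments are hypothetical, and the limits simply mirror the table.

```python
# Limits taken from the input-constraints table above.
MAX_VIDEO_SECONDS = 120       # video: up to 2 minutes
MAX_VIDEO_FRAMES = 256        # video: 256 frames max
MAX_AUDIO_SECONDS = 3600      # audio: up to 1 hour
MIN_AUDIO_SAMPLE_RATE = 8000  # audio: 8 kHz+ sampling rate
MAX_TEXT_TOKENS = 131_072     # text: up to 131K-token context

def validate_input(kind: str, **props) -> bool:
    """Return True if the input satisfies the documented constraints."""
    if kind == "video":
        return (props["seconds"] <= MAX_VIDEO_SECONDS
                and props["frames"] <= MAX_VIDEO_FRAMES)
    if kind == "audio":
        return (props["seconds"] <= MAX_AUDIO_SECONDS
                and props["sample_rate"] >= MIN_AUDIO_SAMPLE_RATE)
    if kind == "text":
        return props["tokens"] <= MAX_TEXT_TOKENS
    if kind == "image":
        # Standard-resolution JPEG/PNG assumed; no hard limit documented.
        return True
    raise ValueError(f"unsupported input type: {kind}")
```

Rejecting an over-limit clip client-side avoids a round trip to the endpoint only to receive an error.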
Enterprise Use Cases
The multimodal prowess of Nemotron 3 Nano Omni empowers various enterprise applications:
- Computer Use Agents: Powering the perception loop for agents navigating user interfaces, it simplifies operations across incident management dashboards, browser automation, and more.
- Document Intelligence: Capable of interpreting complex documents, charts, and mixed media, it enhances workflows for compliance involving contracts and financial documents.
- Audio and Video Understanding: In customer service or research workflows, it maintains continuous context across audio and video, enriching applications from meeting analysis to verification tasks.
Getting Started with SageMaker JumpStart
Deploying Nemotron 3 Nano Omni with Amazon SageMaker JumpStart is straightforward. You can do this in just a few steps:
Deploy via SageMaker Studio
- Open Amazon SageMaker Studio.
- In the left navigation panel, select JumpStart.
- Search for Nemotron 3 Nano Omni.
- Choose the model card and click Deploy.
- Configure instance settings and click Deploy to create the endpoint.
Deploy via SageMaker Python SDK
For programmatic deployment, use the SageMaker Python SDK as follows:
```python
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="huggingface-vlm-nvidia-nemotron3-nano-omni-30ba3b-reasoning-fp8",
    role="",  # set to your SageMaker execution role ARN
)

predictor = model.deploy(
    accept_eula=True,  # the model's EULA must be accepted to deploy
)
```
Running Inference
Once deployed, you can send multimodal requests to the endpoint, starting with image understanding:
```python
import base64

def encode_image(image_path):
    # Read a local image and base64-encode it for the request body.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("example.jpg")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
```
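Audio requests follow the same message schema. The sketch below builds a transcription payload; the `audio_url` content type is an assumption based on the OpenAI-style schema used above, and `build_audio_payload` is a hypothetical helper, not part of the SDK.

```python
import base64

def encode_audio(audio_path):
    # Read a local wav/mp3 file and base64-encode it for the request body.
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_audio_payload(audio_b64, prompt="Transcribe this audio with word-level timestamps."):
    # Mirrors the image payload above; "audio_url" is assumed to follow
    # the same content-part convention as "image_url".
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            ],
        }],
        "max_tokens": 1024,
        "temperature": 0.2,  # Instruct-mode setting recommended for ASR
    }
```

The resulting dictionary is sent the same way as the image payload, via `predictor.predict(...)`.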
Recommended Inference Parameters
Depending on your use case, the following inference parameter settings can improve your results:
| Mode | Temperature | top_p | max_tokens | Use Case |
|---|---|---|---|---|
| Thinking | 0.6 | 0.95 | 20480 | Complex reasoning |
| Instruct | 0.2 | N/A | 1024 | General tasks, ASR |
For reasoning and complex tasks, enable Thinking mode; for straightforward requests, use Instruct mode for quicker responses.
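The table above can be folded into a small helper that merges the recommended settings into a request payload. This is a hypothetical convenience function, not part of the SDK; the preset values simply mirror the table.

```python
# Recommended settings from the table above; Instruct mode does not use top_p.
GENERATION_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480},
    "instruct": {"temperature": 0.2, "max_tokens": 1024},
}

def with_preset(payload: dict, mode: str = "instruct") -> dict:
    """Return a copy of the request payload with the mode's preset merged in."""
    if mode not in GENERATION_PRESETS:
        raise ValueError(f"unknown mode: {mode}")
    return {**payload, **GENERATION_PRESETS[mode]}
```

For example, `with_preset({"messages": [...]}, "thinking")` prepares a request for complex reasoning, while the default prepares one for general tasks and ASR.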
Conclusion
NVIDIA Nemotron 3 Nano Omni brings a new level of multimodal AI to Amazon SageMaker JumpStart. By unifying video, audio, image, and text understanding into a single efficient model, it streamlines the development of agentic applications while delivering strong accuracy and efficiency.
Start leveraging Nemotron 3 Nano Omni for your next intelligent application by deploying it today from Amazon SageMaker JumpStart. For more details, visit the NVIDIA Nemotron model page on Hugging Face.
About the Authors
Dan Ferguson is a Solutions Architect at AWS based in New York, focusing on supporting customers in efficient and sustainable ML integration.
Malav Shastri is a Software Development Engineer at AWS on the Amazon SageMaker JumpStart and Amazon Bedrock teams, working to broaden access to state-of-the-art models.
Vivek Gangasani leads Solutions Architecture for SageMaker Inference, focusing on deploying and optimizing generative AI models and workflows.