Introducing OS-Level Actions in the Amazon Bedrock AgentCore Browser


In the rapidly evolving landscape of web automation, AI agents are playing an increasingly crucial role. These agents automate workflows within the browser’s web layer—specifically, the Document Object Model (DOM) that tools like Playwright and the Chrome DevTools Protocol (CDP) expose. However, as powerful as these tools are, they encounter significant limitations. Native operating system (OS) interfaces—such as security prompts, print dialogs, and context menus—exist beyond the reach of the web layer. They remain invisible to automation scripts relying solely on CDP or Playwright.

To bridge this critical gap, we’re thrilled to announce OS Level Actions for AgentCore Browser, a groundbreaking feature that allows AI agents to interact directly with the OS, enabling seamless interaction with content visible on the screen.

The Challenge of Web Automation

Most web workflows involve navigating pages, filling forms, and clicking elements—all tasks well within the capabilities of frameworks like Playwright. Yet, several scenarios reveal the boundaries of web automation:

  1. Native Dialogs and OS Prompts: When a web application triggers a print dialog via window.print(), Playwright can't interact with it because the dialog lives outside the DOM (see the sketch after this list). Such scenarios often arise in production, driven by specific application states or user permissions.

  2. Dynamic UI Interactions: Keyboard shortcuts and right-click actions are essential to many automated workflows, but CDP dispatches input only within the page, not at the OS level, so these commands become dead ends.

  3. Vision-Enabled Agents: Common architectures for AI agents involve capturing screenshots to determine the next action. However, once a native UI appears, the agent can "see" what needs to be done but lacks the means to act on it.
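
To make the boundary concrete, here is a minimal Playwright sketch of the first scenario. Everything in it is standard Playwright; the point is what it cannot do once the native dialog appears.

```python
from playwright.sync_api import sync_playwright

# Minimal sketch of the DOM boundary: Playwright can trigger window.print(),
# but the resulting dialog is a native OS window that no locator can reach.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    # Fire print() asynchronously so evaluate() returns while the native
    # dialog is open (window.print() itself blocks until it is dismissed).
    page.evaluate("setTimeout(() => window.print(), 0)")
    page.wait_for_timeout(2000)  # the dialog is now on screen
    # From here, page.locator(), page.click(), and page.screenshot() all
    # operate on the DOM only; none of them can see or dismiss the dialog.
    browser.close()
```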

Introducing OS Level Actions

OS Level Actions transforms the landscape of web automation by exposing direct OS control through the InvokeBrowser API. This capability allows agents to interact not only with web elements in the DOM but also with native UI components that are essential for fully automated workflows. Here’s how it works:

How OS Level Actions Work

  1. Action Execution: After a session is active, the agent can dispatch actions using the InvokeBrowser API, specifying one action type at a time. Each action will return a SUCCESS or FAILED status, tied to the browser session.

  2. Action-Screenshot-Reaction Loop: This fundamental pattern, sketched in code after this list, involves:

    • The agent sends an action (like a mouse click or keyboard shortcut).
    • AgentCore executes the action at the OS level and confirms with a status.
    • The agent captures a full screenshot to analyze the current screen.
    • Based on the observation, the agent decides the next action to take.
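
In code, the loop is a simple observe-decide-act cycle. The sketch below is a hypothetical illustration: it assumes the InvokeBrowser API is exposed through boto3's bedrock-agentcore client as invoke_browser, and the payload and response field names are guesses for readability; check the AgentCore Browser API reference for the actual schema. decide_next_action stands in for your vision model.

```python
import boto3

client = boto3.client("bedrock-agentcore", region_name="us-west-2")

def step(browser_id: str, session_id: str, action: dict) -> dict:
    """Dispatch one OS-level action and return the service response."""
    return client.invoke_browser(          # assumed boto3 operation name
        browserIdentifier=browser_id,      # assumed field names throughout
        sessionId=session_id,
        action=action,
    )

def run_loop(browser_id: str, session_id: str, decide_next_action) -> None:
    """decide_next_action maps a base64 screenshot to the next action dict,
    or None once the goal is met (a placeholder for a vision model)."""
    while True:
        shot = step(browser_id, session_id, {"actionType": "screenshot"})
        action = decide_next_action(shot["screenshot"])
        if action is None:
            break
        result = step(browser_id, session_id, action)
        if result.get("status") == "FAILED":
            raise RuntimeError(f"OS action failed: {action}")
```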

Supported Actions

OS Level Actions encompass three main categories of capability; the illustrative payloads after this list show what each might look like in practice:

  1. Mouse Control: Actions cover mouse clicks, moves, drags, and scrolling; for example:

    • mouseClick: Triggers a click at specified coordinates.
    • mouseMove: Moves the cursor to specified coordinates.
    • mouseDrag: Drags the cursor from one point to another.
  2. Keyboard Input: Agents can simulate keystrokes for text input, key presses, and shortcuts:

    • keyType: Types out a specified string.
    • keyPress: Presses a specified key a certain number of times.
    • keyShortcut: Simulates key combinations.
  3. Visual Capture: Captures the entire OS desktop as a screenshot, returning it as a base64-encoded PNG.
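
As a rough illustration of how these categories might look on the wire, here are example action payloads. The field names extend the action names above and are assumptions, not a confirmed schema.

```python
# Illustrative payloads for the three categories; field names are assumed.
mouse_click = {"actionType": "mouseClick", "x": 640, "y": 400, "button": "left"}
mouse_drag = {"actionType": "mouseDrag",
              "from": {"x": 100, "y": 100}, "to": {"x": 300, "y": 250}}
key_type = {"actionType": "keyType", "text": "quarterly-report.pdf"}
key_press = {"actionType": "keyPress", "key": "Tab", "count": 3}
key_shortcut = {"actionType": "keyShortcut", "keys": ["Ctrl", "P"]}
screenshot = {"actionType": "screenshot"}  # returns a base64-encoded PNG
```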

Getting Started with OS Level Actions

Setting up OS Level Actions is straightforward. Here are the steps to begin, which the end-to-end sketch after this list ties together:

  1. Client Setup: Initialize two clients—one for managing browser resources and another for dispatching actions during a session.

  2. Resource Creation: Before starting a session, create an IAM role with the necessary permissions and a browser resource.

  3. Start a Browser Session: Define viewport parameters and timeout settings for managing the session effectively.

  4. Invoke Actions: Using the InvokeBrowser API, send mouse clicks, keystrokes, and take screenshots as necessary.

  5. Observe and React: After every action, the agent must capture the screen state to inform the subsequent steps.
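
Tying the five steps together, a minimal setup might look like the sketch below. It uses boto3's bedrock-agentcore-control client for resource management and bedrock-agentcore for session actions; the parameter and response field names are assumptions to be checked against the API reference, and the IAM role ARN is a placeholder for a role you have already created.

```python
import boto3

# 1. Client setup: one control-plane client, one data-plane client.
control = boto3.client("bedrock-agentcore-control", region_name="us-west-2")
data = boto3.client("bedrock-agentcore", region_name="us-west-2")

# 2. Resource creation (assumed request/response shapes; the role must
#    already exist with AgentCore Browser permissions).
browser = control.create_browser(
    name="os_actions_demo",
    executionRoleArn="arn:aws:iam::123456789012:role/AgentCoreBrowserRole",
    networkConfiguration={"networkMode": "PUBLIC"},
)
browser_id = browser["browserId"]

# 3. Start a session with a viewport and timeout (viewPort is an assumed
#    field name).
session = data.start_browser_session(
    browserIdentifier=browser_id,
    sessionTimeoutSeconds=900,
    viewPort={"width": 1280, "height": 800},
)
session_id = session["sessionId"]

# 4-5. Invoke actions and observe: see the action-screenshot-reaction
#      loop sketched earlier.
```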

Practical Example: Dismissing a Print Dialog

To illustrate the action-screenshot-reaction loop, consider a scenario where a web app calls window.print(), prompting a print dialog. Where Playwright stops, OS Level Actions can take over, as the sketch after these steps shows:

  1. Capture Current State: The agent initiates a screenshot to assess the state when the print dialog appears.

  2. Vision Processing: Utilize a vision model to identify the dialog and its elements (e.g., the Cancel button).

  3. Execute Action: The agent triggers a click on the identified button coordinates and takes another screenshot to verify the action’s success.
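
Here is a hedged sketch of those three steps, reusing the assumed invoke_browser call from earlier; find_cancel_button stands in for a vision model that returns the Cancel button's screen coordinates from a base64 PNG.

```python
def dismiss_print_dialog(data, browser_id: str, session_id: str,
                         find_cancel_button) -> str:
    # 1. Capture the current state, including the native print dialog.
    shot = data.invoke_browser(                    # assumed operation name
        browserIdentifier=browser_id,
        sessionId=session_id,
        action={"actionType": "screenshot"},
    )
    # 2. Vision processing: locate the Cancel button in the screenshot.
    x, y = find_cancel_button(shot["screenshot"])  # placeholder model call
    # 3. Click at those coordinates at the OS level.
    data.invoke_browser(
        browserIdentifier=browser_id,
        sessionId=session_id,
        action={"actionType": "mouseClick", "x": x, "y": y, "button": "left"},
    )
    # Re-capture so the agent can verify the dialog is gone.
    verify = data.invoke_browser(
        browserIdentifier=browser_id,
        sessionId=session_id,
        action={"actionType": "screenshot"},
    )
    return verify["screenshot"]
```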

Conclusion

With the introduction of OS Level Actions for AgentCore Browser, AI agents can now seamlessly interact with both web elements and native OS components. This innovation enhances the capabilities of automation frameworks, effectively removing blockers that have long hindered full workflow automation. As AI and machine learning continue to evolve, OS Level Actions marks a significant leap in bridging the gap between web-based interactions and native system interfaces.

Whether you’re aiming to improve existing workflows or dive into the new features of AgentCore Browser, this development opens the door to previously unattainable automation possibilities.

About the Authors

To learn more about AgentCore Browser and its capabilities, don’t hesitate to reach out. Our team, comprising experienced professionals in AI, data science, and software engineering, is excited to help you navigate your automation journeys.
