Experience the Future: AI-Powered Browser Control at Your Fingertips!

Over the past couple of weeks, Amazon AGI SF Lab took a significant step forward in actionable AI agents by unveiling Nova-Act, an agent that can control web browsers. Nova-Act builds on Amazon's Nova models and introduces a specialized model designed to power the agent system.

A Focus on Reliability

Unlike other AI agents that aim for broad actions, Nova-Act emphasizes reliability. The agent is trained to take discrete actions in a web browser with high accuracy. It checks the surroundings of each interaction, compares what it sees to the expected outcome, and proceeds only when it is confident that the previous action delivered the expected result. Reliability is essential for action-based agents: imagine an AI agent clicking 'Buy' with the wrong items in your cart. Hallucinations happen, but a well-designed system is built to handle them gracefully.

The Nova-Act SDK

The Nova-Act SDK is a game-changer for developers. It lets you call the Nova system from Python code. You describe the desired action in plain language in a call to nova.act(), and the system executes it in the browser. For example, once a browser session has been opened on Amazon.com, the following line searches for a 64 GB USB drive. It is this easy.

nova.act("search for a 64 gigabyte USB drive")
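The one-liner above needs a browser session to run against. Here is a minimal sketch of a complete script, assuming the NovaAct class and its starting_page argument behave as described in the project README at the time of writing (check the repository for the current API). The helper names build_search_prompt and run_search are hypothetical and used only for illustration.

```python
def build_search_prompt(item: str) -> str:
    """Turn an item description into a direct, single-step instruction."""
    return f"search for a {item.strip()}"


def run_search(item: str, starting_page: str = "https://www.amazon.com") -> None:
    """Open a browser session on starting_page and run one act() call."""
    # Imported here so build_search_prompt works without the SDK installed.
    # Assumption: the package exposes NovaAct as described in its README.
    from nova_act import NovaAct

    # The context manager starts the browser and closes it when done.
    with NovaAct(starting_page=starting_page) as nova:
        nova.act(build_search_prompt(item))
```

Calling run_search("64 gigabyte USB drive") should open a browser window and perform the search, provided the SDK is installed and your API key is configured.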

Key Features

Nova-Act boasts several impressive features:

Common Actions: The agent can navigate through buttons, date fields, and data entry fields on a webpage. It can also handle dropdowns and scroll through pages.
User Data Folder: For authentication and cookie support, the agent can reference user data folders.
Security: Built-in security measures prompt users for keyboard entry when necessary.
Structured Data: The agent can search a webpage and retrieve data into a local model for analysis.
Threading: The agent can execute multiple requests across several browser instances simultaneously via threading.
Headless: The agent can perform its work in a headless mode which does not require an active browser window.
File Support: The agent can download files.
Logging and Session Recording: The agent records traces and logs its actions, allowing users to see what happened during an act. Note that it cannot solve CAPTCHAs.
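The Threading and Headless features combine naturally. The sketch below fans several searches out across a thread pool, with one headless browser session per query. It assumes that NovaAct accepts headless=True and that the object returned by act() has a response attribute, as I understand the project README; the function names search_headless and run_in_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def search_headless(query: str) -> str:
    """Run one search in its own headless browser session."""
    # Assumptions: headless=True and result.response, per the project README.
    from nova_act import NovaAct

    with NovaAct(starting_page="https://www.amazon.com", headless=True) as nova:
        result = nova.act(
            f"search for {query} and return the title of the first result"
        )
        return str(result.response)


def run_in_parallel(
    queries: List[str],
    worker: Callable[[str], str] = search_headless,
    max_workers: int = 3,
) -> Dict[str, str]:
    """Fan the queries out across threads, one worker call per query."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with queries.
        return dict(zip(queries, pool.map(worker, queries)))
```

The worker is passed in as a parameter so the fan-out logic can be tried with a stand-in function before any browser sessions are launched.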

Limitations

Despite its advanced capabilities, Nova-Act has some limitations. It cannot interact with non-browser applications, with elements hidden behind a mouseover, or with the browser window itself. Additionally, it is unreliable with high-level prompts, so larger tasks need to be broken down into direct, specific instructions.
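One way to work within the last limitation is to split a high-level goal into a list of direct instructions and issue each one as its own act() call. The sketch below shows that pattern; the step wording and the names Actor, COFFEE_MAKER_STEPS, and run_steps are hypothetical, and the only assumption about the SDK is that a session has an act(prompt) method.

```python
from typing import List, Protocol


class Actor(Protocol):
    """Anything with an act(prompt) method, such as a NovaAct session."""

    def act(self, prompt: str) -> object: ...


# Hypothetical breakdown of "order a coffee maker" into direct instructions.
COFFEE_MAKER_STEPS = [
    "search for a coffee maker",
    "select the first result",
    "scroll until you see 'add to cart' and then click 'add to cart'",
]


def run_steps(agent: Actor, steps: List[str]) -> int:
    """Issue each instruction as its own act() call, in order."""
    for step in steps:
        agent.act(step)
    return len(steps)
```

Inside an open session you would call run_steps(nova, COFFEE_MAKER_STEPS). Keeping each step small also makes the trace logs easier to read when one of them fails.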

The Art of the Possible

The examples in the provided source code cover a variety of outcomes: an agent that orders your favorite lunch every day, an agent that collects data about available apartments in an area and outputs it as a formatted list, and an agent that orders a coffee maker for you from Amazon.com. Most are simple, but they demonstrate important techniques such as the use of structured data and authentication.

An ethical question comes into play, because you can use the agent to load webpages and scrape their data even when a website blocks crawlers. When using the Nova-Act agent on a third-party website, be sure to respect the website's Terms of Service (ToS). Scraping is not inherently unethical, but you should act responsibly and respect the ToS.

[Figure: screenshot of the trace log from the start of the act execution]

In just a few minutes, I wrote an example that searches for a 64 gigabyte USB drive on Amazon.com. The above figure shows the trace log from the start of the act execution. The code was only a few lines long, but it generated several pages of logs that are useful for debugging.

The output in the browser window was simple, but it demonstrates a very powerful tool: the agent opened the window, navigated to the site, searched for the terms, and showed the results, all fully automated. You could quickly extend the example to retrieve the results into a local data structure and write them to a database for analysis.
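As a rough sketch of that extension, the code below asks the agent for structured results and stores them in SQLite. It assumes that act() accepts a schema argument holding a JSON Schema and that the result exposes matches_schema and parsed_response, as I understand the project README. The schema fields, table layout, and function names are hypothetical.

```python
import sqlite3
from typing import Dict, List

# JSON Schema describing the data we want back (hypothetical fields).
PRODUCT_LIST_SCHEMA = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["title", "price"],
            },
        }
    },
    "required": ["products"],
}


def fetch_products() -> List[Dict]:
    """Ask the agent for the visible search results as structured data."""
    # Assumptions: schema=, matches_schema, parsed_response (per the README).
    from nova_act import NovaAct

    with NovaAct(starting_page="https://www.amazon.com") as nova:
        nova.act("search for a 64 gigabyte USB drive")
        result = nova.act(
            "Return the title and price of each product in the search results",
            schema=PRODUCT_LIST_SCHEMA,
        )
        if not result.matches_schema:
            return []
        return result.parsed_response["products"]


def save_products(conn: sqlite3.Connection, products: List[Dict]) -> int:
    """Insert the rows into a products table and return the total row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (title TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products (title, price) VALUES (:title, :price)", products
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

With a connection from sqlite3.connect("products.db"), calling save_products(conn, fetch_products()) would run the search and persist whatever rows came back. Checking matches_schema first keeps malformed responses out of the database.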

Getting Started

To get started with Nova-Act, visit the Nova-Act website and request to be included in the Act Preview. After a few hours, you will typically receive an invitation to the program along with your Nova-Act API Key.

You only need Python 3.10 or above and a macOS or Ubuntu system. For my testing I used a macOS laptop. You can follow the setup guidance at https://github.com/aws/nova-act.
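Before running the samples, a quick preflight check can confirm those requirements. This is a sketch only: the function name preflight is hypothetical, the environment variable name NOVA_ACT_API_KEY is an assumption based on the project README, and the platform check accepts any Linux system rather than Ubuntu specifically.

```python
import os
import platform
import sys
from typing import List, Mapping, Optional, Tuple


def preflight(
    version: Optional[Tuple[int, int]] = None,
    system: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return a list of unmet requirements; an empty list means ready."""
    # Fall back to the real interpreter, platform, and environment.
    version = version if version is not None else tuple(sys.version_info[:2])
    system = system if system is not None else platform.system()
    env = env if env is not None else os.environ

    problems = []
    if version < (3, 10):
        problems.append("Python 3.10 or above is required")
    # platform.system() reports "Darwin" on macOS and "Linux" on Ubuntu.
    if system not in ("Darwin", "Linux"):
        problems.append("a macOS or Ubuntu system is required")
    # Assumption: the SDK reads the key from this environment variable.
    if not env.get("NOVA_ACT_API_KEY"):
        problems.append("NOVA_ACT_API_KEY is not set")
    return problems
```

Calling preflight() with no arguments checks the current machine. The parameters exist so the logic can be exercised with made-up values.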

From there, I suggest cloning the code and trying out the provided samples once you have read and understood the Disclosures.

What is Next?

I'm excited to invite you on a journey into the world of action-based agent AI in Part 2 of this series where I'll guide you through the code, step-by-step, to help you build a dynamic AI agent. You'll learn how to leverage AI to create responsive and intelligent systems that can revolutionize your projects. Coming soon!
