Introducing SeeAct: A Generalist Web Agent Framework
SeeAct is a system for building generalist web agents that autonomously complete tasks on any given website. It leverages large multimodal models (LMMs) such as GPT-4V(ision) to make web interactions more intuitive and efficient, and it provides the tooling needed to apply these models to web-based tasks.
Key Components
SeeAct comprises two primary components:
- Codebase for Web Agents: A codebase for running web agents on live websites. It acts as the bridge between the agent and the browser, automating interactions and supporting performance evaluation on live webpages.
- Framework Utilizing LMMs: A framework that employs LMMs as generalist agents in the digital space, able to understand and perform a wide variety of tasks across diverse online environments.
Recent Updates
SeeAct is continually evolving with recent improvements including:
- Crawler mode to navigate through websites more freely.
- Support for various grounding strategies enhancing task execution.
- Compatibility with new models like Gemini and LLaVA.
- Acceptance into ICML'24, indicating its contribution to the machine learning community.
Setting Up and Using SeeAct
Getting started with SeeAct involves creating a conda environment and installing relevant dependencies. Playwright, a tool for automating web interactions, is integral to SeeAct's functionality, allowing web agents to simulate user activities like clicking and typing.
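A hedged sketch of a typical setup, assuming a local clone of the repository; the environment name and requirements file shown here are assumptions, not SeeAct's exact instructions:

```shell
# Create and activate an isolated environment (name is an assumption)
conda create -n seeact python=3.11 -y
conda activate seeact

# Install the project's Python dependencies from the cloned repository
pip install -r requirements.txt

# Download the browser binaries Playwright needs to drive a real browser
playwright install chromium
```

The `playwright install` step matters: Playwright controls its own managed browser builds, so the agent's clicks and keystrokes run against a real, reproducible browser environment.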
Features and Usage Modes
- Demo Mode: A mode for testing task execution by inputting a task and a website directly.
- Auto Mode: Allows automated task execution based on predefined lists, enabling batch processing.
- Crawler Mode: Introduces a novel way for the agent to explore links and pages autonomously.
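Crawler mode's link exploration can be pictured as a breadth-first traversal over pages. The sketch below is a minimal stand-in, assuming pages are available as HTML strings; SeeAct's real crawler drives a live browser, whereas the `fetch` callable here is a hypothetical placeholder.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Explore pages breadth-first, returning the visit order."""
    seen, order = {start_url}, []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(fetch(url))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" for demonstration.
site = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": "",
}
print(crawl("/", site.__getitem__))  # → ['/', '/a', '/b']
```

The `seen` set prevents revisiting pages and `max_pages` bounds the exploration, two safeguards any autonomous crawler needs on the open web.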
Safety and Monitoring
As a research tool, SeeAct emphasizes safety and monitoring. It includes settings that require user confirmation before executing potentially impactful actions, ensuring users can inspect and manage the agent's decisions before they are finalized.
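This confirmation setting can be sketched as a gate in front of the action executor: impactful actions are held until a reviewer approves them. The action names and the `confirm` callback below are illustrative assumptions, not SeeAct's actual API.

```python
# Actions assumed to change page state and therefore require confirmation.
IMPACTFUL = {"CLICK", "TYPE", "SUBMIT"}

def execute(action, target, confirm, run):
    """Run `action` on `target`, asking `confirm` first when impactful."""
    if action in IMPACTFUL and not confirm(action, target):
        return "skipped"
    run(action, target)
    return "executed"

# Usage: a policy that approves clicks but rejects other impactful actions.
log = []
policy = lambda action, target: action == "CLICK"
runner = lambda action, target: log.append((action, target))

print(execute("CLICK", "#buy-button", policy, runner))  # → executed
print(execute("TYPE", "#password", policy, runner))     # → skipped
print(log)  # only the approved click reached the runner
```

In an interactive session, `confirm` would prompt the user instead of applying a fixed policy, giving a human the final say before the browser state changes.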
Multimodal-Mind2Web Dataset
SeeAct features the Multimodal-Mind2Web dataset, which aligns each HTML document in Mind2Web with its corresponding webpage screenshot. The dataset provides cross-task, cross-website, and cross-domain splits for evaluating how well agents generalize, and the aligned screenshots make it possible to ground an agent's actions in what the page actually looked like.
Experimentation and Developments
Users can experiment with generating visual data from Mind2Web raw dumps and perform online evaluations. Additionally, SeeAct offers configuration flexibility through TOML files, allowing customization for specific research needs.
Licensing
The SeeAct project and its associated datasets are released under Open RAIL licenses, reflecting a commitment to open and transparent research.
Community and Support
The SeeAct development team welcomes collaboration and discussion. For inquiries, reach out via email or GitHub issues.
By continually refining the interaction between AI and web technologies, SeeAct aims to advance how automated agents can perform complex tasks on the web efficiently and safely.