Introducing SeeAct: A Generalist Web Agent Framework
SeeAct is a system for building generalist web agents that autonomously complete tasks on any given website. It leverages large multimodal models (LMMs) such as GPT-4V(ision) to make web interactions more intuitive and efficient, and it provides the tooling needed to apply these models to web-based tasks.
Key Components
SeeAct comprises two primary components:
- Codebase for Web Agents: A codebase for running web agents on live websites. It acts as the bridge between the agent and the browser, automating interactions and supporting performance evaluation on live webpages.
- Framework Utilizing LMMs: A framework that employs LMMs as generalist agents in the digital space, able to understand and perform a wide variety of tasks across diverse online environments.
Recent Updates
SeeAct is continually evolving with recent improvements including:
- Crawler mode to navigate through websites more freely.
- Support for various grounding strategies enhancing task execution.
- Compatibility with new models like Gemini and LLaVA.
- Acceptance into ICML'24, indicating its contribution to the machine learning community.
Setting Up and Using SeeAct
Getting started with SeeAct involves creating a conda environment and installing relevant dependencies. Playwright, a tool for automating web interactions, is integral to SeeAct's functionality, allowing web agents to simulate user activities like clicking and typing.
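A hedged sketch of a typical setup, assuming a local clone of the repository; the environment name and requirements file shown here are assumptions, not SeeAct's exact instructions:

```shell
# Create and activate an isolated environment (name is an assumption)
conda create -n seeact python=3.11 -y
conda activate seeact

# Install the project's Python dependencies from the cloned repository
pip install -r requirements.txt

# Download the browser binaries Playwright needs to drive a real browser
playwright install chromium
```

The `playwright install` step matters: Playwright controls its own managed browser builds, so the agent's clicks and keystrokes run against a real, reproducible browser environment.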
Features and Usage Modes
- Demo Mode: A mode for testing task execution by inputting a task and a website directly.
- Auto Mode: Allows automated task execution based on predefined lists, enabling batch processing.
- Crawler Mode: Introduces a novel way for the agent to explore links and pages autonomously.
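Crawler mode's link exploration can be pictured as a breadth-first traversal over pages. The sketch below is a minimal stand-in, assuming pages are available as HTML strings; SeeAct's real crawler drives a live browser, whereas the `fetch` callable here is a hypothetical placeholder.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Explore pages breadth-first, returning the visit order."""
    seen, order = {start_url}, []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(fetch(url))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" for demonstration.
site = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": "",
}
print(crawl("/", site.__getitem__))  # → ['/', '/a', '/b']
```

The `seen` set prevents revisiting pages and `max_pages` bounds the exploration, two safeguards any autonomous crawler needs on the open web.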
Safety and Monitoring
As a research tool, SeeAct emphasizes safety and monitoring. It includes settings that require user confirmation before executing potentially impactful actions, ensuring users can inspect and manage the agent's decisions before they are finalized.
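This confirmation setting can be sketched as a gate in front of the action executor: impactful actions are held until a reviewer approves them. The action names and the `confirm` callback below are illustrative assumptions, not SeeAct's actual API.

```python
# Actions assumed to change page state and therefore require confirmation.
IMPACTFUL = {"CLICK", "TYPE", "SUBMIT"}

def execute(action, target, confirm, run):
    """Run `action` on `target`, asking `confirm` first when impactful."""
    if action in IMPACTFUL and not confirm(action, target):
        return "skipped"
    run(action, target)
    return "executed"

# Usage: a policy that approves clicks but rejects other impactful actions.
log = []
policy = lambda action, target: action == "CLICK"
runner = lambda action, target: log.append((action, target))

print(execute("CLICK", "#buy-button", policy, runner))  # → executed
print(execute("TYPE", "#password", policy, runner))     # → skipped
print(log)  # only the approved click reached the runner
```

In an interactive session, `confirm` would prompt the user instead of applying a fixed policy, giving a human the final say before the browser state changes.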
Multimodal-Mind2Web Dataset
SeeAct features the Multimodal-Mind2Web dataset, which aligns each HTML document in Mind2Web with its corresponding webpage screenshot. The dataset provides cross-task, cross-website, and cross-domain splits for evaluating how well agents generalize, and the aligned screenshots make it possible to ground an agent's actions in what the page actually looked like.
Experimentation and Developments
Users can experiment with generating visual data from Mind2Web raw dumps and perform online evaluations. Additionally, SeeAct offers configuration flexibility through TOML files, allowing customization for specific research needs.
Licensing
The SeeAct project and its associated datasets are released under Open RAIL licenses, reflecting a commitment to open and transparent research.
Community and Support
The SeeAct development team welcomes collaboration and discussion. For inquiries, reach out via email or GitHub issues.
By continually refining the interaction between AI and web technologies, SeeAct aims to advance how automated agents can perform complex tasks on the web efficiently and safely.