An Introduction to AgentBench
What is AgentBench?
AgentBench is a benchmarking project designed to evaluate the capabilities of large language models (LLMs) when they act as autonomous agents in a variety of environments. It assesses how effectively these models can complete tasks in interactive scenarios that demand the kind of multi-step decision-making typically associated with human agents.
The Latest Updates
VisualAgentBench
In August 2024, VisualAgentBench was introduced to expand the evaluation scope to visual foundation agents. It evaluates large multimodal models (LMMs) across a range of visually intensive environments:
- Embodied: Includes environments like VAB-OmniGibson and VAB-Minecraft.
- GUI: Covers areas like VAB-Mobile and VAB-WebArena-Lite.
- Visual Design: Incorporates environments such as VAB-CSS.
Datasets for behavior cloning training are also provided, which help in developing visual agents.
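Behavior cloning here means supervised imitation: a policy is fine-tuned to reproduce the actions a demonstrator took for each observation. The sketch below illustrates that core training step under this assumption; the dataset layout, the `policy` model, and all names are illustrative placeholders, not the VisualAgentBench training pipeline.

```python
# Minimal behavior-cloning sketch: fit a policy to imitate demonstrated actions.
# Illustrative only -- the actual VisualAgentBench data format and models differ.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def behavior_cloning_epoch(policy: nn.Module, demos: TensorDataset, lr: float = 1e-4):
    """One epoch of supervised imitation on (observation, action) pairs."""
    loader = DataLoader(demos, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for observations, actions in loader:
        logits = policy(observations)      # predicted action distribution
        loss = loss_fn(logits, actions)    # penalize divergence from the demonstrated action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```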
AgentBench v0.2
AgentBench is currently at version 0.2, which brings several enhancements over the previous version. These include:
- An updated framework architecture to make it more user-friendly and adaptable.
- Adjustments in task settings to refine the evaluation process.
- Additional test results for a wider array of models.
- Availability of complete data for both development and test sets.
Evaluating LLMs as Agents
AgentBench is pioneering in its comprehensive approach to evaluating LLMs across different environments. It tests the ability of these models to function as agents in eight distinct scenarios:
- Newly Created Domains: Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), and Lateral Thinking Puzzles (LTP).
- Recompiled Domains: House-Holding (inspired by ALFWorld), Web Shopping (inspired by WebShop), and Web Browsing (inspired by Mind2Web).
These environments offer a diverse set of challenges, ensuring that evaluations can reflect a broad spectrum of potential applications.
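Across all eight environments, evaluation follows the same basic pattern: the model receives a textual observation, replies with an action, and the environment executes it and returns feedback until the task ends or a turn limit is reached. The loop below is a generic sketch of that protocol; `llm` and `env` are hypothetical stand-ins, not AgentBench's actual client and task interfaces.

```python
# Generic LLM-as-agent interaction loop (illustrative; not AgentBench's internal API).
def run_episode(llm, env, max_turns: int = 20) -> bool:
    """Run one task episode and report whether the agent succeeded."""
    observation = env.reset()                      # initial task instruction / state
    history = [{"role": "user", "content": observation}]
    for _ in range(max_turns):
        action = llm.chat(history)                 # the model proposes the next action as text
        history.append({"role": "assistant", "content": action})
        observation, done, success = env.step(action)  # environment executes the action
        history.append({"role": "user", "content": observation})
        if done:
            return success
    return False                                   # turn budget exhausted without finishing
```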
Structure and Usage
Quick Start Guide
AgentBench provides a step-by-step guide for evaluating LLMs: set up the Python environment, configure the agent under test, and start the task servers before launching an evaluation run. Whether you are a beginner or an experienced user, the detailed walkthrough makes integration and testing straightforward.
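To give a sense of what the "configure the agent" step amounts to, the sketch below shows the kind of information an agent configuration carries and how a thin client might turn the dialogue history into the model's next reply. The field names and the `agent_reply` helper are assumptions for illustration; AgentBench's own configuration files and client code define their own schema.

```python
# Illustrative agent setup: a model name, decoding settings, and a thin client.
# Field names and this helper are assumptions, not AgentBench's actual config schema.
from openai import OpenAI

agent_config = {
    "model": "gpt-3.5-turbo",  # model under evaluation
    "temperature": 0.0,        # deterministic decoding for reproducible scores
    "max_tokens": 512,         # cap on each agent reply
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def agent_reply(history: list[dict]) -> str:
    """Given the dialogue so far, return the agent's next action as text."""
    response = client.chat.completions.create(
        model=agent_config["model"],
        messages=history,
        temperature=agent_config["temperature"],
        max_tokens=agent_config["max_tokens"],
    )
    return response.choices[0].message.content
```

Keeping decoding deterministic in a setup like this makes repeated evaluation runs comparable, which matters when tracking small differences between models.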
Next Steps and Extensions
Future work involves introducing more tasks and accommodating additional models, as detailed in the accompanying guides. For those interested in extending AgentBench, comprehensive documentation is available for adding new tasks.
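In broad strokes, a benchmark task is an environment that presents an instruction, executes the agent's actions, and scores the outcome. The toy class below outlines only that general shape; it is not AgentBench's task interface, which is documented in the project's extension guide.

```python
# Purely illustrative outline of a benchmark task; AgentBench defines its own
# task interface in the repository's extension documentation.
class EchoTask:
    """Toy task: the agent must repeat a target string exactly."""

    def __init__(self, target: str = "hello world"):
        self.target = target

    def reset(self) -> str:
        """Return the initial instruction shown to the agent."""
        return f"Repeat the following string exactly: {self.target}"

    def step(self, action: str):
        """Execute one agent action; return (observation, done, success)."""
        success = action.strip() == self.target
        feedback = "Correct." if success else "That is not the target string."
        return feedback, True, success   # single-turn task: always done after one step
```

A task with this general shape plugs directly into an interaction loop like the one sketched earlier.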
Conclusion
AgentBench is at the forefront of evaluating AI capabilities in varied environments. With continuous updates and expansions, it offers a robust platform for testing and developing intelligent agents, pushing the boundaries of what AI can achieve in the realm of autonomous decision-making.