AgentTuning: Enabling Generalized Agent Abilities For LLMs
AgentTuning is a project that enhances large language models (LLMs) by instruction-tuning them on interaction trajectories from multiple agent tasks. Its primary aim is to endow models with generalized agent abilities, so that they perform well even on agent tasks they have not been trained on, without compromising their general language proficiency. The open-sourcing of the AgentInstruct dataset and the release of the AgentLM models underscore AgentTuning's commitment to advancing research and collaboration in this field.
Main Result
The project demonstrates significant improvements in overall LLM performance on both tasks seen during tuning (held-in) and unseen tasks (held-out), as highlighted in the evaluation results.
AgentInstruct
AgentInstruct is a comprehensive dataset uniquely crafted to boost AI agent capabilities. It encompasses:
- Chain of Thought (CoT): Trajectories follow the ReAct format, recording a detailed thought (rationale) alongside each action, which makes the model's decision-making process explicit.
- Diverse Tasks: The dataset covers six varied real-world scenarios, including daily household chores and database management, with interaction lengths varying between 5 and 35 turns.
- High Precision: Trajectories are generated by GPT-4 and strictly filtered by task reward, so only high-quality interactions are included.
- Assured Quality: Rigorous data quality checks are conducted to avoid any leakage, maintaining the dataset's integrity.
AgentInstruct can be accessed on the Huggingface platform.
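To get a feel for the data, the dataset can be loaded with the Huggingface `datasets` library. The following is a minimal sketch, assuming the dataset is published under the `THUDM/AgentInstruct` repository id and that each split stores trajectories as a list of conversation turns (check the dataset card for the exact repository id and schema):

```python
from datasets import load_dataset

# Load the AgentInstruct trajectories (repository id assumed to be "THUDM/AgentInstruct").
dataset = load_dataset("THUDM/AgentInstruct")

# Each split corresponds to one agent task; list the available splits.
print(list(dataset.keys()))

# Inspect the first trajectory of the first split.
# Field names below ("conversations") are illustrative; consult the dataset card.
first_split = list(dataset.keys())[0]
example = dataset[first_split][0]
for turn in example["conversations"]:
    print(turn)
```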
AgentLM
AgentLM models are produced by fine-tuning the Llama-2-chat series on a mixture of the AgentInstruct dataset and the ShareGPT dataset. The models keep Llama-2-chat's conversational format, with the system prompt: "You are a helpful, respectful and honest assistant." They are released in 7B, 13B, and 70B parameter sizes and can be found on Huggingface.
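Because AgentLM keeps the Llama-2-chat conversation format, a single-turn prompt can be assembled as in the sketch below. The system prompt is the one quoted above; the tag layout is the standard Llama-2-chat template, and the tokenizer normally prepends the BOS token:

```python
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

def build_llama2_chat_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap a single user turn in the Llama-2-chat template used by AgentLM."""
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_llama2_chat_prompt("List the files in the current directory.")
print(prompt)
```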
Running AgentLM
AgentLM uses Text-Generation-Inference (TGI) to accelerate evaluation. Users can deploy an AgentLM-70B instance with Docker; detailed instructions cover starting the instance and sending a request, and the setup can be scaled to multiple inference instances if required.
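Once a Text-Generation-Inference container is running, requests can be sent to its HTTP API. The sketch below assumes the instance is exposed on localhost port 8080; the actual host and port depend on the Docker port mapping used when starting the container:

```python
import requests

# Endpoint of the running TGI instance (host and port are assumptions; adjust to your setup).
TGI_URL = "http://localhost:8080/generate"

payload = {
    # The prompt should already be wrapped in the Llama-2-chat format shown earlier.
    "inputs": (
        "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant."
        "\n<</SYS>>\n\nWho are you? [/INST]"
    ),
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```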
Evaluation
A detailed evaluation has been carried out, involving six held-in tasks and an additional six held-out tasks compiled from various sources.
- Held-in Tasks: Evaluated using tasks from the AgentBench framework.
- Held-out Tasks: Compiled from SciWorld, MiniWoB++, HotpotQA, ReWOO, WebArena, and a digital card game, allowing extensive testing of the model's adaptability to unseen agent tasks.
General Tasks
AgentTuning also includes setup protocols for general tasks such as MMLU, GSM8k, and MT-Bench. Each setup involves downloading data, running evaluation scripts, and processing results, using tools such as FastChat and GPT-4 for judging where needed (e.g., MT-Bench).
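As an illustration of the result-processing step, scoring a task like GSM8k mostly reduces to extracting the final number from a model response and comparing it with the reference answer, whose gold value follows a "####" marker. The snippet below is a minimal sketch, not the project's actual evaluation script:

```python
import re

def extract_last_number(text: str) -> str | None:
    """Return the last number in the text (GSM8k answers are integers or decimals)."""
    numbers = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_correct(model_response: str, reference: str) -> bool:
    """Compare the model's final number against the gold answer after '####'."""
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = extract_last_number(model_response)
    return pred is not None and float(pred) == float(gold)

# Hypothetical example: the model's reasoning ends with the number 42.
print(gsm8k_correct("... so the total is 42 apples.", "He buys 6 * 7 apples. #### 42"))  # True
```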
Citation
The project encourages citation for those who find AgentTuning beneficial to their work. A citation snippet is provided to acknowledge the efforts of the contributing authors.
AgentTuning is a meaningful step toward generalized agent abilities in LLMs, offering a solid framework for future research and practical applications in AI agent interaction.