Introduction to AGIEval
AGIEval is a human-centric benchmark designed to evaluate the general abilities of foundation models on tasks tied to human cognition and problem-solving. It is built from official, public, and high-standard admission and qualification exams intended for human test-takers, including the Chinese College Entrance Exam (Gaokao), the American SAT, law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For full details, see the paper AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.
Tasks and Data
AGIEval's dataset has been updated to version 1.1, which adds new datasets drawn from recent Chinese Gaokao exams in chemistry, biology, and physics and fixes annotation issues present in the earlier release. To keep evaluation unambiguous, every multiple-choice question (MCQ) task now has exactly one correct answer. The English datasets are unchanged from v1.0. In total, v1.1 contains 20 tasks: 18 multiple-choice tasks and 2 cloze tasks.
The data format is standardized across all datasets. Here is an example record:
{
    "passage": null,
    "question": "Set A = {x | x ≥ 1}, B = {x | -1 < x < 2}, then A ∩ B = ($\\quad$)",
    "options": ["(A) {x | x > -1}", "(B) {x | x ≥ 1}", "(C) {x | -1 < x < 1}", "(D) {x | 1 ≤ x < 2}"],
    "label": "D",
    "answer": null
}
In tasks such as Gaokao-Chinese, Gaokao-English, LSAT, and SAT, the passage field contains supporting text for the question. For cloze tasks, the solution is stored in the answer field; for MCQ tasks, it is stored in the label field.
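To make the field layout concrete, here is a minimal Python sketch of reading one record from a JSON-lines data file and selecting the gold answer by task type. The file name is illustrative and this loop is not part of the official tooling.

import json

def gold_answer(record, is_cloze):
    # Cloze tasks store the solution in "answer"; MCQ tasks store it in "label".
    return record["answer"] if is_cloze else record["label"]

# Hypothetical file name; each AGIEval data file is a JSON-lines file, one record per line.
with open("lsat-ar.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
    parts = [record.get("passage") or "", record["question"]] + (record.get("options") or [])
    prompt = "\n".join(p for p in parts if p)
    print(prompt)
    print("Gold:", gold_answer(record, is_cloze=False))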
Baseline Systems and Evaluation
AGIEval v1.1 has been evaluated with strong baseline systems such as GPT-3.5-Turbo and GPT-4o, and the evaluation results are summarized in the accompanying figures. To reproduce them, users follow a few straightforward steps, including updating the openai Python package and running the provided prediction scripts.
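As a rough illustration only (the repository ships its own prediction scripts, which may differ), the sketch below queries a chat model for a single MCQ item using the openai Python package; the prompt wording and model name are assumptions, not the official setup.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_choice(question, options, model="gpt-4o"):
    # Build a simple prompt from the question and its options, then ask for one letter.
    prompt = question + "\n" + "\n".join(options) + "\nAnswer with a single letter: A, B, C, or D."
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(predict_choice(
    "Set A = {x | x ≥ 1}, B = {x | -1 < x < 2}, then A ∩ B = ?",
    ["(A) {x | x > -1}", "(B) {x | x ≥ 1}", "(C) {x | -1 < x < 1}", "(D) {x | 1 ≤ x < 2}"],
))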
The evaluation results determine the leaderboard standings, which are reported for two subsets, AGIEval-en and AGIEval-zh, both covering the MCQ tasks; a minimal scoring sketch follows the results below.
AGIEval-en few-shot Performance
- Top model: GPT-4o, with an average score of 71.4.
- Other notable models: Llama 3 400B+ and GPT-3.5-Turbo, with lower average scores.
AGIEval-zh few-shot Performance
- Top model: GPT-4o again scores highest, with an average of 71.9.
- Runner-up: GPT-3.5-Turbo, with an average score of 49.5.
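For clarity, here is a small sketch of how such a leaderboard average can be computed: accuracy per MCQ task, then the mean across tasks. The predictions and gold labels below are placeholders, not actual benchmark outputs.

def accuracy(preds, golds):
    # Fraction of items where the predicted option letter matches the gold label.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Placeholder per-task results, for illustration only.
task_results = {
    "sat-math": accuracy(["A", "C", "B"], ["A", "B", "B"]),
    "lsat-lr":  accuracy(["D", "D"], ["D", "C"]),
}
average = sum(task_results.values()) / len(task_results)
print(task_results, f"average = {average:.3f}")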
Contribution and Community Engagement
AGIEval invites contributions and suggestions from the larger community. Before contributing, individuals may need to agree to a Contributor License Agreement (CLA), ensuring they have the rights to contribute. The project adheres to the Microsoft Open Source Code of Conduct, ensuring a respectful and inclusive environment for all contributors.
Trademarks and Usage
The project may involve usage of specific trademarks or logos, and it is essential to comply with Microsoft's Trademark & Brand Guidelines. Unauthorized or misleading usage is strictly discouraged to prevent any confusion.
AGIEval stands as a significant endeavor in benchmarking human-like cognitive proficiency in AI models, paving the way for more intuitive and versatile artificial intelligence solutions.