Introduction to Simple-Evals
The simple-evals project is a lightweight library for evaluating the performance of language models. It is open-sourced with transparency in mind, so that the accuracy numbers published alongside the latest model releases can be inspected and reproduced. By being open source, it aims to foster collaboration and trust within the community around the accuracy claims made about language models.
Benchmark Results
The benchmark results section provides an overview of models and their performance across the evaluation tasks. The models, including variants of the GPT-4 series as well as Claude and Llama models, are compared on tasks such as MMLU (Massive Multitask Language Understanding) and MATH (mathematical problem solving). For instance, the table reports "o1-preview" scoring 92.4 on the HumanEval task. This table lets users compare the relative strengths of each model across the different areas.
Background
The library emphasizes evaluations in a zero-shot, chain-of-thought setting: the model is prompted to reason through each problem step by step without being shown worked examples first. This is seen as more reflective of real-world usage than few-shot or role-playing prompts, and better aligned with models explicitly trained to follow instructions.
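As a concrete illustration, a zero-shot chain-of-thought prompt contains no worked examples; it only adds an instruction to reason step by step. The sketch below shows this prompting style in general terms and is not the library's own template:

```python
# Illustrative only: a generic zero-shot chain-of-thought prompt builder.
# The exact wording used by simple-evals' templates may differ.

def build_zero_shot_cot_prompt(question: str, choices: dict[str, str]) -> str:
    """Format a multiple-choice question with a step-by-step instruction
    and no worked examples (zero-shot)."""
    options = "\n".join(f"({letter}) {text}" for letter, text in choices.items())
    return (
        "Answer the following multiple-choice question.\n\n"
        f"{question}\n{options}\n\n"
        "Think step by step, then state your final answer as a single letter."
    )

if __name__ == "__main__":
    print(build_zero_shot_cot_prompt(
        "Which planet is closest to the Sun?",
        {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
    ))
```

A few-shot prompt would instead prepend several already-solved questions before the one being asked, which is the style this library deliberately avoids.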
Despite its utility, the repository is not actively maintained with new evaluations; updates are limited to essentials such as bug fixes and adapters for newly released models. The intention is not to replace existing, more comprehensive evaluation suites but to complement them by providing transparent baselines for new models.
Evals
The evaluations included in simple-evals cover several key tasks:
- MMLU: Evaluates knowledge via multiple-choice questions spanning a broad range of academic and professional subjects.
- MATH: Tests the ability to solve competition-style mathematics problems.
- GPQA: Assesses the ability to answer difficult, "Google-proof" graduate-level science questions.
- DROP: Tests discrete reasoning over the content of paragraphs (reading comprehension).
- MGSM: Evaluates grade-school math word problem solving across multiple languages.
- HumanEval: Measures the ability to generate Python code that passes unit tests.
These tests are rooted in existing benchmarks, with references and licenses clearly provided to ensure transparency and proper attribution.
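Conceptually, each eval iterates over its benchmark's examples, sends each prompt to a sampler, grades the responses, and aggregates a score. The sketch below illustrates that pattern with hypothetical names; `Sampler`, `EvalResult`, and `run_multiple_choice_eval` are assumptions for illustration, not the library's actual classes or functions:

```python
# Hypothetical sketch of the eval/sampler pattern; names and structure are
# illustrative, not simple-evals' actual API.
from dataclasses import dataclass
from typing import Callable, Protocol


class Sampler(Protocol):
    """Anything that maps a chat-style message list to a model response."""
    def __call__(self, messages: list[dict[str, str]]) -> str: ...


@dataclass
class EvalResult:
    score: float        # fraction of examples graded correct
    num_examples: int


def run_multiple_choice_eval(
    sampler: Sampler,
    examples: list[dict],                  # each: {"prompt": str, "answer": "A".."D"}
    extract_answer: Callable[[str], str],  # pulls the final letter out of the response
) -> EvalResult:
    correct = 0
    for ex in examples:
        response = sampler([{"role": "user", "content": ex["prompt"]}])
        if extract_answer(response) == ex["answer"]:
            correct += 1
    return EvalResult(score=correct / len(examples), num_examples=len(examples))
```

Grading differs by task: multiple-choice evals compare an extracted letter, while a code-generation eval like HumanEval executes the generated program against unit tests.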
Samplers
The library includes sampler interfaces for the APIs of prominent providers such as OpenAI and Anthropic's Claude. These samplers let simple-evals run against different backends and models, provided the corresponding API-key environment variables are set.
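A minimal sampler for the OpenAI API can be written as a thin wrapper around the official openai Python client, which reads the OPENAI_API_KEY environment variable. The class below is a sketch of the sampler idea, not simple-evals' own implementation, and the model name is only an example:

```python
# Minimal sampler sketch wrapping the OpenAI chat completions API.
# This follows the sampler pattern but is not simple-evals' own class.
from openai import OpenAI


class SimpleChatSampler:
    def __init__(self, model: str = "gpt-4o", system_message: str | None = None):
        # The client picks up OPENAI_API_KEY from the environment.
        self.client = OpenAI()
        self.model = model
        self.system_message = system_message

    def __call__(self, messages: list[dict[str, str]]) -> str:
        if self.system_message:
            messages = [{"role": "system", "content": self.system_message}] + messages
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content or ""


if __name__ == "__main__":
    sampler = SimpleChatSampler()
    print(sampler([{"role": "user", "content": "Say hello in one word."}]))
```

A Claude sampler follows the same shape using the anthropic client and the ANTHROPIC_API_KEY environment variable; either kind of sampler can then be handed to an eval, as in the `run_multiple_choice_eval` sketch above.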
Setup
Instead of a one-size-fits-all installation, simple-evals provides separate setup instructions for individual evaluations and samplers. This modular approach lets users configure only what they need: for example, the HumanEval evaluation and the OpenAI and Anthropic API clients are each installed with their own pip commands.
Demo
Simple-evals includes a straightforward demo command that quickly launches evaluations through the OpenAI API. This gives users an easy starting point for seeing the library in action without working through the full setup.
Conclusion
Simple-evals serves as a transparent and straightforward tool for evaluating the performance of various language models. Its emphasis on realistic evaluation settings, combined with the open-source ethos, makes it a valuable resource for developers and researchers aiming to understand and improve the accuracy of language models.