Introduction to OpenAI Evals
OpenAI Evals is a framework for evaluating large language models (LLMs) and systems built on top of them. It includes a registry of existing evals for testing different dimensions of OpenAI models, and it lets users write custom evals for their own use cases, including private evals built from their own data that never has to be made public.
Writing high-quality evals is one of the most impactful things developers building with LLMs can do: good evals show how different model versions affect a particular use case and save considerable time and effort in the long run. Greg Brockman, President of OpenAI, has emphasized how central evals are to any project built on LLMs.
Setup and Requirements
Running OpenAI Evals requires an OpenAI API key, which should be set as the OPENAI_API_KEY environment variable following the instructions on the OpenAI platform. Note that running evals calls the API and therefore incurs usage costs. Evals can also be run and built with Weights & Biases for users who prefer that workflow.
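A minimal sketch of checking that the key is in place, assuming the openai Python package (v1 or later) is installed and the key is exported as OPENAI_API_KEY; the model name used here is only an example and the verification request is billed like any other API call.

```python
import os

from openai import OpenAI  # assumes the openai package, v1 or later

# The evals framework reads the key from this environment variable;
# fail early with a clear message if it is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running any evals.")

# One small request to confirm the key works (this call incurs normal API costs).
client = OpenAI()  # picks up OPENAI_API_KEY automatically
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; substitute whichever model you plan to evaluate
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```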
The framework requires Python 3.9 or newer. The evals registry is stored using Git-LFS; after installing Git-LFS, users can fetch and pull the registry to download the evaluation data they need locally.
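The Git-LFS steps themselves are plain shell commands from the repository README; the sketch below simply wraps the same steps in Python and assumes git and Git-LFS are installed and that it is run from inside a clone of the repository (see the next section).

```python
# Sketch: download the evals registry data stored in Git-LFS.
# Assumes git and git-lfs are installed and the working directory is a clone
# of the openai/evals repository.
import subprocess

subprocess.run(["git", "lfs", "fetch", "--all"], check=True)  # download all LFS objects
subprocess.run(["git", "lfs", "pull"], check=True)            # materialize them in the working tree
```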
Creating and Running Evals
Users who want to create evals should clone the repository from GitHub and install it as an editable package from the command line, so that any changes they make to an eval take effect immediately without reinstalling. Optional formatter dependencies and pre-commit hooks can also be installed to automate formatting tasks on every commit.
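A rough sketch of that contributor setup, again wrapping the shell steps in Python; it assumes git is on the PATH, and the "formatters" extras name used for the optional tooling is an assumption about the repository's packaging.

```python
import subprocess
import sys

# Clone the repository and install it in editable mode so local changes to an
# eval take effect immediately, without reinstalling.
subprocess.run(["git", "clone", "https://github.com/openai/evals.git"], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."], cwd="evals", check=True)

# Optional: formatter extras plus pre-commit hooks to automate formatting on
# every commit (the "formatters" extras name is an assumption).
subprocess.run([sys.executable, "-m", "pip", "install", "-e", ".[formatters]"], cwd="evals", check=True)
subprocess.run(["pre-commit", "install"], cwd="evals", check=True)
```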
Users who only want to run existing evals, without contributing new ones, can install the evals package directly from pip. The documentation gives detailed instructions, along with templates and completion-function protocols for more advanced setups such as prompt chains or tool-using agents. Optionally, eval results can be logged to a Snowflake database if the user sets one up.
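Once the package is installed (pip install evals), an existing eval is run with the oaieval command-line tool, giving a completion function (typically a model name) and the name of an eval from the registry. A minimal sketch wrapping the CLI in Python; the model and eval names below are just examples.

```python
# Sketch: run one of the registry's evals via the oaieval CLI that ships with
# the evals package. "gpt-3.5-turbo" is used as the completion function (a model
# name) and "test-match" as an example eval name from the registry.
import subprocess

subprocess.run(["oaieval", "gpt-3.5-turbo", "test-match"], check=True)
```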
Writing Custom Evals
Beginners are encouraged to start with the documentation that walks through building an eval; examples of custom eval logic and custom completion functions are available as starting points. Note, however, that evals containing custom code are not currently accepted as public contributions; instead, model-graded evals can be submitted by supplying a custom YAML file.
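To illustrate the completion-function idea mentioned above, the sketch below shows its rough shape: a callable that takes a prompt and returns an object exposing the generated completions. It is written as a standalone, duck-typed toy rather than against evals' own base classes, so treat the exact interface as an assumption and defer to the completion-function documentation.

```python
# Duck-typed sketch of a custom completion function: a callable that accepts a
# prompt and returns a result object whose get_completions() yields the text.
# This toy version just echoes back the last user message.
from typing import Union


class EchoResult:
    def __init__(self, text: str):
        self._text = text

    def get_completions(self) -> list[str]:
        # The framework is described as reading completions from a method like
        # this one (interface assumed from the completion-function docs).
        return [self._text]


class EchoCompletionFn:
    def __call__(self, prompt: Union[str, list[dict]], **kwargs) -> EchoResult:
        if isinstance(prompt, str):
            return EchoResult(prompt)
        # Chat-style prompts arrive as a list of {"role": ..., "content": ...} dicts.
        last_user = next(
            (m["content"] for m in reversed(prompt) if m["role"] == "user"),
            prompt[-1]["content"],
        )
        return EchoResult(last_user)


if __name__ == "__main__":
    fn = EchoCompletionFn()
    print(fn([{"role": "user", "content": "ping"}]).get_completions())  # -> ['ping']
```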
Users who believe they have built an interesting eval can submit it as a pull request. OpenAI reviews submitted evals and may use them to help improve future models.
Frequently Asked Questions (FAQ)
The FAQ section walks through examples of building an eval from start to finish and points to evals implemented in several different ways. It also covers common issues, such as an eval appearing to hang at the end of a run.
Users who are comfortable with prompt engineering but prefer not to code can build evals from the existing templates without writing any code: they provide their data as JSON samples (one per line in a JSONL file) and define the eval's parameters in a YAML file. The documentation and the accompanying Jupyter notebooks walk through this process.
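To make the no-code path concrete, here is a hedged sketch that writes a tiny samples file in the JSON-lines shape described above and a matching registry YAML entry. The eval name, file paths, and the basic exact-match class path are illustrative assumptions based on the registry layout described in the docs; consult the build-eval documentation for the authoritative details.

```python
# Sketch: create the two files a no-code eval needs, a JSONL samples file and a
# YAML registry entry. Names, paths, and the class path below are illustrative
# assumptions rather than a definitive layout.
import json
from pathlib import Path

samples = [
    {"input": [{"role": "system", "content": "Answer concisely."},
               {"role": "user", "content": "What is 2 + 2?"}],
     "ideal": "4"},
    {"input": [{"role": "system", "content": "Answer concisely."},
               {"role": "user", "content": "Capital of France?"}],
     "ideal": "Paris"},
]

data_dir = Path("registry/data/arithmetic_demo")
data_dir.mkdir(parents=True, exist_ok=True)
with open(data_dir / "samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")  # one JSON object per line

# Minimal registry entry pointing a basic exact-match template at the samples
# (eval and file names are made up for this example).
registry_yaml = """\
arithmetic-demo:
  id: arithmetic-demo.dev.v0
  description: Toy exact-match eval (illustrative).
  metrics: [accuracy]
arithmetic-demo.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic_demo/samples.jsonl
"""
eval_dir = Path("registry/evals")
eval_dir.mkdir(parents=True, exist_ok=True)
(eval_dir / "arithmetic_demo.yaml").write_text(registry_yaml)
```

Once files like these are placed under the repository's registry directories, the new eval can be run with oaieval in the same way as the built-in ones.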
Disclaimer
Contributions to the OpenAI Evals project are made under the MIT license, and contributors must ensure they have adequate rights to any data they upload. OpenAI reserves the right to use this data to improve its services in the future, and all contributions must comply with OpenAI's Usage Policies.