OLMo-Eval: A Comprehensive Guide
OLMo-Eval is a repository for evaluating open language models. It offers a structured way to measure how different models perform on natural language processing (NLP) tasks, so that researchers can gather consistent, comparable results.
Overview
The `olmo_eval` framework provides an evaluation pipeline that allows users to run various language models through a series of NLP tasks. The system is highly extensible, meaning it can be easily adapted and expanded to include different tasks and models. The framework includes predefined `task_sets` and configuration examples, simplifying the setup process.
One of the key features of OLMo-Eval is its ability to evaluate *m* models across *t* task_sets, where each task set may comprise several individual tasks; this makes it possible to compute aggregate metrics for a comprehensive performance review. Users also have the option to report results to a Google Sheet, making findings easy to share. The pipeline is built on robust libraries such as `ai2-tango` and `ai2-catwalk`, ensuring a reliable and flexible evaluation process.
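To make the m-by-t structure concrete, here is a minimal sketch of how per-task scores could be aggregated per model and task set. The data layout, the task set and task names other than `gen_tasks` and `drop`, and all numeric values are hypothetical placeholders, not outputs of the pipeline.

```python
# Hypothetical per-task scores keyed by (model, task_set); the values and the
# "my_task_set"/"task_a"/"task_b" names are placeholders for illustration only.
per_task_scores = {
    ("EleutherAI/pythia-1b", "gen_tasks"): {"drop": 0.25, "task_a": 0.40},
    ("EleutherAI/pythia-1b", "my_task_set"): {"task_b": 0.55},
}

# Aggregate each (model, task_set) pair into a single mean score.
aggregates = {
    key: sum(scores.values()) / len(scores)
    for key, scores in per_task_scores.items()
}

for (model, task_set), mean_score in aggregates.items():
    print(f"{model} | {task_set}: mean={mean_score:.3f}")
```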
Installation
To get started with OLMo-Eval, clone the repository and set up the environment with the following commands:
conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
git clone https://github.com/allenai/OLMo-Eval.git
cd OLMo-Eval
pip install -e .
This setup will prepare your system to use OLMo-Eval effectively.
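As a quick sanity check, you can confirm that the package is importable from the new environment. This assumes the editable install exposes the `olmo_eval` module mentioned above:

```python
# Confirms the editable install is visible from the active conda environment.
import olmo_eval

print("olmo_eval imported from:", olmo_eval.__file__)
```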
Quickstart
Current task configurations can be found in the [configs/task_sets](configs/task_sets) directory. For an introductory example, you can run `gen_tasks` on the model `EleutherAI/pythia-1b` using an example configuration file located at [configs/example_config.jsonnet](configs/example_config.jsonnet).
Execute the configuration with:
tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
This command will process all tasks defined in the configuration and store the results in a local `tango` workspace named `my-eval-workspace`. You can then add new models or tasks to the config, and the system will efficiently reuse existing outputs, only generating new ones as necessary.
Loading Pipeline Output
To access the results of your evaluations, use the following Python snippet:
from tango import Workspace
workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")
For detailed per-task results:
result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
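If you want to inspect or export the combined results outside of tango, the sketch below shows one possible approach. It assumes the step result can be coerced into a table of per-task metric records, which may not match your configuration exactly; pandas is not a stated dependency of the pipeline and is used here purely for illustration.

```python
import pandas as pd
from tango import Workspace

workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")

# Assumption: `result` is (or can be coerced into) a collection of per-task
# metric records. Adjust this to the actual structure produced by your config.
df = pd.DataFrame(result)
print(df.head())

# Save a copy outside the workspace for sharing or further analysis.
df.to_csv("combined_outputs.csv", index=False)
```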
Evaluating Common Models on Standard Benchmarks
OLMo-Eval allows the assessment of well-known models like `falcon-7b`, `mpt-7b`, `llama2-7b`, and `llama2-13b` against standard benchmarks. Configuration files for these evaluations are available, and the process can be started by running:
tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace
PALOMA
The OLMo-Eval repository was used to run the evaluations for the PALOMA paper. Detailed instructions for reproducing these evaluations are available in the `paloma/README.md` file.
Advanced Features
For users seeking sophisticated features, OLMo-Eval offers:
- Saving outputs to a Google Sheet for easy sharing.
- Using a remote workspace for distributed evaluations (see the sketch at the end of this section).
- Running evaluations without Tango, useful for debugging.
- Deploying tasks on Beaker, a cloud-based research platform.
These advanced capabilities make OLMo-Eval a versatile and powerful tool for researchers and developers working with language models.
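For example, reading results from a remote workspace looks much like the local case shown earlier. This is a hedged sketch: the `beaker://` URL scheme and the `my-org/my-eval-workspace` name are assumptions, and they require the corresponding tango integration to be installed.

```python
from tango import Workspace

# Assumed remote workspace URL; the scheme and workspace name are placeholders
# and depend on which tango integrations (e.g. Beaker, GCS) you have installed.
remote_workspace = Workspace.from_url("beaker://my-org/my-eval-workspace")

# Step results are read the same way as from a local workspace.
result = remote_workspace.step_result("combine-all-outputs")
print(result)
```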