OLMo-Eval: A Comprehensive Guide
OLMo-Eval is a repository for evaluating open language models. It offers a structured way to measure how different models perform on natural language processing (NLP) tasks, so that researchers can gather consistent, comparable results.
Overview
The `olmo_eval` framework provides an evaluation pipeline that allows users to run various language models through a series of NLP tasks. The system is highly extensible, meaning it can be easily adapted and expanded to include different tasks and models. The framework includes predefined `task_sets` and configuration examples, simplifying the setup process.
One of the key features of OLMo-Eval is its ability to evaluate *m* models across *t* task_sets, where each task set may comprise several individual tasks; this makes it possible to compute aggregate metrics for a comprehensive performance review. Users also have the option to report results to a Google Sheet, making findings easy to share. The pipeline is built on robust libraries such as `ai2-tango` and `ai2-catwalk`, ensuring a reliable and flexible evaluation process.
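To make the m-by-t structure concrete, here is a minimal sketch of how per-task scores could be aggregated per model and task set. The data layout, the task set and task names other than `gen_tasks` and `drop`, and all numeric values are hypothetical placeholders, not outputs of the pipeline.

```python
# Hypothetical per-task scores keyed by (model, task_set); the values and the
# "my_task_set"/"task_a"/"task_b" names are placeholders for illustration only.
per_task_scores = {
    ("EleutherAI/pythia-1b", "gen_tasks"): {"drop": 0.25, "task_a": 0.40},
    ("EleutherAI/pythia-1b", "my_task_set"): {"task_b": 0.55},
}

# Aggregate each (model, task_set) pair into a single mean score.
aggregates = {
    key: sum(scores.values()) / len(scores)
    for key, scores in per_task_scores.items()
}

for (model, task_set), mean_score in aggregates.items():
    print(f"{model} | {task_set}: mean={mean_score:.3f}")
```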
Installation
To get started with OLMo-Eval, clone the repository and set up the environment with the following commands:
conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
git clone https://github.com/allenai/OLMo-Eval.git
cd OLMo-Eval
pip install -e .
This setup will prepare your system to use OLMo-Eval effectively.
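As a quick sanity check, you can confirm that the package is importable from the new environment. This assumes the editable install exposes the `olmo_eval` module mentioned above:

```python
# Confirms the editable install is visible from the active conda environment.
import olmo_eval

print("olmo_eval imported from:", olmo_eval.__file__)
```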
Quickstart
Current task configurations can be found in the [configs/task_sets](configs/task_sets) directory. For an introductory example, you can run `gen_tasks` on the model `EleutherAI/pythia-1b` using an example configuration file located at [configs/example_config.jsonnet](configs/example_config.jsonnet).
Execute the configuration with:
tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
This command will process all tasks defined in the configuration and store the results in a local `tango` workspace named `my-eval-workspace`. You can then add new models or tasks to the config, and the system will efficiently reuse existing outputs, only generating new ones as necessary.
Loading Pipeline Output
To access the results of your evaluations, use the following Python snippet:
from tango import Workspace
workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")
For detailed per-task results:
result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
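If you want to inspect or export the combined results outside of tango, the sketch below shows one possible approach. It assumes the step result can be coerced into a table of per-task metric records, which may not match your configuration exactly; pandas is not a stated dependency of the pipeline and is used here purely for illustration.

```python
import pandas as pd
from tango import Workspace

workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")

# Assumption: `result` is (or can be coerced into) a collection of per-task
# metric records. Adjust this to the actual structure produced by your config.
df = pd.DataFrame(result)
print(df.head())

# Save a copy outside the workspace for sharing or further analysis.
df.to_csv("combined_outputs.csv", index=False)
```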
Evaluating Common Models on Standard Benchmarks
OLMo-Eval allows the assessment of well-known models like `falcon-7b`, `mpt-7b`, `llama2-7b`, and `llama2-13b` against standard benchmarks. Configuration files for these evaluations are available, and the process can be started by running:
tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace
PALOMA
The OLMo-Eval repository was used to run the evaluations for the PALOMA paper. Detailed instructions for reproducing these evaluations are available in the `paloma/README.md` file.
Advanced Features
For users seeking sophisticated features, OLMo-Eval offers:
- Saving outputs to a Google Sheet for easy sharing.
- Using a remote workspace for distributed evaluations (see the sketch at the end of this section).
- Running evaluations without Tango, useful for debugging.
- Deploying tasks on Beaker, a cloud-based research platform.
These advanced capabilities make OLMo-Eval a versatile and powerful tool for researchers and developers working with language models.
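For example, reading results from a remote workspace looks much like the local case shown earlier. This is a hedged sketch: the `beaker://` URL scheme and the `my-org/my-eval-workspace` name are assumptions, and they require the corresponding tango integration to be installed.

```python
from tango import Workspace

# Assumed remote workspace URL; the scheme and workspace name are placeholders
# and depend on which tango integrations (e.g. Beaker, GCS) you have installed.
remote_workspace = Workspace.from_url("beaker://my-org/my-eval-workspace")

# Step results are read the same way as from a local workspace.
result = remote_workspace.step_result("combine-all-outputs")
print(result)
```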