Language Model Evaluation Harness
Introduction
The Language Model Evaluation Harness is a versatile framework for testing generative language models across a wide array of evaluation tasks. It provides a standardized way to measure and compare the performance of language models.
Latest Updates
Multimodal Input Support
In September 2024, the project began prototyping new capabilities that allow users to create and evaluate tasks with text-and-image multimodal inputs and text outputs. This feature adds model types such as hf-multimodal and vllm-vlm, along with a prototype task named mmmu. The community is encouraged to test these new functionalities and provide feedback.
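As an illustrative sketch (the exact task name and the checkpoint are assumptions; verify them with lm-eval --tasks list and the project documentation), a multimodal evaluation might be launched as:
# llava-hf/llava-1.5-7b-hf is only an example checkpoint for the hf-multimodal model type
lm_eval --model hf-multimodal \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf \
    --tasks mmmu \
    --device cuda:0 \
    --batch_size 8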
API Model Support Enhancements
In July 2024, API model support was significantly reworked, adding batched and asynchronous requests and simplifying customization for diverse applications. For example, users who want to evaluate Llama 405B can host the model behind vLLM's OpenAI-compatible API and use the local-completions model type for the evaluation.
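As a hedged sketch of that workflow (the host URL, model name, and concurrency settings below are placeholders, not a definitive recipe), a model served through vLLM's OpenAI-compatible endpoint can be evaluated like this:
# Assumes a vLLM server is already running and exposing an OpenAI-compatible completions endpoint
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3.1-405B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=8,max_retries=3 \
    --tasks gsm8k \
    --batch_size 16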
New Open LLM Leaderboard Tasks
The tasks used by the new Open LLM Leaderboard have been added to the framework, broadening the set of benchmarks available for evaluating large language models (LLMs).
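A minimal sketch, assuming the leaderboard tasks are exposed as a task group named leaderboard (confirm the exact task names with lm-eval --tasks list):
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks leaderboard \
    --batch_size 8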
Key Features
- Extensive Benchmarking: Supports over 60 academic benchmarks with numerous subtasks, enabling comprehensive model evaluation.
- Versatile Model Support: Works seamlessly with models loaded via well-known libraries like transformers, GPT-NeoX, and Megatron-DeepSpeed, and supports commercial APIs from providers like OpenAI and TextSynth.
- Multimodal Evaluation Support: As part of its development roadmap, the framework now includes prototype support for text-and-image multimodal input evaluations.
- Custom Prompt and Metric Support: Users can easily introduce custom prompts and evaluation metrics, enhancing the adaptability of the framework to fit diverse research needs.
- Efficient Inference: Offers fast, memory-efficient inference, especially when using vLLM, which helps when evaluating large-scale models (see the example command after this list).
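For example, a sketch of selecting the vLLM backend (assuming the vLLM dependencies are installed; the model name and memory setting are placeholders):
lm_eval --model vllm \
    --model_args pretrained=EleutherAI/gpt-j-6B,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks hellaswag \
    --batch_size auto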
Installation
To install lm-evaluation-harness from the GitHub repository, follow these commands:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Optional extras provide additional functionality and can be installed as needed.
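For example (the extras names below are assumptions to verify against the project's pyproject.toml), backend-specific dependencies can be installed as extras:
pip install -e ".[vllm]"   # assumed extra for the vLLM backend
pip install -e ".[api]"    # assumed extra for API-based model types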
Basic Usage
The framework provides thorough documentation detailing the full list of supported arguments and usage instructions. Users can start by listing the available tasks with lm-eval --tasks list, or run specific evaluations with models loaded via Hugging Face's transformers, NVIDIA's NeMo framework, or other supported interfaces.
Example Command
Here is an example of evaluating a model on the hellaswag task:
lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8
This command assumes a CUDA-compatible GPU and loads the pretrained model from the Hugging Face Hub.
Advanced Features
Multi-GPU and API Support
The framework is designed to utilize multiple GPUs for parallel evaluation, leveraging libraries like Hugging Face's accelerate. This is critical for handling large-scale models that require partitioned computation across multiple hardware units.
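As a minimal sketch (assuming accelerate is installed), data-parallel evaluation can be launched by wrapping the usual command with accelerate launch, while models too large for a single GPU can be sharded with the hf backend's parallelize=True argument:
# Data-parallel: one full model replica per GPU (assumes the model fits on a single GPU)
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --batch_size 8
# Model-parallel: shard one copy of the model across all visible GPUs
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,parallelize=True \
    --tasks hellaswag \
    --batch_size 8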
Inference Servers and APIs
It also supports integration with various inference APIs and servers, offering flexible evaluation settings for both locally hosted and remote model interfaces.
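As one hedged example (the model name is illustrative and a valid API key is required), a hosted OpenAI model can be evaluated with the openai-completions model type:
export OPENAI_API_KEY=YOUR_KEY_HERE
lm_eval --model openai-completions \
    --model_args model=davinci-002 \
    --tasks lambada_openai,hellaswag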
Conclusion
The Language Model Evaluation Harness stands as a robust and flexible tool for conducting rigorous evaluations of language models. Its ongoing development and growing community support make it an indispensable resource for researchers and developers assessing and improving models.