InstructEval: A Comprehensive Evaluation of Instruction-Focused Language Models
InstructEval is a project that provides a thorough evaluation of instruction-tuned large language models (LLMs). Its goal is a framework that makes it easy to benchmark a variety of models across different tasks. By offering insights into model performance, it helps developers and researchers build more efficient and cost-effective LLMs and make better-informed choices among them.
Objectives and Importance
The core objective of InstructEval is to evaluate instruction-tuned models like Flan-T5 and Alpaca, which aim to replicate the performance of more powerful LLMs, such as ChatGPT, at a substantially lower cost. Instruction-tuned models are designed to follow given instructions more faithfully, making them crucial for specialized applications. However, assessing these models consistently across numerous tasks remains challenging. InstructEval addresses this by enabling simple comparisons using academic benchmarks like MMLU (Massive Multitask Language Understanding) and BBH (BIG-Bench Hard).
Key Features
InstructEval supports a wide array of models available through HuggingFace's Transformers library, including GPT-2, GPT-J, LLaMA, and many more, giving broad flexibility for evaluation. The supported model classes are listed below, with a short loading sketch after the list.
- AutoModelForCausalLM: Supports models like GPT-2, GPT-J, OPT-IML.
- AutoModelForSeq2SeqLM: Includes models such as Flan-T5, Flan-UL2.
- LlamaForCausalLM: Features models like LLaMA, Alpaca, Vicuna.
- ChatGLM: Also supported for evaluation.
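For illustration, here is a minimal sketch, not taken from InstructEval's own code, of how one of these Transformers classes loads a checkpoint and answers a prompt. The Flan-T5 checkpoint name and the question are placeholder examples:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Example checkpoint only; any Seq2Seq model from the list above loads the same way.
model_path = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Pose a multiple-choice style question, similar in spirit to the benchmarks above.
prompt = "Answer with A, B, C, or D: Which planet is the largest? A) Mars B) Jupiter C) Venus D) Mercury"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Causal models such as GPT-2 or LLaMA are loaded analogously through AutoModelForCausalLM or LlamaForCausalLM.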
Evaluation and Results
The InstructEval platform provides an up-to-date leaderboard showcasing model performances on a variety of benchmarks, including MMLU, BBH, DROP, and HumanEval. For instance, GPT-4 achieves a high score of 86.4 on MMLU, consistently outperforming many other models.
Here’s a look at some models and their performance:
- GPT-4: 86.4 on MMLU and 80.9 on DROP.
- ChatGPT: 70.0 on MMLU.
- Flan-T5-XXL: 54.5 on MMLU and 43.9 on BBH.
How to Use InstructEval
Evaluating a chosen model using InstructEval is straightforward. Users can execute a simple command to test models on datasets like MMLU or BBH. For example, evaluating the Alpaca-native model on MMLU provides an easy-to-interpret score that reflects its effectiveness.
python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
This command assesses the Alpaca-native model's performance on the MMLU task, yielding a quantitative score that helps characterize the model's strengths and weaknesses.
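Other benchmarks can be run in the same way. For example, assuming the BBH task is exposed under the name used in the repository, the following command evaluates the same model on BBH:

python main.py bbh --model_name llama --model_path chavinlo/alpaca-native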
Setup Instructions
To get started with InstructEval, users need to install the necessary dependencies and download the relevant datasets. The following steps set up the environment:
- Create and activate a new conda environment:
conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
- Install required packages:
pip install -r requirements.txt
- Set up the data directory and download the MMLU dataset:
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
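As a quick sanity check, and assuming the archive extracts into the usual dev/val/test splits of the Hendrycks MMLU release, listing the directory should show those split folders:

ls data/mmlu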
Conclusion
InstructEval serves as a comprehensive and user-friendly tool for evaluating instruction-tuned LLMs. By simplifying the benchmarking process and providing versatile evaluations across a range of models, it equips researchers with the knowledge to improve LLMs and make informed decisions regarding model selection and optimization. This project holds the potential to advance the development of intuitive and effective language models that adhere more accurately to user instructions.