Introduction to ML-Bench: Evaluating Language Models and Agents
ML-Bench is a comprehensive platform for evaluating large language models and agents on repository-level code for machine learning tasks. Its focus is on measuring how effectively these models operate in real-world coding environments, and the project provides a framework for testing, fine-tuning, and benchmarking models across a range of tasks.
Prerequisites
To get started with ML-Bench, clone the repository along with its submodules by passing the --recurse-submodules flag to Git; this ensures all the necessary components are included. After cloning, install the required dependencies listed in requirements.txt using pip.
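A minimal sketch of these steps is shown below; the repository URL and directory name are placeholders for the ones published by the ML-Bench project.

```bash
# Clone ML-Bench together with its submodules (placeholder URL and directory name).
git clone --recurse-submodules https://github.com/<org>/ML-Bench.git
cd ML-Bench

# Install the Python dependencies listed by the project.
pip install -r requirements.txt
```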
Data Preparation
ML-Bench provides a dataset that can be loaded with Python. Each record carries columns such as the GitHub repository ID, its URL, and the paths to the files needed for the task, along with the task instruction, oracle guidance for completing it, and the expected output. To prepare the data for ML-LLM-Bench, users may need to run a simple shell-script post-processing step, as sketched below.
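As a hedged illustration, the data could be loaded and inspected with the Hugging Face `datasets` library; the dataset identifier, split name, and post-processing script name below are placeholders rather than the exact names used by ML-Bench.

```bash
# Load and inspect the benchmark data via Python (placeholder dataset ID and split name).
python - <<'EOF'
from datasets import load_dataset

ds = load_dataset("<ml-bench-dataset-id>", split="train")  # placeholder ID; split may differ
print(ds.column_names)  # e.g. repository ID, URL, file path, instruction, oracle, output
print(ds[0])            # one full record
EOF

# Hypothetical post-processing step to ready the data for ML-LLM-Bench.
bash scripts/post_process.sh
```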
ML-LLM-Bench: Language Model Evaluation
Environment Setup
To run ML-LLM-Bench, users can work inside a Docker container, which provides an isolated environment and simplifies model execution and testing. Docker commands are provided for pulling and running the required ML-Bench image, and scripts are available to download the necessary model weights, which may take some time given their size.
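The commands below sketch this setup; the image name, tag, and weight-download script are assumptions, not the exact names shipped with ML-Bench.

```bash
# Pull the pre-built evaluation image (placeholder name) and start an interactive container.
docker pull <ml-bench-image>:latest
docker run -it --gpus all -v "$(pwd)":/workspace <ml-bench-image>:latest /bin/bash

# Inside the container, a helper script (hypothetical name) fetches the model weights.
bash scripts/download_weights.sh
```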
Usage and Execution
Users should place their execution results in a designated directory and edit the script parameters so they point to the correct input and log paths. Running the provided shell script then starts the evaluation, after which the results and logs give detailed insight into model performance.
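A hypothetical run might look like the following; the directory layout, variable names, and script name are assumptions.

```bash
# Point the evaluation at the model's generated outputs and a log directory.
RESULTS_DIR=results/my-model     # where the execution results were placed
LOG_DIR=logs/my-model            # where evaluation logs should be written

# Hypothetical evaluation script; edit its paths or pass them as arguments.
bash scripts/run_benchmark.sh "$RESULTS_DIR" "$LOG_DIR"

# Inspect the logs for per-task details afterwards.
ls "$LOG_DIR"
```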
API Calling and Performance Reproduction
The project provides scripts for testing OpenAI models on ML-Bench tasks. This involves setting parameters such as the model type, the input file paths, and the execution settings, and users must supply their own OpenAI API key for the calls to work.
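A hedged sketch of such a call follows; the script name and flags are placeholders, not the exact CLI shipped with the repository.

```bash
# Your own OpenAI key must be available to the script.
export OPENAI_API_KEY="sk-..."

# Hypothetical entry point for querying an OpenAI model on the benchmark tasks.
python scripts/call_openai.py \
  --model gpt-4 \
  --input data/mlbench_test.jsonl \
  --output results/gpt-4.jsonl
```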
Fine-tuning Open Source Models
ML-Bench supports fine-tuning of open-source models through dedicated pipelines and provides scripts to reproduce the performance of models such as CodeLlama on the benchmark tasks. Fine-tuning trains a model on the task descriptions so that it learns to generate the required code snippets.
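As an illustration only, a fine-tuning run could be launched roughly as follows; the script name, model identifier, and arguments are assumptions.

```bash
# Hypothetical fine-tuning invocation for an open-source code model.
bash scripts/finetune.sh \
  --model codellama/CodeLlama-7b-hf \
  --train_file data/train.jsonl \
  --output_dir checkpoints/codellama-mlbench
```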
Inference Process
Inference performance for models can be reproduced using a straightforward script that specifies model names, task details, and input prompt files.
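A sketch of that invocation, with assumed script and argument names:

```bash
# Hypothetical inference run over the benchmark prompts.
bash scripts/inference.sh \
  --model checkpoints/codellama-mlbench \
  --prompt_file data/prompts.jsonl \
  --output results/codellama.jsonl
```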
ML-Agent-Bench: Agent Evaluation
For evaluating agent-based systems, ML-Agent-Bench offers a similar Docker-based setup. This ensures consistent evaluation environments and requires pulling and running pre-configured Docker images. Additional setup guidance is available for those using the OpenDevin system.
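The agent environment could be brought up along these lines; the image name is a placeholder for the pre-configured ML-Agent-Bench image.

```bash
# Pull and start the pre-configured agent-evaluation container (placeholder image name).
docker pull <ml-agent-bench-image>:latest
docker run -it --gpus all <ml-agent-bench-image>:latest /bin/bash

# OpenDevin users should additionally follow the project's OpenDevin setup notes.
```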
Licensing
ML-Bench is shared under the MIT License, providing freedom to use, modify, and distribute the software within its terms. Users should refer to the LICENSE document for full details.
ML-Bench offers a structured, methodical approach to evaluating and improving large language models and agents on real-world machine learning tasks. By combining organized data, Dockerized environments, and comprehensive scripting support, it is a valuable tool for developers and researchers alike.