Introduction to Bench
Bench is a tool for evaluating large language models (LLMs) for practical applications. Whether users need to compare LLMs, assess the effectiveness of different prompts, or experiment with generation parameters such as temperature and token count, Bench provides a single platform that streamlines and standardizes LLM performance evaluation.
Why Use Bench?
Bench is useful in several scenarios for those working with LLMs:
- Standardized Evaluation: It allows for a uniform workflow across different tasks and applications, making the evaluation process more systematic and consistent.
- Open vs. Closed Source: Bench helps determine if open-source LLMs can match or exceed the performance of proprietary LLM APIs on specific datasets.
- Relevant Scoring: It converts rankings from LLM leaderboards into meaningful scores tailored to users' actual use cases.
Community and Support
Bench encourages participation and feedback from its user community. Those interested can join the discussion and connect with other users on Discord. For bug reports and new feature requests, users are encouraged to submit issues on GitHub.
Installation
Installing Bench is straightforward. The recommended installation includes the optional server dependencies, which enable serving results locally. Users can install Bench with the following command:
pip install 'arthur-bench[server]'
For a simpler installation with essential features only, use:
pip install arthur-bench
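To confirm that the installation succeeded, one quick check is to inspect the installed package metadata with pip:
pip show arthur-bench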
More detailed setup instructions are available in the installation guide.
Getting Started with Bench
To get up and running with Bench, users can refer to the quickstart walkthrough and the test suite creation guide provided in the documentation.
Here's a brief code example to illustrate how a user might create and run a test suite with Bench:
from arthur_bench.run.testsuite import TestSuite

# Create a test suite from input prompts and reference (expected) outputs,
# scored by exact string match.
suite = TestSuite(
    "bench_quickstart",
    "exact_match",
    input_text_list=["What year was FDR elected?", "What is the opposite of down?"],
    reference_output_list=["1932", "up"]
)

# Score a set of candidate outputs against the references.
suite.run("quickstart_run", candidate_output_list=["1932", "up is the opposite of down"])
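With the exact_match scorer, a candidate counts as correct only when it is identical to the reference string, so the second candidate above ("up is the opposite of down") would not match the reference "up". As a plain-Python sketch of that comparison logic (illustrative only, not Bench's internal implementation):

# Illustrative sketch, not Bench's internal code: exact match scores a
# candidate 1.0 when it equals the reference string exactly, else 0.0.
def exact_match(candidate: str, reference: str) -> float:
    return 1.0 if candidate == reference else 0.0

print(exact_match("1932", "1932"))                      # 1.0
print(exact_match("up is the opposite of down", "up"))  # 0.0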
Test suites are saved once created, so users can reload them by name and score new candidate outputs without re-specifying the reference data:
# Reload the saved suite by name and scoring method, then score a new run.
existing_suite = TestSuite("bench_quickstart", "exact_match")
existing_suite.run("quickstart_new_run", candidate_output_list=["1936", "up"])
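Because a saved suite can be re-run by name, a common pattern is to score outputs from different models or generation settings against the same references. Below is a minimal sketch of that workflow; the generate() helper is hypothetical and stands in for whatever LLM client the user actually calls:

from arthur_bench.run.testsuite import TestSuite

# Hypothetical stand-in for a real LLM call; swap in your own client.
def generate(prompt: str, temperature: float) -> str:
    return "1932" if "FDR" in prompt else "up"  # canned placeholder answers

questions = ["What year was FDR elected?", "What is the opposite of down?"]
suite = TestSuite("bench_quickstart", "exact_match")  # reload the saved suite

# Score the same suite under two different temperature settings.
for temp in (0.0, 0.7):
    candidates = [generate(q, temperature=temp) for q in questions]
    suite.run(f"quickstart_temp_{temp}", candidate_output_list=candidates)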
To visualize these results in Bench's local UI, run the following command (this requires the optional server dependencies installed earlier):
bench
Running Bench From Source
To run Bench from source, follow these steps:
- Install Dependencies: Run
pip install -e '.[server]'
to set up the necessary components.
- Build the Front End: Navigate to arthur_bench/server/js, then run
npm i
and
npm run build
to compile the front-end resources.
- Launch the Server: Start the application by typing
bench
in the terminal.
This setup allows users to make changes locally; note that the server must be restarted for changes to take effect.
In summary, Bench is a powerful and adaptable tool for LLM evaluation, giving users a unified platform for comparing models, prompts, and generation settings in a consistent, repeatable way.