LLMPerf: A Performance Evaluation Tool for LLM APIs
Introduction
LLMPerf is a tool for benchmarking the performance of large language model (LLM) APIs. It offers two primary types of tests, load tests and correctness tests, which together provide a comprehensive picture of how well an LLM endpoint performs.
Installation
To begin using LLMPerf, clone the repository and install the package with pip:
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .
Basic Usage
LLMPerf implements two main tests:
Load Test
The load test evaluates performance by issuing concurrent requests to the LLM API. It measures how quickly and efficiently the API generates responses to input prompts, which are built from randomly selected lines of Shakespearean sonnets. This makes the test useful for gauging inter-token latency and generation throughput.
Instructions:
- Run the token_benchmark_ray.py script to benchmark the model of your choice.
- Keep in mind that results can vary with backend performance, the time of day, sensitivity to load, and differences between user workloads.
Usage Example
For OpenAI Compatible APIs:
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1"
python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
The same script can be used to test other LLM service providers, such as Anthropic, TogetherAI, Hugging Face, LiteLLM, Vertex AI, and SageMaker, by configuring their respective API keys and environment variables.
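For example, a run against Anthropic might look like the following. This is a hedged sketch: it assumes the Anthropic backend is selected with --llm-api anthropic, that the key is read from ANTHROPIC_API_KEY, and that the model name shown is available to your account.
# Assumed flow: ANTHROPIC_API_KEY + --llm-api anthropic; verify against your installed version.
export ANTHROPIC_API_KEY=secret_abcdefg
python token_benchmark_ray.py \
--model "claude-2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api anthropic \
--additional-sampling-params '{}'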
Correctness Test
The correctness test checks whether the LLM completes a simple task accurately: it asks the model to convert numbers written out as words into their numeric form, then compares the generated output with the expected result.
Usage Example
For OpenAI Compatible APIs:
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1
python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"
Other providers, such as Anthropic and TogetherAI, can be tested in the same way by supplying their respective API credentials.
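For example, a correctness run against Anthropic might look like the following sketch, which assumes llm_correctness.py also accepts an --llm-api flag and reads the key from ANTHROPIC_API_KEY.
# Assumed flags; verify against your installed version of llm_correctness.py.
export ANTHROPIC_API_KEY=secret_abcdefg
python llm_correctness.py \
--model "claude-2" \
--llm-api "anthropic" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"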
Results Handling
The results from both tests are stored in the directory specified by --results-dir. They include summary metrics for the run as well as data for each individual request.
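As a rough illustration, the summary metrics can be inspected with a few lines of Python once a run completes. This sketch assumes the summary is written as a JSON file whose name contains "summary"; check your results directory for the exact file names.
import glob
import json

# Assumption: summary files are named like *summary*.json; adjust the pattern if needed.
for path in glob.glob("result_outputs/*summary*.json"):
    with open(path) as f:
        summary = json.load(f)
    # Print every aggregate metric reported for the run.
    for metric, value in summary.items():
        print(f"{path}: {metric} = {value}")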
Advanced Usage
LLMPerf also supports more advanced usage patterns in which users define custom workflows in Python, for example constructing a set of clients, launching prompt requests against them, and retrieving the per-request performance results.
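A minimal sketch of such a workflow is shown below. It assumes the helper names used in the project's advanced-usage example (construct_clients from llmperf.common, RequestConfig from llmperf.models, and RequestsLauncher from llmperf.requests_launcher); treat the exact signatures as assumptions and verify them against the installed version.
import ray
# NOTE: these imports and signatures are assumed from llmperf's advanced-usage example.
from llmperf.common import construct_clients
from llmperf.models import RequestConfig
from llmperf.requests_launcher import RequestsLauncher

# Pass the API credentials to the Ray workers that issue the requests.
ray.init(runtime_env={"env_vars": {
    "OPENAI_API_KEY": "secret_abcdefg",
    "OPENAI_API_BASE": "https://api.endpoints.anyscale.com/v1",
}})

# Build a pool of clients and a launcher that fans requests out to them.
clients = construct_clients(llm_api="openai", num_clients=1)
launcher = RequestsLauncher(clients)

# Describe a single request; the prompt is a (text, approximate_token_count) pair.
config = RequestConfig(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt=("hello world", 2),
)

launcher.launch_requests(config)
# Block until the request finishes and collect its metrics, generated text, and config.
results = launcher.get_next_ready(block=True)
print(results)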
Custom Client Implementation
Users with specific needs can extend LLMPerf by implementing their own LLM client. This involves creating a new class that implements the LLMClient interface and decorating it as a Ray actor so that requests can be issued asynchronously.
Example:
from llmperf.models import RequestConfig
from llmperf.ray_llm_client import LLMClient
import ray

@ray.remote
class CustomLLMClient(LLMClient):
    def llm_request(self, request_config: RequestConfig):
        # Issue a single request to the LLM API and return per-request metrics,
        # the generated text, and the request_config that was used.
        ...
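To use such a client, instantiate it as a Ray actor and hand it to the request launcher, roughly as follows. This is a sketch that assumes RequestsLauncher accepts any list of LLMClient actors.
from llmperf.requests_launcher import RequestsLauncher

# Assumption: RequestsLauncher works with any list of LLMClient Ray actors.
clients = [CustomLLMClient.remote()]
launcher = RequestsLauncher(clients)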
Legacy Codebase
For those interested, the original LLMPerf codebase is still accessible through the llmperf-legacy repository.
Overall, LLMPerf provides a detailed and flexible platform for evaluating the performance and accuracy of different LLM APIs, allowing users to gain valuable insights into their systems.