Introduction to LLMPerf Leaderboard
The LLMPerf Leaderboard is a performance benchmarking initiative that evaluates various large language model (LLM) inference providers. Using the LLMPerf tool, the project assesses the efficiency, reliability, and performance of these providers through two key metrics: Output Tokens Throughput and Time to First Token (TTFT).
Key Metrics and Their Importance
- Output Tokens Throughput: This metric measures the average number of tokens a provider returns per second. It is crucial for applications demanding high throughput, such as text summarization and language translation. A higher number reflects more efficient output delivery by the LLM inference provider.
- Time to First Token (TTFT): This represents how quickly a provider returns the initial token and is particularly important for streaming applications such as chatbots. The shorter the TTFT, the quicker the response from the LLM, enhancing the user's interactive experience. A sketch of how both metrics can be computed from per-request timings follows this list.
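As a rough illustration of how these two metrics can be derived, the sketch below computes TTFT and output token throughput from per-request timing records. The record fields and helper names are illustrative assumptions, not taken from the LLMPerf codebase.

# Minimal sketch: deriving TTFT and output token throughput from per-request
# timing data. Field and function names are illustrative, not from LLMPerf.
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class RequestTiming:
    start: float          # when the request was sent (seconds, wall clock)
    first_token: float    # when the first output token arrived
    end: float            # when the last output token arrived
    output_tokens: int    # number of tokens the provider returned

def ttft_seconds(r: RequestTiming) -> float:
    # Time to First Token: delay between sending the request and the first token.
    return r.first_token - r.start

def output_throughput(r: RequestTiming) -> float:
    # Output tokens per second over the full request duration.
    return r.output_tokens / (r.end - r.start)

def summarize(requests: list[RequestTiming]) -> dict:
    # Aggregate per-request metrics the way a leaderboard row would.
    ttfts = [ttft_seconds(r) for r in requests]
    throughputs = [output_throughput(r) for r in requests]
    return {
        "median_ttft_s": median(ttfts),
        "mean_ttft_s": mean(ttfts),
        "median_output_tokens_per_s": median(throughputs),
        "mean_output_tokens_per_s": mean(throughputs),
    }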
Displaying Results for Informed Decisions
The results of this benchmarking are presented transparently on the LLMPerf Leaderboard. The goal is to offer developers and users critical insights into each provider's strengths and weaknesses, which aids in making informed decisions about future technology integration or deployments. For full transparency, the leaderboard provides reproducible steps in the Run Configurations section.
Run Configurations
The benchmarks are run using a command template from the LLMPerf repository, as follows:
python token_benchmark_ray.py \
--model <MODEL_NAME> \
--mean-input-tokens 550 \
--stddev-input-tokens 0 \
--mean-output-tokens 150 \
--stddev-output-tokens 0 \
--max-num-completed-requests 150 \
--num-concurrent-requests 5 \
--llm-api <litellm/openai>
Key configurations used in the tests:
- Total number of requests: 150
- Concurrency: 5 (requests made concurrently)
- Prompt token length: 550 tokens (mean)
- Expected output length: 150 tokens (mean)
- Tested models: 7B, 13B, and 70B versions of the Llama 2 chat models
These tests were conducted on an AWS EC2 Instance (i4i.large) located in the us-west-2 (Oregon) region, with the last results published on December 19, 2023.
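As a sketch of how these settings might be applied across the tested model sizes, the driver below invokes the command template once per model via Python's subprocess module. The model identifiers are placeholders (actual names depend on the provider), and the script itself is illustrative rather than part of LLMPerf.

# Illustrative sweep of the benchmark command over the three model sizes.
# Model identifiers are placeholders; substitute the names your provider exposes.
import subprocess

MODELS = ["llama-2-7b-chat", "llama-2-13b-chat", "llama-2-70b-chat"]  # placeholder names

for model in MODELS:
    cmd = [
        "python", "token_benchmark_ray.py",
        "--model", model,
        "--mean-input-tokens", "550",
        "--stddev-input-tokens", "0",
        "--mean-output-tokens", "150",
        "--stddev-output-tokens", "0",
        "--max-num-completed-requests", "150",
        "--num-concurrent-requests", "5",
        "--llm-api", "openai",
    ]
    subprocess.run(cmd, check=True)  # run one benchmark per model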
Caveats and Disclaimers
It’s important to note potential biases and discrepancies:
- Results may be biased by differences in providers' backend systems and may not reflect how the underlying software performs on specific hardware.
- Results can differ based on the time of day.
- The TTFT measurement could be affected by the location of the client, with testing currently done from the us-west (Oregon) region.
- The findings act as a proxy for system capabilities, influenced by system load and provider traffic, and may not directly align with users' specific workloads.
Output Tokens Throughput and Time to First Token Results
The leaderboard presents comprehensive tables and graphs displaying the Output Tokens Throughput and TTFT for models of varying sizes (70B, 13B, and 7B). For each model and provider, key statistics (median, mean, minimum, maximum, and percentile values) are provided, highlighting performance variations across different platforms.
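To make the reported statistics concrete, the snippet below computes the same kind of summary (median, mean, minimum, maximum, and percentiles) from a set of per-request TTFT samples using NumPy. The sample values are invented purely for illustration.

# Summary statistics of the kind shown per model/provider pair, computed with NumPy.
# The TTFT samples (in seconds) are made up for illustration.
import numpy as np

ttft_samples = np.array([0.41, 0.38, 0.52, 0.45, 0.60, 0.39, 0.47])

summary = {
    "min": float(ttft_samples.min()),
    "p25": float(np.percentile(ttft_samples, 25)),
    "median": float(np.median(ttft_samples)),
    "mean": float(ttft_samples.mean()),
    "p75": float(np.percentile(ttft_samples, 75)),
    "p95": float(np.percentile(ttft_samples, 95)),
    "max": float(ttft_samples.max()),
}
print(summary)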
Feedback and Participation
The LLMPerf team encourages feedback and is open to collaboration. LLM inference service providers interested in being featured on the leaderboard can reach out for further discussions on collaboration and setup.
For feedback or further communication, participants are encouraged to use the GitHub issue tracker or contact via email as provided.
By offering this curated performance insight, the LLMPerf Leaderboard enables a deeper understanding and better choices in leveraging LLM providers for various applications.