Introduction to the GPU_Poor Project
The GPU_Poor project is a practical tool that helps users gauge whether their GPU or CPU can run a given Large Language Model (LLM). It estimates how much GPU memory a model requires and the throughput, in tokens per second, the hardware can be expected to deliver.
Key Features and Use Cases
1. Calculate GPU Memory Requirements
The tool offers a detailed breakdown of how GPU memory is used when running LLMs, splitting the total across components such as the KV cache, model weights, activation memory, and other overheads. This makes it easy to see where the memory actually goes and to plan hardware or configuration choices accordingly.
2. Estimate Tokens per Second
The project estimates how many tokens per second a GPU can generate when running an LLM, which is crucial for gauging how fast a model will feel on different hardware setups (a rough back-of-the-envelope version of this kind of estimate is sketched after the feature list below).
3. Determine Finetuning Time
For users interested in finetuning models, the tool approximates the time required per iteration (measured in milliseconds). Knowing this helps users assess the feasibility and duration of training LLMs with specific GPUs.
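For intuition only, a common first-order heuristic treats single-stream decoding as memory-bandwidth bound: every generated token requires streaming the full set of weights once, so throughput is roughly bandwidth divided by model bytes. The sketch below applies that heuristic with hypothetical numbers; it is not GPU_Poor's actual formula, which also accounts for compute, KV cache reads, and other overheads.

def rough_tokens_per_second(param_count_b, bits_per_param, bandwidth_gb_s):
    # Bandwidth-bound upper bound: one full pass over the weights per token.
    model_bytes = param_count_b * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical example: a 7B model in 4-bit on a card with ~1000 GB/s bandwidth
print(round(rough_tokens_per_second(7, 4, 1000)))  # about 286 tokens/s, best case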
Detailed Breakdown of Memory Usage
The tool provides an in-depth view of memory allocation, making it easier to spot which components consume the most resources and where adjustments could improve performance. Here's an example output illustrating the breakdown (values in MB):
{
"Total": 4000,
"KV Cache": 1000,
"Model Size": 2000,
"Activation Memory": 500,
"Grad & Optimizer memory": 0,
"cuda + other overhead": 500
}
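To make the breakdown concrete, the sketch below shows how such numbers could be assembled from a model's configuration. The formulas and constants (fp16 KV cache, a flat overhead, a crude activation term) are illustrative assumptions, not the project's exact accounting.

def estimate_memory_mb(params_b, bits_per_param, n_layers, hidden_size,
                       context_len, batch_size=1, overhead_mb=500):
    # Model weights: parameters * bytes per parameter
    model_mb = params_b * 1e9 * bits_per_param / 8 / 1e6
    # KV cache: 2 tensors (K and V) per layer, one hidden-size vector per token,
    # stored here in fp16 (2 bytes per value)
    kv_cache_mb = 2 * n_layers * hidden_size * context_len * batch_size * 2 / 1e6
    # Activation memory: crude stand-in proportional to tokens and hidden size
    activation_mb = n_layers * hidden_size * context_len * batch_size * 2 / 1e6
    total = model_mb + kv_cache_mb + activation_mb + overhead_mb
    return {
        "Total": round(total),
        "KV Cache": round(kv_cache_mb),
        "Model Size": round(model_mb),
        "Activation Memory": round(activation_mb),
        "Grad & Optimizer memory": 0,   # zero for inference-only use
        "cuda + other overhead": overhead_mb,
    }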
Purpose and Utility
The GPU_Poor project serves multiple purposes, such as:
- Determining the tokens per second a GPU can achieve.
- Estimating the total time required for model finetuning.
- Identifying suitable quantization for a given GPU.
- Assessing the maximum context length and batch size a GPU can handle (a simple back-of-the-envelope version of this calculation is sketched after this list).
- Analyzing which elements consume the most GPU memory and exploring alternatives for optimizing LLM loading on the GPU.
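For example, the maximum-context-length question largely comes down to how much memory is left for the KV cache after the weights are loaded, since the cache grows linearly with sequence length. Below is a minimal sketch, assuming an fp16 KV cache and hypothetical model dimensions; it is not the project's exact accounting.

def max_context_length(free_memory_mb, n_layers, hidden_size,
                       batch_size=1, bytes_per_value=2):
    # KV cache bytes per token: 2 tensors (K and V) per layer,
    # one hidden-size vector each, for batch_size sequences
    bytes_per_token = 2 * n_layers * hidden_size * batch_size * bytes_per_value
    return int(free_memory_mb * 1e6 // bytes_per_token)

# Hypothetical: ~4 GB left over for the cache on a 32-layer, 4096-hidden model
print(max_context_length(4000, 32, 4096))  # roughly 7600 tokens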
Additional Considerations and Accuracy
Estimating capability from model size alone is usually not enough, because inference brings additional memory costs such as the KV cache. The project's results account for these factors, keeping estimates within a 500 MB margin of actual usage, and an example comparison table across various models and GPUs confirms this accuracy.
Calculation Methodology
The project's estimates are derived from a combination of model size, KV cache requirements, activation memory, optimizer and gradient memory, and other overheads. The methodology accommodates different quantization methods and training or inference configurations.
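As one concrete piece of that breakdown, the gradient and optimizer term for full finetuning can be approximated with the usual Adam rule of thumb: gradients plus two optimizer states per parameter, on top of the weights themselves. The byte counts below (fp16 gradients, fp32 Adam states) are common assumptions, not necessarily the exact constants GPU_Poor uses, and parameter-efficient methods such as LoRA shrink this term dramatically.

def grad_and_optimizer_mb(params_b, grad_bytes=2, optimizer_state_bytes=8):
    # Per trainable parameter: one gradient value plus two Adam moment estimates
    per_param_bytes = grad_bytes + optimizer_state_bytes
    return params_b * 1e9 * per_param_bytes / 1e6

# Hypothetical full finetune of a 7B model: ~70 GB just for this term
print(round(grad_and_optimizer_mb(7)))  # 70000 MB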
Current and Future Development
Ongoing development aims to add features such as vLLM support in the tokens-per-second estimate and integration of additional quantization methods like AWQ. The project continues to evolve, adding functionality that gives users even more precise insight into how their hardware matches up against LLMs.
Conclusion
The GPU_Poor project provides a valuable toolset for users aiming to understand and optimize their GPU usage for LLMs. It equips users with the essential numbers to make informed decisions about model deployment and training based on their hardware's capabilities.