Introduction to GPU-Benchmarks-on-LLM-Inference
The GPU-Benchmarks-on-LLM-Inference project asks a fundamental question for AI enthusiasts and professionals: for large language model (LLM) inference, should one use multiple NVIDIA GPUs or Apple silicon? The question is particularly relevant for LLaMA models, which are widely used in natural language processing.
Description and Goal
The main purpose of this project is to evaluate the inference speed of LLaMA models across various GPU setups. It runs the models with llama.cpp on hardware ranging from NVIDIA GPUs to Apple silicon machines: a 13-inch M1 MacBook Air, a 14-inch M1 Max MacBook Pro, an M2 Ultra Mac Studio, and a 16-inch M3 Max MacBook Pro. These evaluations reveal how different GPUs handle the computations associated with language models.
Performance Overview
The key metric is the speed at which each GPU can generate 1024 tokens with LLaMA 3, measured in tokens per second; higher is better. Various models and configurations were tested, with the H100 PCIe 80GB emerging as a top performer.
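As a hedged illustration of what the metric means (the timing figures below are made up, not taken from the benchmark tables), throughput is simply tokens generated divided by elapsed seconds:

```shell
# Illustrative only: 1024 tokens generated in a hypothetical 8.5-second run
tokens=1024
seconds=8.5
awk -v t="$tokens" -v s="$seconds" 'BEGIN { printf "%.1f tokens/s\n", t/s }'
```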
Results showed that most GPUs handled the 8B Q4_K_M and 8B F16 models reasonably well, but many fell short on larger models such as 70B because of memory constraints, reported as out-of-memory (OOM) errors.
Model Weights and Access
The project credits shawwn for the original LLaMA model weights in the 7B, 13B, 30B, and 65B sizes, and directs users to Meta AI and Hugging Face repositories to access the current LLaMA models for ease of use.
How to Use the Project
Building the Environment
For NVIDIA GPUs, llama.cpp is built with cuBLAS so that computation is accelerated on CUDA cores:
!make clean && LLAMA_CUBLAS=1 make -j
For Apple Silicon, Metal is enabled by default, which optimizes performance for Apple hardware:
!make clean && make -j
Executing Text Completion and Chat Templates
Using the provided command lines, users can run text completion and chat-template prompts with LLaMA 3. For instance, setting -ngl to a number larger than the model's layer count (here 10000) offloads every layer so that all inference is handled by the GPU:
!./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 -p ...
This serves as an example where users can modify inputs as needed for different scenarios.
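One way to make those modifications convenient is to wrap the invocation in a small function (run_llama is a hypothetical helper, not part of the project; it assumes ./main and the model file exist from the earlier build and download steps):

```shell
# Hypothetical wrapper around the invocation above; only the prompt varies.
# Assumes ./main was built and the GGUF model was downloaded as shown earlier.
run_llama() {
  ./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf \
    --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 \
    -p "$1"
}
```

It could then be called per scenario, e.g. run_llama "Summarize this paragraph: ...".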
Benchmarking
Running benchmarks provides insights into how these models perform under various loads:
!./llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf
VRAM and Capacity Requirements
The documentation includes a guide to VRAM requirements for different model sizes in quantized and original formats, helping prospective users estimate their hardware needs with tools like the LLM RAM Calculator.
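A rough back-of-the-envelope version of such an estimate can be computed directly (the ~4.5 bits-per-weight figure for Q4_K_M is an approximation, and real usage adds KV-cache and buffer overhead on top of the weights):

```shell
# Rough weight-memory estimate for an 8B-parameter model (approximation only)
params=8000000000
awk -v p="$params" 'BEGIN {
  q4  = p * 4.5 / 8 / 1e9   # Q4_K_M averages roughly 4.5 bits per weight
  f16 = p * 16  / 8 / 1e9   # F16 uses 16 bits per weight
  printf "Q4_K_M ~ %.1f GB, F16 ~ %.1f GB (weights only)\n", q4, f16
}'
```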
Perplexity Assessment
Perplexity tests gauge model quality, with a detailed table showing how different quantizations affect it; lower perplexity values denote better predictions by the model.
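For context (this formula is standard, not specific to the project), perplexity is the exponential of the mean negative log-likelihood per token; the NLL value below is illustrative, not from the table:

```shell
# Perplexity from a hypothetical mean negative log-likelihood of 1.8 nats/token
awk -v nll=1.8 'BEGIN { printf "perplexity = %.2f\n", exp(nll) }'
```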
Final Observations
The benchmarks, executed across a variety of setups including popular NVIDIA gaming GPUs running Ubuntu, reveal significant variation in performance, processing speed, and memory utilization, helping users identify the most cost-effective and performance-efficient hardware for their natural language processing tasks.
Overall, this project provides a comprehensive evaluation of GPU performance for large language models, helping researchers and developers make informed decisions about their hardware configurations to optimize language model inferences.