GPU-Benchmarks-on-LLM-Inference
This study evaluates LLM inference performance across different hardware, including NVIDIA GPUs and Apple Silicon, running LLaMA 3 models with llama.cpp. It includes detailed benchmarks on RunPod GPU instances and various MacBook models. The focus is on average speed for 1024-token generation and for prompt evaluation, both reported in tokens per second. These results can help in choosing the right GPU for large language model workloads.
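As a quick sketch of how the reported metric is computed (the token count and timing below are hypothetical, not taken from the benchmark results):

```python
# Hypothetical example: deriving a tokens-per-second figure from a timed run.
generated_tokens = 1024      # tokens produced during the generation phase
elapsed_seconds = 12.8       # wall-clock time for that phase (made-up value)

tokens_per_second = generated_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # 80.0 tokens/s
```

Benchmarks in the tables average this figure over repeated runs, so single-run numbers may vary.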