FlashInfer Project Introduction
FlashInfer is a library of high-performance GPU kernels for Large Language Models (LLMs). It targets LLM serving and inference workloads, where attention operators dominate runtime, and delivers state-of-the-art kernel performance across a variety of serving scenarios.
Key Features of FlashInfer
- Comprehensive Attention Kernels: FlashInfer provides a full suite of attention kernels covering the common stages of LLM serving, with single-request and batched kernels for Prefill, Decode, and Append attention. The kernels work with different KV-Cache formats, including Padded Tensor, Ragged Tensor, and Page Table, for both versatility and efficiency.
- Optimized Shared-Prefix Batch Decoding: FlashInfer accelerates shared-prefix batch decoding through a technique called cascading, which computes attention over the shared prefix separately from each request's unique suffix and then merges the partial results (a conceptual sketch of this merge follows the list). This yields up to a 31x speedup over non-cascading methods when the shared prompt is long or the batch size is large.
- Accelerated Attention for Compressed/Quantized KV-Cache: LLMs are increasingly deployed with quantized or compressed KV-Cache to reduce memory footprint, and FlashInfer optimizes for these settings with kernels such as Grouped-Query Attention, Fused-RoPE Attention, and Quantized Attention.
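To make the cascading idea above concrete, here is a conceptual, plain-PyTorch sketch of the state-merge identity it relies on: attention over the shared prefix and over a request's unique suffix can be computed separately and then combined using their per-head log-sum-exp values. This is only an illustration of the math, not FlashInfer's API; the attention_with_lse helper, tensor names, and sizes are made up for the example.
import torch

def attention_with_lse(q, k, v):
    # q: [num_heads, head_dim]; k, v: [kv_len, num_heads, head_dim]
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hd,nhd->hn", q.float(), k.float()) * scale  # [num_heads, kv_len]
    lse = torch.logsumexp(logits, dim=-1)                              # log-sum-exp per head
    out = torch.einsum("hn,nhd->hd", torch.softmax(logits, dim=-1), v.float())
    return out, lse

num_heads, head_dim = 32, 128
k_prefix = torch.randn(1024, num_heads, head_dim)  # KV of the shared prefix (reused across the batch)
v_prefix = torch.randn(1024, num_heads, head_dim)
k_suffix = torch.randn(64, num_heads, head_dim)    # KV of one request's unique suffix
v_suffix = torch.randn(64, num_heads, head_dim)
q = torch.randn(num_heads, head_dim)               # one decode query

# Attend to prefix and suffix independently, then merge the two partial results.
o_p, lse_p = attention_with_lse(q, k_prefix, v_prefix)
o_s, lse_s = attention_with_lse(q, k_suffix, v_suffix)
w_p = torch.sigmoid(lse_p - lse_s)                 # = exp(lse_p) / (exp(lse_p) + exp(lse_s))
o = w_p[:, None] * o_p + (1 - w_p[:, None]) * o_s  # equals attention over prefix and suffix concatenated
Because the prefix term is identical for every request in the batch, it only needs to be computed once and can stay resident in fast memory, which is where the speedup of cascading comes from.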
Compatibility and Integration
FlashInfer is flexible as well as fast: it provides PyTorch, TVM, and C++ (header-only) APIs, so it can be integrated into existing projects with little modification.
Getting Started
For those eager to try FlashInfer, the easiest approach is through its PyTorch API. Here’s a quick guide on installation and usage:
Installation
FlashInfer provides prebuilt wheels for Linux, which simplifies the installation process:
# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
For different CUDA and PyTorch versions, users can refer to the official documentation for guidance.
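As a quick sanity check before installing, you can inspect which PyTorch build and CUDA version are present locally; the exact wheel index URL for each combination is listed in the documentation (the version strings in the comments below are only examples):
import torch

# Check the locally installed PyTorch build and its CUDA version to pick the matching wheel index
print(torch.__version__)   # e.g. 2.4.0+cu124
print(torch.version.cuda)  # e.g. 12.4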
Example Usage
Below is an example of single-request decode attention using the PyTorch API:
import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

# KV-Cache of a single request: [kv_len, num_kv_heads, head_dim], fp16, on GPU 0
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# Decode attention: a single query token attends to the whole KV-Cache
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # output: [num_qo_heads, head_dim]
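The same single-request interface also handles prefill and append attention, where the query contains multiple new tokens. The snippet below continues the example above; it relies on the single_prefill_with_kv_cache function with a causal-mask option as described in FlashInfer's documentation, and the 128-token query length is just an illustrative choice.
# Append/prefill attention: 128 new tokens attend to the KV-Cache, with a causal mask among themselves
qo_len = 128
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # output: [qo_len, num_qo_heads, head_dim]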
Benchmarking and Performance
FlashInfer’s performance can be evaluated with nvbench, NVIDIA’s framework for profiling CUDA kernels. Users can compile and run the bundled benchmarks to measure kernel performance under different workloads.
Adoption and Influence
FlashInfer has been adopted by a number of prominent projects, including MLC-LLM, Punica, SGLang, ScaleLLM, vLLM, and TGI, which speaks to its reliability and effectiveness.
Acknowledgements
FlashInfer draws inspiration from pioneering projects such as FlashAttention, vLLM, stream-K, and the NVIDIA Cutlass library, which inform its design.
Overall, FlashInfer is a strong toolkit for developers and researchers working with LLMs, combining high performance, flexibility, and ease of integration.