FlashInfer Project Introduction
FlashInfer is a library of high-performance GPU kernels for Large Language Models (LLMs). It targets LLM serving and inference workloads, where attention operators dominate runtime, and delivers state-of-the-art kernel performance across a variety of serving scenarios.
Key Features of FlashInfer
- Comprehensive Attention Kernels: FlashInfer provides a full suite of attention kernels covering the common stages of LLM serving, with single-request and batched kernels for Prefill, Decode, and Append attention. The kernels work with different KV-Cache formats, including Padded Tensor, Ragged Tensor, and Page Table, for both versatility and efficiency.
- Optimized Shared-Prefix Batch Decoding: FlashInfer accelerates shared-prefix batch decoding through a technique called cascading, which computes attention over the shared prefix separately from each request's unique suffix and then merges the partial results (a conceptual sketch of this merge follows the list). This yields up to a 31x speedup over non-cascading methods when the shared prompt is long or the batch size is large.
- Accelerated Attention for Compressed/Quantized KV-Cache: LLMs are increasingly deployed with quantized or compressed KV-Cache to reduce memory footprint, and FlashInfer optimizes for these settings with kernels such as Grouped-Query Attention, Fused-RoPE Attention, and Quantized Attention.
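To make the cascading idea above concrete, here is a conceptual, plain-PyTorch sketch of the state-merge identity it relies on: attention over the shared prefix and over a request's unique suffix can be computed separately and then combined using their per-head log-sum-exp values. This is only an illustration of the math, not FlashInfer's API; the attention_with_lse helper, tensor names, and sizes are made up for the example.
import torch

def attention_with_lse(q, k, v):
    # q: [num_heads, head_dim]; k, v: [kv_len, num_heads, head_dim]
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hd,nhd->hn", q.float(), k.float()) * scale  # [num_heads, kv_len]
    lse = torch.logsumexp(logits, dim=-1)                              # log-sum-exp per head
    out = torch.einsum("hn,nhd->hd", torch.softmax(logits, dim=-1), v.float())
    return out, lse

num_heads, head_dim = 32, 128
k_prefix = torch.randn(1024, num_heads, head_dim)  # KV of the shared prefix (reused across the batch)
v_prefix = torch.randn(1024, num_heads, head_dim)
k_suffix = torch.randn(64, num_heads, head_dim)    # KV of one request's unique suffix
v_suffix = torch.randn(64, num_heads, head_dim)
q = torch.randn(num_heads, head_dim)               # one decode query

# Attend to prefix and suffix independently, then merge the two partial results.
o_p, lse_p = attention_with_lse(q, k_prefix, v_prefix)
o_s, lse_s = attention_with_lse(q, k_suffix, v_suffix)
w_p = torch.sigmoid(lse_p - lse_s)                 # = exp(lse_p) / (exp(lse_p) + exp(lse_s))
o = w_p[:, None] * o_p + (1 - w_p[:, None]) * o_s  # equals attention over prefix and suffix concatenated
Because the prefix term is identical for every request in the batch, it only needs to be computed once and can stay resident in fast memory, which is where the speedup of cascading comes from.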
Compatibility and Integration
FlashInfer is flexible as well as fast: it provides PyTorch, TVM, and C++ (header-only) APIs, so it can be integrated into existing projects with little modification.
Getting Started
For those eager to try FlashInfer, the easiest approach is through its PyTorch API. Here’s a quick guide on installation and usage:
Installation
FlashInfer provides prebuilt wheels for Linux, which simplifies the installation process:
# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
For different CUDA and PyTorch versions, users can refer to the official documentation for guidance.
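As a quick sanity check before installing, you can inspect which PyTorch build and CUDA version are present locally; the exact wheel index URL for each combination is listed in the documentation (the version strings in the comments below are only examples):
import torch

# Check the locally installed PyTorch build and its CUDA version to pick the matching wheel index
print(torch.__version__)   # e.g. 2.4.0+cu124
print(torch.version.cuda)  # e.g. 12.4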
Example Usage
Below is an example of single-request decode attention using the PyTorch API:
import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

# KV-Cache of a single request: [kv_len, num_kv_heads, head_dim], fp16, on GPU 0
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# Decode attention: a single query token attends to the whole KV-Cache
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # output: [num_qo_heads, head_dim]
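The same single-request interface also handles prefill and append attention, where the query contains multiple new tokens. The snippet below continues the example above; it relies on the single_prefill_with_kv_cache function with a causal-mask option as described in FlashInfer's documentation, and the 128-token query length is just an illustrative choice.
# Append/prefill attention: 128 new tokens attend to the KV-Cache, with a causal mask among themselves
qo_len = 128
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # output: [qo_len, num_qo_heads, head_dim]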
Benchmarking and Performance
FlashInfer’s performance can be evaluated with nvbench, NVIDIA’s framework for profiling CUDA kernels. Users can compile and run the bundled benchmarks to measure kernel performance under different workloads.
Adoption and Influence
FlashInfer has been adopted by a number of prominent projects, including MLC-LLM, Punica, SGLang, ScaleLLM, vLLM, and TGI, which speaks to its reliability and effectiveness.
Acknowledgements
FlashInfer draws inspiration from pioneering projects such as FlashAttention, vLLM, stream-K, and the NVIDIA Cutlass library, which inform its design.
Overall, FlashInfer is a strong toolkit for developers and researchers working with LLMs, combining high performance, flexibility, and ease of integration.