
flashinfer

Efficient GPU Kernels for Diverse LLM Inference Needs

Product Description

FlashInfer is a library of high-performance GPU kernels for Large Language Model (LLM) serving, including FlashAttention and SparseAttention. It covers diverse workloads, from single-request to batch serving, across multiple KV-Cache formats. The library optimizes shared-prefix batch decoding with cascading techniques, delivering up to 31x speedup, and integrates with PyTorch, TVM, and C++ APIs. It also supports quantized attention, enabling memory-efficient and fast deployment of modern LLMs.
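As an illustration of the PyTorch-facing API mentioned above, the sketch below calls single_decode_with_kv_cache for a single-request decode step. The tensor shapes and the default "NHD" KV-Cache layout are assumptions based on the library's documentation, not part of this listing.

```python
# Minimal sketch: single-request decode attention with FlashInfer's PyTorch API.
# Head counts, head_dim, and kv_len are illustrative values, not library defaults.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 128, 2048

# Query for one decode step and the accumulated KV-Cache, assuming the "NHD"
# layout ([kv_len, num_heads, head_dim]), in half precision on the GPU.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Fused decode attention kernel; returns the attention output of shape
# [num_qo_heads, head_dim].
out = flashinfer.single_decode_with_kv_cache(q, k, v)
```

For batch serving with paged KV-Caches, the library exposes wrapper classes (e.g. BatchDecodeWithPagedKVCacheWrapper) that plan the kernel launch ahead of time; the single-request call above is only the simplest entry point.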
Project Details