Marlin Project Overview
Marlin is a highly optimized GPU kernel for large language model (LLM) inference. Named after one of the fastest fish, it implements a Mixed Auto-Regressive Linear kernel: an FP16xINT4 matmul in which FP16 activations are multiplied by 4-bit quantized weights. Its primary goal is to accelerate the linear layers that dominate LLM inference, delivering close to the ideal 4x speedup over FP16 even at batch sizes of 16-32 tokens, where prior 4-bit kernels fall off. This makes it well suited to larger-scale serving, speculative decoding, and advanced multi-inference schemes.
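To make the FP16xINT4 contract concrete, the following PyTorch reference sketches the math such a kernel fuses: 4-bit weights (packed eight per int32) are unpacked, scaled by per-group FP16 scales, and multiplied by FP16 activations. The packing order, group size, and symmetric zero-point of 8 are illustrative assumptions here, not Marlin's internal format.

```python
import torch

def unpack_int4(packed: torch.Tensor, k: int, n: int) -> torch.Tensor:
    """Unpack a (k // 8, n) int32 tensor holding eight 4-bit values each into (k, n).

    The nibble order (lowest bits first along k) is an illustrative assumption,
    not Marlin's internal layout.
    """
    nibbles = [(packed >> (4 * i)) & 0xF for i in range(8)]   # 8 nibbles per int32
    return torch.stack(nibbles, dim=1).reshape(k, n)          # (k // 8, 8, n) -> (k, n)

def fp16xint4_matmul_ref(a, packed_w, scales, groupsize=128):
    """Reference FP16xINT4 matmul: a is (m, k) FP16, packed_w packs a (k, n) 4-bit
    weight matrix, scales is (k // groupsize, n) FP16."""
    k, n = packed_w.shape[0] * 8, packed_w.shape[1]
    w_int = unpack_int4(packed_w, k, n).to(torch.float16) - 8      # assumed symmetric range [-8, 7]
    w_fp16 = w_int * scales.repeat_interleave(groupsize, dim=0)    # per-group dequantization
    return (a.float() @ w_fp16.float()).half()                     # plain matmul; Marlin fuses all of this

m, k, n = 16, 4096, 4096
a = torch.randn(m, k, dtype=torch.float16)
packed_w = torch.randint(0, 2**31 - 1, (k // 8, n), dtype=torch.int32)
scales = torch.rand(k // 128, n, dtype=torch.float16)
out = fp16xint4_matmul_ref(a, packed_w, scales)   # (16, 4096) FP16 output
```

A fused kernel like Marlin performs the dequantization in registers and feeds the tensor cores directly, so the full FP16 weight matrix is never materialized in memory.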
Key Techniques
Marlin leverages several sophisticated techniques to optimize GPU usage, ensuring it fully utilizes resources like global memory, L2 cache, and tensor cores. Here are some of the core strategies employed:
- Efficient Memory Use: Activations are fetched once from the L2 cache and reused within registers to minimize load traffic, while weight loads are issued asynchronously and marked for eviction so they do not pollute the L2 cache.
- Double Buffering: Shared memory loads are double buffered, so they overlap seamlessly with computation and global loads rather than stalling them.
- Optimized Dequantization: Dequantization and tensor core instructions are carefully ordered so that both GPU pipelines stay saturated and neither bottlenecks the other.
- Reshuffling for Access Patterns: Weights and scales are rearranged offline so that they can be loaded with ideal access patterns and dequantized directly into the layout the tensor cores expect (see the sketch after this list).
- Warp Utilization: Multiple warps within a thread block compute partial results of the same output tile, improving compute utilization and latency hiding without enlarging the output tile size.
- Vector Length & Memory Transformation: Loads and stores use the maximum vector width, and results are transformed in shared memory so that global writes are wide and conflict-free.
- Striped Partitioning: A striped partitioning of the work across SMs ensures good utilization over a wide range of matrix shapes while minimizing the cost of global reduction steps.
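As a rough illustration of the offline reshuffling idea mentioned above, the sketch below reorders a weight matrix into tile-major order on the host so that a kernel could stream each tile with one contiguous read; the tile sizes and permutation are hypothetical and do not reflect Marlin's actual layout.

```python
import torch

def reorder_for_kernel(w: torch.Tensor, tile_k: int = 16, tile_n: int = 64) -> torch.Tensor:
    """Offline reshuffle of a (k, n) weight matrix into tile-major order.

    Generic illustration only: the tile shape and ordering are hypothetical,
    not Marlin's real permutation.
    """
    k, n = w.shape
    assert k % tile_k == 0 and n % tile_n == 0
    tiles = w.reshape(k // tile_k, tile_k, n // tile_n, tile_n)
    # Make the tile indices the leading dimensions so every (tile_k x tile_n) tile
    # is stored contiguously and can be fetched with a single coalesced load.
    tiles = tiles.permute(0, 2, 1, 3).contiguous()
    return tiles.reshape(k, n)   # same values, tile-major storage order

w = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)   # 4-bit values, before packing
w_reordered = reorder_for_kernel(w)                          # done once, before packing and upload
```

Because this shuffle happens once offline, the kernel itself never pays for any runtime reordering.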
Performance and Benchmarks
Marlin compares favorably with other 4-bit inference kernels, maintaining high efficiency across batch sizes. Whereas the speedups of existing kernels drop off quickly as the batch size grows, Marlin sustains close to its maximum speedup.
As demonstrated in benchmarks:
- Ideal Speedups: Marlin achieves close to the optimal 3.87x speedup consistently, even as batch sizes grow larger.
- Adaptability to Matrix Sizes: Thanks to its partitioning scheme, Marlin adeptly handles smaller real-world matrices on various GPUs.
- Sustained Performance: Even at reduced GPU clock speeds, Marlin’s performance holds steady, unlike competing kernels.
Requirements and Usage
Marlin requires CUDA 11.8 or newer, an NVIDIA GPU with compute capability 8.0 or higher (Ampere or newer), PyTorch 2.0.0 or newer, and the numpy and transformers libraries. It is installed with pip from the project’s root directory.
The Marlin kernel is exposed through a torch module, marlin.Layer, which handles converting a model’s linear layers to the compressed 4-bit format and running them during inference. Advanced users can also call the kernel directly, provided the quantized weights and scales have been laid out in the format the kernel expects.
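A minimal usage sketch follows; the constructor signature, group-size argument, and packing step shown here are assumptions for illustration and should be checked against the project’s actual docstrings.

```python
import torch
import marlin

# Hypothetical usage sketch, not verbatim API: argument names and the packing step
# are assumptions and should be verified against the repository.
k, n, groupsize = 4096, 4096, 128

layer = marlin.Layer(k, n, groupsize=groupsize).cuda()   # assumed constructor signature
# ... populate the layer with packed 4-bit weights and FP16 scales here ...

x = torch.randn(16, k, dtype=torch.float16, device="cuda")
y = layer(x)   # forward pass runs the FP16xINT4 Marlin kernel
```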
Example: GPTQ
The project includes a slightly modified version of the GPTQ algorithm that produces Marlin-compatible models, allowing models such as Llama2 to be compressed to a 4-bit format with little loss in accuracy. Evaluation scripts are provided for assessing the quantized models on standard tasks.
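For intuition about the format such a quantizer produces (4-bit integer weights plus per-group FP16 scales), here is a simple per-group round-to-nearest baseline; GPTQ itself additionally compensates each rounding error using second-order information, which this sketch omits, and the symmetric range and group size here are assumptions.

```python
import torch

def quantize_rtn_4bit(w: torch.Tensor, groupsize: int = 128):
    """Per-group, symmetric round-to-nearest 4-bit quantization of a (k, n) weight.

    Simplified baseline for intuition only; GPTQ improves on this by adjusting the
    remaining weights to compensate for each rounding error.
    """
    k, n = w.shape
    wg = w.reshape(k // groupsize, groupsize, n)
    scales = wg.abs().amax(dim=1, keepdim=True) / 7            # map each group's max magnitude to the int4 range
    q = torch.clamp(torch.round(wg / scales), -8, 7)           # 4-bit integers in [-8, 7]
    w_deq = (q * scales).reshape(k, n)                         # what the kernel reconstructs at runtime
    return q.reshape(k, n).to(torch.int8), scales.squeeze(1).to(torch.float16), w_deq

w = torch.randn(4096, 4096)
q, scales, w_deq = quantize_rtn_4bit(w)
print((w - w_deq).abs().mean())   # average reconstruction error of plain rounding
```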
Conclusion
Marlin achieves its efficiency in LLM inference by combining careful kernel engineering with a detailed understanding of GPU architecture, offering a robust solution for modern inference workloads.