
marlin

Optimized GPU Kernels for Large Language Model Inference

Product Description

Marlin is an FP16xINT4 matrix-multiplication kernel that accelerates LLM inference at batch sizes of up to 16-32 tokens. It sustains higher throughput than comparable kernels across varying GPU conditions and integrates easily with CUDA and torch. Key techniques include asynchronous global weight loads and efficient allocation of GPU resources.
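To illustrate the arithmetic an FP16xINT4 kernel performs, here is a minimal NumPy sketch (not Marlin's actual CUDA implementation, and not its API): FP16 weights are quantized to INT4 with per-column scales, then dequantized on the fly during the multiply. Marlin fuses these steps in a single GPU kernel; the function names below are illustrative only.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-column INT4 quantization of an FP16 weight matrix."""
    # INT4 symmetric range is [-8, 7]; divide by 7 so the scale covers max |w|.
    scale = np.abs(w).max(axis=0) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def matmul_fp16_int4(a, q, scale):
    """Multiply FP16 activations by INT4 weights, dequantizing on the fly."""
    w_hat = q.astype(np.float16) * scale  # dequantized weight approximation
    return a @ w_hat

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 64)).astype(np.float16)  # a batch of 16 tokens
w = rng.standard_normal((64, 32)).astype(np.float16)  # FP16 weight matrix
q, s = quantize_int4(w)
out = matmul_fp16_int4(a, q, s)
ref = a @ w  # full-precision reference for comparison
```

The INT4 representation stores 4x fewer bits per weight than FP16, which is what lets a kernel like Marlin trade reduced memory traffic for a small dequantization cost at inference time.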
Project Details