marlin
Marlin is an FP16xINT4 matrix-multiplication kernel optimized for LLM inference at batch sizes of 16-32 tokens. It outperforms comparable kernels across a range of GPU conditions and integrates easily with CUDA and PyTorch. Key techniques include asynchronous loading of weights from global memory and careful allocation of GPU resources.
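To make the FP16xINT4 idea concrete, here is a small illustrative sketch in plain Python (this is NOT Marlin's kernel code, and the function names are invented for this example): weights are stored as packed 4-bit signed integers plus a scale, and at matmul time they are dequantized and multiplied against full-precision activations.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) into bytes, two values per byte."""
    assert len(values) % 2 == 0
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed 4-bit ints from packed bytes."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # sign-extend
    return out

def dequant_dot(packed_w, scale, x):
    """Compute (scale * W_int4) . x for one quantized weight row."""
    return sum(scale * q * xi for q, xi in zip(unpack_int4(packed_w), x))

# Round-trip check and a tiny dequantized dot product:
row = [-8, 7, 3, -1]
assert unpack_int4(pack_int4(row)) == row
y = dequant_dot(pack_int4(row), 0.5, [1.0, 1.0, 1.0, 1.0])
# y == 0.5 * (-8 + 7 + 3 - 1) == 0.5
```

Marlin performs this dequantize-and-multiply on the GPU with FP16 activations and tensor cores, overlapping the weight loads with compute; the sketch above only shows the storage format and arithmetic, not the scheduling that makes the kernel fast.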