KIVI
KIVI significantly reduces LLM memory usage and improves throughput through a tuning-free 2-bit quantization scheme for KV caches. It cuts memory requirements by 2.6×, enabling larger batch sizes and boosting throughput by up to 3.47×. Compatible with models such as Llama-2 and Mistral, KIVI preserves model quality while addressing the speed and memory bottlenecks of inference. See the GitHub repository for new features and examples, including HuggingFace Transformers integration and support for Mistral models.
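To make the idea concrete, below is a minimal NumPy sketch of asymmetric 2-bit quantization of a cached tensor, not KIVI's actual implementation. The helper names (`quantize_2bit`, `dequantize_2bit`) and the toy shapes are illustrative assumptions; KIVI itself applies per-channel quantization to the key cache and per-token quantization to the value cache, which here corresponds only to the choice of reduction axis.

```python
import numpy as np

def quantize_2bit(x, axis):
    # Asymmetric 2-bit quantization along `axis`: map each group's
    # range [min, max] onto the four integer levels {0, 1, 2, 3}.
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0              # 3 = 2**2 - 1 level gaps
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero for flat groups
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize_2bit(q, scale, xmin):
    # Reconstruct an approximation of the original values.
    return q.astype(np.float32) * scale + xmin

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 16)).astype(np.float32)  # (tokens, channels)

# Per-channel grouping (reduce over the token axis), as KIVI does for keys;
# using axis=1 instead would give per-token grouping, as for values.
qk, sk, mk = quantize_2bit(keys, axis=0)
keys_hat = dequantize_2bit(qk, sk, mk)
print("max abs reconstruction error:", np.abs(keys - keys_hat).max())
```

In a real deployment the 2-bit codes would also be bit-packed (four values per byte), which is where the memory savings come from; this sketch stores them in `uint8` for clarity.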