hqq
Half-Quadratic Quantization (HQQ) is a fast, accurate method for quantizing models without calibration data, which makes it practical for very large models. It supports bit widths from 8 down to 1 bit and is compatible with optimized inference kernels and PEFT training. Quantized models run faster in both inference and training, particularly when combined with torch.compile. Settings such as nbits and group size let users trade model quality against memory footprint. Using it requires only replacing the model's linear layers with their quantized counterparts, and the configuration can be tailored to the model. HQQ is installable via pip and integrates with HuggingFace transformers.
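To make the roles of nbits and group size concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It is an illustration only: it uses plain round-to-nearest with a min/max scale and zero-point per group, not HQQ's half-quadratic solver, and every function name in it is hypothetical rather than part of the hqq package.

```python
import random

def quantize_group(values, nbits):
    """Round-to-nearest affine quantization of one group of floats.

    Illustrative only: HQQ refines the scale/zero-point with a
    half-quadratic solver, which this sketch does not implement.
    """
    levels = (1 << nbits) - 1              # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                # lo acts as the zero-point offset

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float values."""
    return [c * scale + zero for c in codes]

def quantize_dequantize(weights, nbits, group_size):
    """Quantize a flat weight list group by group and reconstruct it."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, zero = quantize_group(group, nbits)
        out.extend(dequantize_group(codes, scale, zero))
    return out

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic "weights" standing in for one row of a linear layer.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

for nbits, group_size in [(8, 64), (4, 64), (2, 64), (2, 16)]:
    approx = quantize_dequantize(weights, nbits, group_size)
    print(f"nbits={nbits} group_size={group_size} "
          f"mean_abs_error={mean_abs_error(weights, approx):.4f}")
```

Fewer bits give a coarser grid and a larger reconstruction error, while smaller groups typically reduce the error at the cost of storing more scale and zero-point values. That trade-off is what the nbits and group size settings control.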