Half-Quadratic Quantization (HQQ) Project Introduction
Overview of HQQ
Half-Quadratic Quantization (HQQ) is a calibration-free approach to model quantization. Because it requires no calibration data, it can quantize large-scale models quickly, making it a practical choice for fast model adaptation without the traditional data-driven calibration process.
Features and Advantages of HQQ
- Speed and Versatility: HQQ can swiftly quantize models across various domains, such as language and vision, supporting a diverse range of bit settings (8, 4, 3, 2, and 1 bits). This flexibility makes it applicable to a broad range of model types, including large language models and computer vision applications.
- Integration with Optimized Kernels: Utilizing a linear dequantization step, HQQ ensures compatibility with high-performance CUDA/Triton kernels (a sketch of this step follows the list). This integration allows for seamless deployment in environments that support such optimized computational frameworks, enhancing execution speed during inference and training.
- Compatibility and Optimization: HQQ is designed to work with PEFT (Parameter-Efficient Fine-Tuning) training methods and aims to achieve full compatibility with torch.compile, a feature that accelerates both inference and model training.
- Economical Use of Resources: With settings like nbits=4, group_size=64, and axis=1, HQQ offers an optimal balance of model quality, memory efficiency, and speed. This makes it suitable for scenarios where computational resources or memory are limited.
- Enhancements with HQQ+: An advanced version, HQQ+, incorporates trainable low-rank adapters to improve quantization quality at lower bit depths, providing a refined toolset for fine-tuning complex models.
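To make the "linear dequantization step" concrete: the low-bit weights are mapped back to the compute dtype with a simple per-group scale and shift, roughly W ≈ (W_q - zero) * scale. The minimal PyTorch sketch below illustrates this idea only; the tensor names, shapes, and unpacked 8-bit storage are illustrative assumptions, not HQQ's actual packed format:
import torch
# Illustrative example: a weight matrix quantized to 4 bits in groups of 64 along axis=1
out_features, in_features, group_size = 128, 256, 64
n_groups = in_features // group_size
# Assumed quantized state: integer codes plus one (scale, zero) pair per group
w_q   = torch.randint(0, 16, (out_features, in_features), dtype=torch.uint8)
scale = torch.rand(out_features, n_groups, 1)
zero  = torch.rand(out_features, n_groups, 1)
# Linear dequantization: W_r = (W_q - zero) * scale, applied group-wise
w_r = (w_q.view(out_features, n_groups, group_size).float() - zero) * scale
w_r = w_r.view(out_features, in_features).to(torch.float16)  # cast to the compute dtype
Because the operation is just an elementwise rescale and shift, it maps naturally onto fused CUDA/Triton kernels.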
Basic Implementation Steps
To use HQQ in a project, replace the model's linear layers with HQQ's specialized layers. Here's a simplified setup in Python:
import torch
from hqq.core.quantize import *
# Define quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# Replace a standard linear layer (an existing torch.nn.Linear) with HQQ's layer
hqq_layer = HQQLinear(
    your_linear_layer,            # the nn.Linear module to quantize
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device='cuda',
    initialize=True,              # quantize immediately on construction
    del_orig=True                 # free the original full-precision weights
)
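Once constructed, the HQQ layer acts as a drop-in replacement for the original linear layer in the forward pass. The following is a minimal usage sketch; the batch size and feature dimension are placeholders that must match the layer that was quantized:
import torch
# Placeholder dimensions; the input's last dimension must equal the original layer's in_features
batch_size, in_features = 8, 4096
x = torch.randn(batch_size, in_features, dtype=torch.float16, device='cuda')
# The quantized layer dequantizes on the fly and behaves like the linear layer it replaced
with torch.no_grad():
    y = hqq_layer(x)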
Usage with Transformers and Custom Models
HQQ is structured to integrate seamlessly with popular frameworks such as HuggingFace's Transformers. For users who need to adapt it further, custom configurations can be defined per layer to accommodate specific model architectures and performance requirements (a sketch follows the example below).
Here’s an example usage with HuggingFace:
import torch
from transformers import AutoModelForCausalLM, HqqConfig
# Specify the quantization configuration
quant_config = HqqConfig(nbits=4, group_size=64)
# Choose the checkpoint to quantize (example id; substitute your own model)
model_id = "meta-llama/Llama-2-7b-hf"
# Load and quantize the pretrained model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)
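As noted above, the configuration can also be customized per layer, for example keeping attention projections at a higher bit width than the MLP blocks. The sketch below assumes that HqqConfig accepts a dynamic_config dictionary keyed by layer-name tags; both the argument name and the layer tags are assumptions to verify against the installed transformers version:
from transformers import HqqConfig
# Assumed per-layer settings: 4-bit attention projections, 3-bit MLP projections
q4_config = {'nbits': 4, 'group_size': 64}
q3_config = {'nbits': 3, 'group_size': 32}
quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    'mlp.gate_proj': q3_config,
    'mlp.up_proj': q3_config,
    'mlp.down_proj': q3_config,
})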
Enabling Faster Inference
For accelerated inference, HQQ supports several backends that use fused kernels. After quantization, one of these external backends can be enabled to speed up processing:
from hqq.utils.patching import prepare_for_inference
# Enable a specific backend for enhanced inference performance
prepare_for_inference(model, backend="torchao_int4")
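After patching, the quantized model is used for generation as usual. Below is a minimal, hedged usage sketch; the prompt is a placeholder, model_id is the checkpoint chosen above, and the commented-out torch.compile line reflects the compatibility goal mentioned earlier rather than a required step:
import torch
from transformers import AutoTokenizer
# Optionally compile the forward pass (per the torch.compile compatibility noted above)
# model.forward = torch.compile(model.forward)
# Tokenize a prompt with the same checkpoint's tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain half-quadratic quantization in one sentence.", return_tensors="pt").to("cuda")
# Generate with the quantized, backend-patched model
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))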
Implementation and Community Support
The HQQ project is continuously updated and supported by an active community. Contributions and example scripts demonstrate its application across different technical domains. Users can install the library via pip (pip install hqq) or directly from the GitHub repository to access its full capabilities.
Conclusion
Half-Quadratic Quantization (HQQ) is an innovative and efficient tool for model compression and quantization. Its ability to work without calibration data, coupled with its fast processing speed and compatibility with advanced computational frameworks, makes it an exceptional choice for enhancing model deployment and scalability in machine learning environments. For more detailed comparisons and insights, interested individuals can visit the dedicated HQQ and HQQ+ blog posts.