LLM Compressor: Streamlining Model Deployment for Faster Inference
The LLM Compressor is a user-friendly library designed to optimize machine learning models for deployment with vLLM. It offers a comprehensive suite of quantization algorithms and supports various model formats, making the process of preparing models for inference both efficient and straightforward.
Key Features
The primary functions of the LLM Compressor revolve around quantization, which is the process of reducing the precision of the weights and activations in a model. This not only decreases the model size but also speeds up inference, which is crucial for deploying models on resource-constrained devices or environments.
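As a rough, back-of-the-envelope illustration: a 1.1B-parameter model such as TinyLlama holds roughly 2.2 GB of weights in 16-bit precision; quantizing the weights to 8 bits cuts that to about 1.1 GB, and 4-bit weights halve it again, with corresponding reductions in the memory that must be moved at inference time.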
Here are some of the standout features:
- Quantization Algorithms: The library supports multiple algorithms for both weight-only and activation quantization. This includes Simple PTQ, GPTQ, SmoothQuant, and SparseGPT, providing users with a range of tools to best suit their needs.
- Integration with Hugging Face: It seamlessly integrates with Hugging Face models and repositories, enabling users to easily access and utilize a wide array of pre-trained models.
- File Format Compatibility: LLM Compressor uses a safetensors-based file format that is compatible with vLLM, ensuring secure and efficient model handling.
- Large Model Support: Through the use of accelerate, it can manage and optimize large models, making it versatile for a variety of machine learning tasks.
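To give a concrete flavor of the pruning side, the sketch below applies the SparseGPT algorithm mentioned above to prune a model to 2:4 semi-structured sparsity. Treat it as an illustration only: the import path and the sparsity and mask_structure parameters follow the library's sparsity examples and may differ between versions, and the TinyLlama model and output directory are placeholders.

from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

# Prune half of the weights, with 2 of every 4 contiguous weights set to zero
recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-2of4-sparse",
    max_seq_length=2048,
    num_calibration_samples=512,
)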
Supported Formats and Algorithms
The library supports diverse quantization formats and algorithms:
- Activation Quantization: Formats like W8A8 using int8 and fp8 are supported.
- Mixed Precision: Formats such as W4A16 and W8A16 offer flexibility in precision balancing.
- Sparsity: Supports 2:4 semi-structured and unstructured sparsity, contributing to model efficiency.
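For the mixed-precision case, a weight-only W4A16 quantization can be expressed as a one-modifier recipe. This is a minimal sketch, assuming GPTQModifier accepts the W4A16 scheme string the same way the quick-tour recipe later uses W8A8; the model and output directory are placeholders.

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# 4-bit weights, 16-bit activations: weights are quantized, activations stay in half precision
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)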
Getting Started
Installation is straightforward with pip:
pip install llmcompressor
After installation, users can quickly get started with examples provided by the library. Examples demonstrate how to apply various quantization techniques, such as converting activations to int8 or fp8 and performing weight-only quantizations.
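As a small taste of the fp8 path, the sketch below uses the library's QuantizationModifier with an FP8_DYNAMIC scheme, which quantizes weights ahead of time and computes activation scales dynamically, so no calibration dataset is required. The scheme name, model, and output directory here are assumptions based on the library's fp8 examples.

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Dynamic fp8: weights are quantized offline, activation scales are computed at runtime
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)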
Quick Tour and Usage
With the LLM Compressor, quantization is a breeze. For instance, a user can quantize a model like TinyLlama to 8-bit weights and activations using the GPTQ and SmoothQuant algorithms.
Here's a brief walkthrough:
- Apply Quantization: Select the appropriate algorithm, configure the quantization parameters, and utilize the oneshot API to apply the changes to the selected model.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Define the quantization recipe
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply the quantization
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
- Inference with vLLM: Once quantized, the model can be effortlessly loaded into vLLM for running inference tasks.
pip install vllm
from vllm import LLM

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
output = model.generate("My name is")
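For more control over decoding, vLLM's SamplingParams can be passed to generate. A minimal sketch, assuming the quantized checkpoint produced in the previous step:

from vllm import LLM, SamplingParams

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")

# Cap generation length and add a little sampling temperature
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = model.generate(["My name is"], params)
print(outputs[0].outputs[0].text)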
Support and Contribution
The LLM Compressor welcomes community support and contributions. Users can raise issues or feature requests on GitHub, contribute to code development, or enhance documentation. This collaborative approach ensures continuous improvement and adaptation to user needs.
In conclusion, LLM Compressor provides an efficient, user-friendly solution for optimizing machine learning models, making it a vital tool for developers looking to enhance their model deployment processes.