Introducing Intel® Neural Compressor
What is Intel® Neural Compressor?
Intel® Neural Compressor is an open-source Python library for improving the efficiency of deep learning models. It applies popular model compression techniques such as quantization, pruning, distillation, and neural architecture search, and it works with major deep learning frameworks, including TensorFlow, PyTorch, and ONNX Runtime, which makes it applicable across a wide range of development environments.
Key Features
- Broad Hardware Support: The library is optimized for a diverse array of Intel hardware, such as Gaudi AI Accelerators, Core Ultra Processors, and Xeon CPUs. It also extends support to AMD CPUs, ARM CPUs, and NVIDIA GPUs through ONNX Runtime.
- Wide Model Validation: The tool has been validated with popular large language models (LLMs) such as Llama 2, Falcon, and GPT-J, and it supports more than 10,000 models from sources such as Hugging Face, Torch Vision, and the ONNX Model Zoo. Automatic, accuracy-driven quantization strategies let developers meet accuracy targets with little manual tuning (see the sketch after this list).
- Collaborative Ecosystem: Intel collaborates with major cloud platforms like Google Cloud, AWS, and Azure, with software platforms such as Alibaba Cloud and Tencent TACO, and with AI ecosystems including Hugging Face and PyTorch.
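To illustrate what accuracy-driven quantization means in practice, here is a minimal sketch using the library's 2.x-style quantization.fit API: the tuner tries quantization configurations and keeps one whose accuracy, as reported by a user-supplied evaluation function, stays within the allowed drop. The model, calibration data, and evaluation function below are placeholders, and module paths can differ between releases.

import torch
import torchvision.models as models
from neural_compressor import PostTrainingQuantConfig, quantization

# Float model to be quantized (any torchvision model works as a stand-in).
float_model = models.resnet18()

# Placeholder calibration data; replace with a DataLoader over a real dataset.
calib_loader = torch.utils.data.DataLoader(
    [(torch.randn(3, 224, 224), 0) for _ in range(8)], batch_size=4
)

def eval_func(model):
    # Return the metric the tuner should preserve; a real evaluation
    # over a validation set belongs here.
    return 1.0

# fit() explores quantization configurations and keeps one whose accuracy,
# as measured by eval_func, stays within the configured tolerance.
q_model = quantization.fit(
    float_model,
    conf=PostTrainingQuantConfig(),
    calib_dataloader=calib_loader,
    eval_func=eval_func,
)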
Recent Updates
- A new Transformers-like API, introduced in October 2024, supports INT4 inference on Intel CPUs and GPUs (see the sketch after this list).
- Performance optimizations and usability improvements were added mid-2024, making the tool even more efficient.
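As a rough illustration of that Transformers-like API, loading a model with 4-bit weight-only quantization looks something like the following. The checkpoint name is an arbitrary example, and class names such as RtnConfig may vary by release.

from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"  # placeholder checkpoint; any causal LM works

# RTN (round-to-nearest) weight-only quantization to INT4.
woq_config = RtnConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

# The quantized model is used like any Hugging Face transformers model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time", return_tensors="pt")
print(tokenizer.decode(q_model.generate(**inputs, max_new_tokens=20)[0]))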
Installation Guide
You can install Neural Compressor with pip; the [pt] extra installs the PyTorch dependencies and [tf] the TensorFlow dependencies:
pip install neural-compressor[pt]
pip install neural-compressor[tf]
For detailed setup instructions, particularly around hardware configuration, refer to the comprehensive installation guide.
Getting Started
After setting up the environment, users are encouraged to explore quantization techniques like FP8 Quantization tailored for the Intel Gaudi2 AI Accelerator. The sample code below, adapted from the documentation, illustrates the flow; Intel provides Gaudi2 Docker images that make the setup straightforward.
import torch
import torchvision.models as models
from neural_compressor.torch.quantization import FP8Config, convert, prepare

model = models.resnet18()

# Configure FP8 quantization with the E4M3 format and insert observers.
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)

# Dummy calibration pass on the Gaudi (HPU) device; replace the random
# tensor with representative samples from your dataset.
model(torch.randn(1, 3, 224, 224).to("hpu"))

# Convert the calibrated model to FP8.
model = convert(model)
Detailed Documentation and Resources
The documentation offers in-depth insights into the architecture, workflows, and available APIs. The library also provides PyTorch and TensorFlow extension APIs that support static and dynamic quantization, among other techniques.
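For example, post-training static quantization through the TensorFlow extension API is a one-call flow in the 3.x getting-started examples. This is a minimal sketch with a toy Keras model and random calibration data standing in for real inputs; the exact calibration input types accepted may vary by release.

import tensorflow as tf
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model

# Toy Keras model standing in for a real network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

# Random calibration batches; replace with representative data.
calib_data = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([32, 28, 28, 1]), tf.zeros([32], dtype=tf.int32))
).batch(8)

# Post-training static quantization with default settings.
qmodel = quantize_model(model, StaticQuantConfig(), calib_data)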
Publications and Community Engagement
Intel actively participates in the academic community, with publications and events such as presentations at EMNLP 2024. Developers and researchers can engage through GitHub issues or email, or join discussions on Discord and WeChat.
Conclusion
Intel® Neural Compressor stands out as a powerful tool for model compression, making AI models more efficient without sacrificing accuracy. Its broad hardware and framework support, along with collaborations across major platforms, makes it a versatile choice for developers aiming to optimize AI workloads.