Project Introduction: Stable Fast
What is Stable Fast?
Stable Fast is an ultra-lightweight inference optimization framework designed specifically for HuggingFace Diffusers on NVIDIA GPUs. It aims for state-of-the-art inference performance on every type of diffuser model, including recent additions such as the StableVideoDiffusionPipeline. Unlike heavier tools such as TensorRT, which often take significant time to compile a model, Stable Fast compiles in just a few seconds.
Key capabilities of Stable Fast include support for dynamic shapes, Low-Rank Adaptation (LoRA), and ControlNet, which add flexibility and usability. These features push the performance limits of diffusion models without sacrificing ease of use or speed.
Differences from Other Acceleration Libraries
- Speed: Stable Fast outperforms tools like torch.compile, TensorRT, and AITemplate, especially during initial compilation.
- Minimalism: It works like a plug-in on top of existing PyTorch functionality, making it compatible with other acceleration techniques and deployment solutions.
- Compatibility: It integrates seamlessly with all versions of HuggingFace Diffusers and PyTorch. Uniquely among these tools, Stable Fast supports ControlNet and LoRA and is ready to optimize the latest StableVideoDiffusionPipeline.
Installation
Stable Fast is primarily tested on Linux and Windows Subsystem for Linux (WSL2). PyTorch with CUDA support is required; versions 1.12 through 2.1 are recommended for known compatibility. Users can either install prebuilt wheels from the release page or build from source after installing dependencies such as cuDNN/cuBLAS, Triton, and others.
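Since only a range of PyTorch versions is known to be compatible, a setup script might want to guard against unsupported versions. A minimal stdlib-only sketch (the helper name is hypothetical, not part of Stable Fast):

```python
def torch_version_in_range(version: str, low=(1, 12), high=(2, 1)) -> bool:
    """Check a torch version string against the tested 1.12-2.1 range (sketch)."""
    # Strip local build suffixes like "+cu121" and compare (major, minor).
    major, minor = (int(p) for p in version.split("+")[0].split(".")[:2])
    return low <= (major, minor) <= high
```

In practice you would pass in `torch.__version__` and warn or abort when the check fails.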
Usage
Stable Fast offers diverse applications:
- Optimize StableDiffusionPipeline: Direct optimization enhances efficiency and performance, including for StableDiffusionXLPipeline.
- Enhance LCM Pipeline: It handles the latest latent consistency model (LCM) pipelines, delivering notable speed improvements.
- Improve StableVideoDiffusionPipeline: Pipeline processing sees over a twofold speed increase.
- Dynamic LoRA Switching: With careful handling, LoRA parameters can be switched at runtime without losing optimization benefits.
- Model Quantization: Using extended PyTorch quantization functionality, users can reduce VRAM usage, which is critical for resource-heavy diffuser workloads.
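For the direct pipeline optimization mentioned above, usage typically looks like the sketch below. The `sfast` module path and the specific `CompilationConfig` flags are assumptions that have varied across releases, so check the project's README for the current API:

```python
def optimize_pipeline(pipe):
    """Compile a loaded Diffusers pipeline with Stable Fast (illustrative sketch)."""
    # Assumed module path and flags; these have differed between releases.
    from sfast.compilers.diffusion_pipeline_compiler import (
        compile as sfast_compile,
        CompilationConfig,
    )

    config = CompilationConfig.Default()
    config.enable_xformers = True    # requires xformers to be installed
    config.enable_triton = True      # requires triton to be installed
    config.enable_cuda_graph = True  # capture CUDA graphs for static shapes
    return sfast_compile(pipe, config)
```

The returned pipeline is then called exactly like the original one, so existing generation code does not need to change.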
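The constraint behind dynamic LoRA switching can be illustrated without any GPU code: a compiled graph keeps references to the original parameter buffers, so new weights must be written into the existing buffers rather than rebinding names. A pure-Python analogy (the names here are hypothetical, not Stable Fast's API):

```python
def switch_lora_inplace(params, new_weights):
    """Write new weights into existing buffers (hypothetical illustration)."""
    for name, new in new_weights.items():
        buf = params[name]
        buf[:] = new  # in-place slice assignment keeps the same object alive

# A compiled graph would hold references to these buffers.
params = {"unet.to_q.lora": [0.1, 0.2, 0.3]}
before = id(params["unet.to_q.lora"])

switch_lora_inplace(params, {"unet.to_q.lora": [0.5, 0.6, 0.7]})

# Same buffer object, new contents: the graph's references stay valid.
same_object = id(params["unet.to_q.lora"]) == before
```

With real tensors the equivalent move is an in-place `copy_`, which is why the switching "requires careful execution".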
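The VRAM saving from quantization comes from storing weights in fewer bytes. A toy symmetric int8 quantizer (an illustration of the idea, not Stable Fast's implementation) shows the roughly 4x reduction from 4-byte fp32 to 1-byte int8 storage:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (illustrative sketch)."""
    scale = max(abs(v) for v in values) / 127.0
    scale = scale or 1.0  # avoid division by zero for an all-zero tensor
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]

# int8 storage is 1 byte per weight vs. 4 for fp32: ~4x memory reduction.
q, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize_int8(q, scale)
```

Real quantization backends also fuse the dequantization into the compute kernels, which is where the extended PyTorch functionality comes in.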
Performance Comparison
In benchmarks, Stable Fast handles models like SD 1.5 with strong timing results compared to other tools, balancing speed and versatility through techniques such as CUDNN convolution fusion, low-precision fused GEMM, and others.
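Numbers like these are typically gathered with a simple wall-clock harness that discards a few warmup runs, which matters here because the first run includes compilation. A minimal sketch, not the project's benchmark script:

```python
import time

def bench(fn, warmup=3, iters=10):
    """Average wall-clock time per call of fn, after warmup runs (sketch)."""
    for _ in range(warmup):
        fn()  # warmup: absorbs one-time costs such as compilation
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

avg = bench(lambda: sum(range(1000)))
```

For GPU pipelines you would additionally synchronize the device before reading the clock, since CUDA kernels launch asynchronously.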
Conclusion
Stable Fast aims to remain a leading inference optimization framework for diffusers, with plans to extend its capabilities toward large language models (LLMs) and further efficiency improvements across model deployment and performance. Future releases are expected to be even more stable and user-friendly, reinforcing its value as a go-to solution for developers in the domain.