Introduction to the Distilled Stable Diffusion Project
The Distilled Stable Diffusion project focuses on delivering a more efficient, faster, and smaller version of the popular Stable Diffusion model. This initiative is built upon the principles of knowledge distillation, following the methodologies outlined in the BK-SDM framework. By using distilled versions of the model, users can generate images with quality comparable to the original Stable Diffusion while benefiting from enhanced speed and reduced model size.
Key Components of the Project
Data Acquisition and Training Scripts
- Data Preparation: The data.py script downloads the datasets required for training the models (a minimal sketch of this step follows the list).
- Model Training: The distill_training.py script trains the U-net component of the model using the methods described in the BK-SDM paper. It exposes settings for batch size, model type (sd_small or sd_tiny), and other hyperparameters, supports LoRA training through the popular Hugging Face diffusers library, and can resume training from checkpoints.
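The exact dataset pulled by data.py is not specified here, so the snippet below is only a minimal sketch of the data-preparation step; it assumes an image-caption dataset hosted on the Hugging Face Hub, and the dataset name is a placeholder rather than the project's actual training data.

```python
# Minimal sketch of the data-preparation step. The dataset name is a
# placeholder assumption; data.py downloads the project's actual training data.
from datasets import load_dataset

# Any image-caption dataset in the usual text-to-image layout works here.
dataset = load_dataset("lambdalabs/naruto-blip-captions", split="train")

sample = dataset[0]
print(sample["text"])        # caption used as the conditioning prompt
print(sample["image"].size)  # PIL image later encoded into the latent space
```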
Training Details
The project uses a teacher-student training paradigm, in which a smaller student model learns from a larger, pre-trained teacher model. Specifically, the U-net architecture from SG161222/Realistic_Vision_V4.0 serves as the teacher, guiding the student model to mimic its outputs. The primary training objective is a combined loss with three terms (sketched in code after the list):
- The mean squared error (MSE) between the noise predictions of the teacher and student models (output-level distillation).
- The MSE between the actual noise and the student's predicted noise (the standard denoising task loss).
- Feature-level losses between corresponding intermediate layers of the two networks.
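The snippet below is a minimal PyTorch sketch of that combined objective, not the project's actual training code: the way intermediate features are collected (e.g., via forward hooks) and the tensor names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, true_noise,
                      student_feats, teacher_feats,
                      w_output=0.5, w_feature=0.5):
    # Denoising task loss: student prediction vs. the noise actually added.
    task_loss = F.mse_loss(student_pred, true_noise)

    # Output-level distillation: student prediction vs. teacher prediction.
    output_kd = F.mse_loss(student_pred, teacher_pred)

    # Feature-level distillation across corresponding blocks; features are
    # assumed to be collected (e.g., with forward hooks) and shape-aligned.
    feature_kd = sum(F.mse_loss(s, t)
                     for s, t in zip(student_feats, teacher_feats))

    return task_loss + w_output * output_kd + w_feature * feature_kd
```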
Training Hyperparameters
Model training uses the following settings (a minimal optimizer and scheduler sketch follows the list):
- Learning rate: 1e-5
- Scheduler Type: Cosine
- Batch Size: 32
- Output and Feature Weights: 0.5 each
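As a rough illustration, these settings map onto an optimizer plus a cosine learning-rate schedule. The optimizer choice (AdamW), the warmup and total-step counts, and the stand-in module below are assumptions, not values taken from the project.

```python
import torch
from diffusers.optimization import get_scheduler

# Stand-in for the sd_small / sd_tiny U-net being trained; in the real script
# this would be the student UNet2DConditionModel.
student_unet = torch.nn.Linear(4, 4)

optimizer = torch.optim.AdamW(student_unet.parameters(), lr=1e-5)  # learning rate from above
lr_scheduler = get_scheduler(
    "cosine",                      # scheduler type from above
    optimizer=optimizer,
    num_warmup_steps=500,          # placeholder, not taken from the project
    num_training_steps=100_000,    # placeholder, not taken from the project
)

train_batch_size = 32
output_weight = feature_weight = 0.5  # weights on the output- and feature-level losses
```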
Model Parameters and Variants
- Normal Stable Diffusion U-net: Contains 859,520,964 parameters.
- SD_Small U-net: Reduced to 579,384,964 parameters.
- SD_Tiny U-net: Further compacted to 323,384,964 parameters.
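These counts can be checked by loading a U-net and summing its parameter tensors. The sketch below uses the distilled checkpoint already named in this document and assumes it follows the standard diffusers repository layout with a "unet" subfolder.

```python
from diffusers import UNet2DConditionModel

# Assumes the checkpoint follows the standard diffusers layout ("unet" subfolder).
unet = UNet2DConditionModel.from_pretrained("segmind/small-sd", subfolder="unet")

num_params = sum(p.numel() for p in unet.parameters())
print(f"U-net parameters: {num_params:,}")
```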
Usage and Application
A typical usage scenario sets up a text-to-image generation pipeline: the user selects a pre-trained distilled model (e.g., "segmind/small-sd"), moves it to a CUDA device for accelerated inference, and passes in a prompt to generate an image, as in the sketch below.
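A minimal inference sketch, assuming a CUDA-capable GPU and the segmind/small-sd checkpoint mentioned above; the prompt and sampler settings are illustrative.

```python
import torch
from diffusers import DiffusionPipeline

# Load the distilled model and move it to the GPU in half precision.
pipe = DiffusionPipeline.from_pretrained(
    "segmind/small-sd", torch_dtype=torch.float16
).to("cuda")

prompt = "a portrait photo of an astronaut riding a horse"  # illustrative prompt
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("distilled_sd_output.png")
```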
Training the Model
The project supports training from checkpoints, using the distilled versions of Stable Diffusion for text-to-image tasks. Users can choose between the distillation levels (sd_small or sd_tiny) and adapt the training scripts to their own datasets, following the same conventions as the Hugging Face diffusers training scripts; a brief fine-tuning sketch follows.
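The sketch below shows one way to prepare a distilled checkpoint for LoRA fine-tuning with the peft library; the rank, target module names, and other values are illustrative assumptions, not the project's prescribed configuration.

```python
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

# Load the distilled pipeline as the base model for fine-tuning.
pipe = StableDiffusionPipeline.from_pretrained("segmind/small-sd")

# Attach LoRA adapters to the attention projections of the distilled U-net;
# rank and target modules are illustrative assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)  # only the LoRA weights will be trained
```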
Available Models and Resources
Pre-trained versions are hosted on Hugging Face's platform for ease of access:
- A "sd-small" model: Huggingface repo
- A "sd-tiny" model: Huggingface repo
- A portrait fine-tuned "sd-tiny" model: Huggingface repo
Advantages and Limitations
Advantages:
- Speed: Up to 100% faster inference times.
- Memory Efficiency: 30% lower VRAM usage.
- Training: Enhanced speed for DreamBooth and LoRA training.
Limitations:
- Produced images may lack production-level quality in some instances.
- The distilled models are better suited for fine-tuning on specific styles or concepts rather than general use.
- Limited capability in handling complex compositions or multiple concepts.
Future Directions
Looking ahead, the project aims to:
- Develop SDXL distilled models.
- Refine base models for improved composability.
- Integrate Flash Attention-2 for speed improvements.
- Explore TensorRT and AITemplate for enhanced acceleration.
- Implement Quantization-Aware-Training (QAT) in the distillation process.
Conclusion
The Distilled Stable Diffusion project exemplifies the advancements in AI model efficiency through knowledge distillation. By significantly reducing model sizes and increasing processing speeds, it opens avenues for more accessible and scalable text-to-image generation solutions.
Acknowledgments
The project team extends gratitude to Nota AI for their foundational research, which significantly contributed to model compression innovations.