Introduction to k-diffusion
The k-diffusion project provides an implementation of diffusion-based generative models for PyTorch, based on the paper "Elucidating the Design Space of Diffusion-Based Generative Models" (Karras et al., 2022). The project goes beyond the paper with enhancements such as improved sampling algorithms and transformer-based diffusion models.
Hourglass Diffusion Transformer
One of the standout features of k-diffusion is the image_transformer_v2 model type, inspired by innovations in models such as the Hourglass Transformer and DiT. This model type combines sparse (neighborhood) attention at the lower levels of its hierarchy with global attention at the higher levels.
Requirements
To utilize this model effectively, users may need to install custom CUDA kernels for advanced attention mechanisms:
- NATTEN: employed for the sparse (neighborhood) attention at the lower levels of the model hierarchy. An alternative model variant uses shifted window attention, which needs no custom kernels but is generally slower and less efficient.
- FlashAttention-2: utilized for global attention, though the model falls back to plain PyTorch attention if it is not installed.
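As a quick sanity check before training, one can probe for these optional dependencies; a minimal sketch, assuming only that the two packages expose the module names natten and flash_attn:

```python
# Probe for the optional attention kernels; k-diffusion uses them when
# present and falls back as described above when they are missing.
try:
    import natten  # noqa: F401
    print("NATTEN found: neighborhood attention kernels available")
except ImportError:
    print("NATTEN missing: use the shifted-window-attention model variant")

try:
    import flash_attn  # noqa: F401
    print("FlashAttention-2 found: used for global attention")
except ImportError:
    print("flash-attn missing: global attention falls back to plain PyTorch")
```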
Furthermore, a PyTorch version that supports torch.compile() is recommended for optimal performance; without it, the model falls back to eager mode, which works but is slower and uses more memory during training.
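The guard itself is a one-liner; a minimal sketch, with a toy module standing in for a real diffusion model:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for a real diffusion transformer

# Compile when the running PyTorch supports it (2.0+); otherwise stay in
# eager mode, which works but is slower and uses more memory in training.
if hasattr(torch, "compile"):
    model = torch.compile(model)
```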
Usage
For those interested in exploring this model, a good starting demonstration is training on the Oxford Flowers dataset. With the necessary Python packages (such as Hugging Face Datasets) installed, users can run the provided training script directly, adjusting parameters such as batch size to fit memory constraints and GPU capabilities. Configuration files are critical, as they dictate the model's structure: the patch size, the depth and width at each level, and the type of attention mechanism deployed at each hierarchical level.
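As an illustration, the model portion of such a configuration might look like the sketch below. The key names mirror the JSON configs shipped with the repository but are reproduced here as assumptions; consult the repository's configs for the authoritative schema.

```python
# Illustrative image_transformer_v2 model config (keys and values are
# assumptions modeled on the repository's JSON configs, not a verbatim copy).
model_config = {
    "type": "image_transformer_v2",
    "input_size": [256, 256],       # resolution of the training images
    "patch_size": [4, 4],           # pixels per input patch
    "depths": [2, 2, 4],            # transformer blocks per hierarchy level
    "widths": [128, 256, 512],      # embedding width per hierarchy level
    "self_attns": [                 # attention type at each level
        {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
        {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
        {"type": "global", "d_head": 64},
    ],
}
```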
Inference
Detailed instructions on using the model for inference are forthcoming.
Installation & Training
Installing k-diffusion via PyPI (pip install k-diffusion) provides the library code only; the training and inference scripts require cloning the repository. Training a model involves specifying a configuration for the dataset, with current support for datasets such as CIFAR-10, MNIST, and others available through Hugging Face. The framework also supports multi-GPU and multi-node training via Hugging Face Accelerate.
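The repository's training script handles the Accelerate integration itself; the sketch below shows only the generic Accelerate pattern it builds on, with a toy model and dataset standing in for the real thing.

```python
import torch
from accelerate import Accelerator

# Generic Hugging Face Accelerate training pattern (illustrative only;
# this is not k-diffusion's actual training loop).
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

accelerator = Accelerator()  # detects the available devices/processes
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

Launched with accelerate launch, a script written this way runs unchanged across one GPU, several GPUs, or several nodes.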
Enhancements and Features
k-diffusion extends beyond basic functionality with several enhancements:
- Support for hierarchical transformer models.
- A soft version of min-SNR loss weighting, which improves training at high resolutions.
- Wrappers for external diffusion model codebases (such as OpenAI and CompVis models), so k-diffusion's samplers and other features can be used with them.
- Implementations of DPM-Solver samplers, which reach high sample quality in fewer model evaluations (see the sampling sketch after this list).
- CLIP-guided sampling from unconditional models, as well as log-likelihood calculation.
- On-the-fly calculation of metrics like FID and KID during training, providing immediate feedback on model performance.
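To give a feel for the sampling API, here is a minimal sketch combining the Karras et al. (2022) noise schedule with the DPM-Solver++(2M) sampler from k_diffusion.sampling. The dummy denoiser stands in for a trained model, and the step count and sigma range are illustrative rather than tuned values.

```python
import torch
import k_diffusion as K

class DummyDenoiser(torch.nn.Module):
    """Stand-in for a trained denoiser mapping (x, sigma) -> denoised x."""
    def forward(self, x, sigma):
        return torch.zeros_like(x)  # a real model predicts the clean image

denoiser = DummyDenoiser()

# Karras noise schedule: 50 steps from sigma_max down toward zero.
sigmas = K.sampling.get_sigmas_karras(50, sigma_min=0.01, sigma_max=80.0)

# Start from pure noise at the highest sigma, then integrate with
# DPM-Solver++(2M).
x = torch.randn(1, 3, 64, 64) * sigmas[0]
samples = K.sampling.sample_dpmpp_2m(denoiser, x, sigmas)
```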
The k-diffusion project is a powerful toolkit for researchers and developers interested in leveraging cutting-edge diffusion models to produce high-quality generative results, and it continues to evolve, with features such as latent diffusion on the roadmap.