Exploring DiT: Scalable Diffusion Models with Transformers
Fast-DiT is an improved PyTorch implementation of scalable diffusion models with transformers, aimed at faster, more memory-efficient training and sampling for image generation. It builds on and enhances the official implementation of the paper "Scalable Diffusion Models with Transformers."
Project Overview
Fast-DiT offers an upgraded implementation of DiT (Diffusion Transformers) built on top of the original codebase. It ships with pre-trained class-conditional DiT models trained on ImageNet at 512x512 and 256x256 resolution, which can also be tried directly through Hugging Face Spaces and Google Colab.
Setting Up the Environment
To get started, clone the repository and set up the environment with Conda. Users who only intend to run pre-trained models locally on a CPU can omit the GPU-specific dependencies from the environment file.
git clone https://github.com/chuanyangjin/fast-DiT.git
cd fast-DiT
conda env create -f environment.yml
conda activate DiT
Sampling from DiT Models
Fast-DiT provides pre-trained checkpoints, allowing users to sample images with a simple script. Depending on the desired resolution, users can switch between the 256x256 and 512x512 models through a command-line flag:
python sample.py --image-size 512 --seed 1
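The upstream DiT sampling script also exposes flags for the sampling process itself, such as --cfg-scale for classifier-free guidance strength and --num-sampling-steps for the number of DDPM steps; assuming these carry over to fast-DiT, a tuned run might look like:
python sample.py --image-size 256 --seed 1 --cfg-scale 4.0 --num-sampling-steps 250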
Additionally, the project supports custom checkpoints: users can sample from their own trained models by passing the appropriate arguments, as shown below. This flexibility lets researchers and developers explore a wide range of image generation setups.
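For example, the upstream DiT sampling script accepts a --ckpt argument pointing at a user-trained checkpoint; assuming fast-DiT mirrors this interface, such a run would look like:
python sample.py --model DiT-XL/2 --image-size 256 --ckpt /path/to/model.pt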
Training the DiT Models
Fast-DiT includes a training script for class-conditional DiT models, which can be adapted to other kinds of conditioning. Training proceeds in two stages: VAE features are first pre-extracted from the dataset (the command below), and the model is then trained on those cached features. Distributed training across multiple GPUs is also supported.
torchrun --nnodes=1 --nproc_per_node=1 extract_features.py --model DiT-XL/2 --data-path /path/to/imagenet/train --features-path /path/to/store/features
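Once features are extracted, training is launched with Accelerate in mixed precision, along these lines (the exact flag names, such as --feature-path, are taken from the upstream repository and should be checked against the current scripts):
accelerate launch --mixed_precision fp16 train.py --model DiT-XL/2 --feature-path /path/to/store/features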
The project documentation details setup procedures and training scripts, ensuring that users can readily engage in image generation experiments on large-scale datasets.
Evaluation of Model Performance
Fast-DiT includes an evaluation workflow built around a parallel sampling script, sample_ddp.py, which generates a large batch of images and saves them as a .npz file compatible with ADM's TensorFlow evaluation suite for computing FID (Fréchet Inception Distance), Inception Score, and related metrics. These metrics are essential for assessing the quality and diversity of generated images against established benchmarks. For example, to sample 50,000 images across N GPUs:
torchrun --nnodes=1 --nproc_per_node=N sample_ddp.py --model DiT-XL/2 --num-fid-samples 50000
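The resulting .npz file can then be scored with the evaluator from the openai/guided-diffusion repository; a typical invocation looks like the following, where the reference-batch filename is illustrative:
python evaluator.py VIRTUAL_imagenet512.npz samples.npz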
Advancements in Training Efficiency
Compared to the original DiT implementation, fast-DiT accelerates training and reduces memory demands by combining mixed-precision (fp16) training with pre-extracted features: rather than re-encoding every image through the VAE at each training step, features are computed once and reused throughout training, yielding significant savings in both computation time and GPU memory.
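To illustrate the mixed-precision half of this recipe in isolation, here is a minimal PyTorch sketch using torch.cuda.amp; it is a generic example of the pattern, not code from the fast-DiT repository, and the stand-in model and data are hypothetical:
import torch

model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for a DiT block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid fp16 underflow

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")   # stand-in for pre-extracted features
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward pass runs in fp16 where safe
        loss = model(x).pow(2).mean()         # dummy loss for illustration
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then steps
    scaler.update()                           # adjusts the scale factor for the next step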
Conclusion
The Fast-DiT project is a powerful tool for those looking to harness the potential of diffusion models and transformers in image generation. Its streamlined setup, advanced training capabilities, and thorough evaluation frameworks make it a valuable asset for both academic research and practical applications in artificial intelligence and computer vision. For further information and experimentation, users are encouraged to explore the project's resources and contribute to its ongoing development.