ECT: Unlocking Consistency in Generative Models
The Easy Consistency Tuning (ECT) project delivers state-of-the-art few-step generative capabilities through a streamlined and principled approach to training consistency models. It cuts tuning costs significantly while achieving strong early results that continue to improve with additional training compute and larger model sizes.
Introduction
The ECT project is structured into multiple branches, each providing a minimal implementation tailored for specific training purposes. Here's a brief look at what each branch supports:
- Main Branch: Focuses on ECT with the CIFAR-10 dataset, useful for understanding Consistency Models (CMs) and quick prototyping.
- AMP Branch: Supports mixed-precision training with GradScaler on CIFAR-10 for greater computational efficiency.
- ImageNet Branch: Implements ECT on the ImageNet dataset with a resolution of 64x64.
Recent Updates
ECT continues to evolve with new updates:
- October 12, 2024: ECT code for ImageNet 64x64 has been added.
- September 23, 2024: Introduced GradScaler support for mixed-precision training.
- April 27, 2024: Transitioned to PyTorch 2.3.0.
- April 12, 2024: ECMs outperform state-of-the-art GANs in a single model step and leading diffusion models in two steps on CIFAR-10.
Environment Setup
Setting up the Python environment for ECT is straightforward: a single conda command installs PyTorch 2.3.0 and Python 3.9.18, leaving the environment ready for experimentation and development.
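As a quick sanity check after installation, a short snippet (not part of the repository) can confirm the interpreter and PyTorch versions:

```python
# Illustrative environment check; not an official ECT script.
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")     # expected: 3.9.18
print(f"PyTorch : {torch.__version__}")          # expected: 2.3.0
print(f"CUDA ok : {torch.cuda.is_available()}")
```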
Datasets and Training
Datasets should be prepared in the EDM format, and there are scripts available to assist with training:
- Commands are provided for fine-tuning the state-of-the-art two-step ECM, matching Consistency Distillation (CD) in about one hour on a single A100 GPU.
- Full training with a batch size of 128 over 200k iterations can also be run, with 2 to 4 GPUs recommended (a minimal sketch of the tuning update follows this list).
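For orientation, below is a minimal, hypothetical sketch of a single consistency-tuning update, assuming a `model(x, sigma)` denoiser interface in the EDM style; the actual ECT code uses its own noise schedules, loss weighting, and parameterization.

```python
# Minimal sketch of one ECT-style training step (illustrative only).
# Assumes `model(x, sigma)` returns a denoised estimate of the clean image.
import torch

def consistency_tuning_step(model, x0, sigma_t, sigma_r, optimizer):
    """Pull the prediction at noise level sigma_t toward the model's own
    stop-gradient prediction at the smaller noise level sigma_r < sigma_t,
    using the same noise sample for both points on the trajectory."""
    eps = torch.randn_like(x0)
    x_t = x0 + sigma_t * eps       # noisier point on the trajectory
    x_r = x0 + sigma_r * eps       # less-noisy point on the same trajectory

    with torch.no_grad():
        target = model(x_r, sigma_r)   # stop-gradient "teacher" prediction

    pred = model(x_t, sigma_t)
    # Squared error as a stand-in for the paper's actual distance and weighting.
    loss = torch.mean((pred - target) ** 2)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.detach()
```

How the gap between the two noise levels is scheduled over training is a central design choice in ECT; the sketch above omits that schedule entirely.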
Half Precision Training
The AMP GradScaler enables fp16 training while keeping it numerically stable, and it can be switched on through simple script adjustments, further streamlining the training process.
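The usual PyTorch pattern looks roughly like the following; `compute_ect_loss` is a hypothetical stand-in for the repository's actual loss computation.

```python
# Standard PyTorch AMP pattern with GradScaler (illustrative sketch).
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in fp16 where it is safe to do so.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_ect_loss(model, batch)   # hypothetical loss helper
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)          # unscales gradients, then calls optimizer.step()
    scaler.update()                 # adapt the scale factor for the next step
    return loss.detach()
```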
Evaluation
ECT provides commands to evaluate the Fréchet Inception Distance (FID) of pretrained checkpoints, ensuring the model's generative capabilities meet high standards.
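As a rough illustration of what such an evaluation computes, here is a generic FID calculation using the `torchmetrics` package; the repository's reported FID numbers come from its own evaluation scripts, not from this snippet.

```python
# Generic FID computation with torchmetrics (illustrative; not the ECT pipeline).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images, fake_images, batch_size=64):
    """real_images / fake_images: uint8 tensors of shape [N, 3, H, W] in [0, 255]."""
    fid = FrechetInceptionDistance(feature=2048)
    for i in range(0, real_images.size(0), batch_size):
        fid.update(real_images[i:i + batch_size], real=True)
    for i in range(0, fake_images.size(0), batch_size):
        fid.update(fake_images[i:i + batch_size], real=False)
    return fid.compute().item()
```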
Generative Performance
ECT models are benchmarked against existing SoTA generative models on the CIFAR-10 dataset, where they surpass several popular diffusion and consistency models, delivering better image quality with fewer sampling steps.
Advanced Evaluation Metrics
Using the DINOv2 representation model, ECT also evaluates image fidelity via the Fréchet Distance in DINOv2 latent space, a measure that aligns more closely with human judgments of visual quality. On this metric, ECT remains competitive with other SoTA methods while generating high-quality images faster.
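Conceptually, $\mathrm{FD}_{\text{DINOv2}}$ replaces the Inception features used in FID with DINOv2 features; a sketch under assumed model and preprocessing choices might look like this:

```python
# Sketch of a Fréchet distance in DINOv2 feature space (FD_DINOv2).
# The backbone variant and preprocessing are assumptions; ECT's reported
# numbers come from its own evaluation code.
import numpy as np
import torch
from scipy import linalg

# Load a DINOv2 backbone via torch.hub (assumed variant).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

@torch.no_grad()
def dino_features(images):
    # images: float tensor [N, 3, H, W], already resized and normalized for DINOv2.
    return dino(images).cpu().numpy()

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to the two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```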
Checkpoints and Additional Resources
For those interested in exploring the performance firsthand, ECT provides pretrained checkpoints, such as the CIFAR-10 $\mathrm{FD}_{\text{DINOv2}}$ checkpoint, giving researchers and developers a ready starting point.
Community and Contact
ECT encourages community interaction, offering contact points including an email address and social media handles for support, questions, or collaboration. This openness invites broader engagement with the project and its future development.
Citation and Credits
The project is open for academic use, and appropriate references are provided for citation, emphasizing the collaborative and open-source nature of ECT's development effort.