Overview of the Würstchen Project
Würstchen is a novel approach to training text-conditional models for image generation. Such models are typically computationally intensive, and most latent approaches compress images only once before modeling them. Würstchen stands out by introducing an additional compression stage, enabling a significant reduction in computational requirements without compromising the quality of image reconstruction.
How Würstchen Works
Würstchen employs a multi-stage architecture with three key stages:
- Stages A & B: These stages handle image compression, encoding images into, and decoding them from, a highly compressed latent representation.
- Stage C: Würstchen's innovation shines at this stage, where a text-conditional model is learned in the newly formed, low-dimensional latent space. The result is an impressive 42-fold compression factor (see the sketch below), which allows Stage C to be trained quickly and cheaply while maintaining high image fidelity.
This design dramatically improves training efficiency, making the process both faster and less resource-demanding than other methods.
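To make the compression concrete, here is a back-of-the-envelope sketch (illustrative arithmetic only; the exact dimensions are simplified assumptions, not the model's internals):

# Rough spatial size of the Stage C latent under ~42x per-side compression
image_size = 1024                    # input resolution in pixels per side
compression = 42                     # Würstchen's reported compression factor
latent_size = round(image_size / compression)
print(latent_size)                   # -> 24, i.e. roughly a 24x24 latent grid

# Compare with a typical single-stage 8x compression (e.g. standard latent diffusion)
print(image_size // 8)               # -> 128, i.e. a 128x128 latent grid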
Get Started with Würstchen
There are several ways to get started with Würstchen:
- Reconstruction and Generation: Notebooks are provided for hands-on use: a Stage B notebook for image reconstruction and a Stage C notebook for text-conditional generation.
- Google Colab: Users can also explore text-to-image generation via a dedicated Google Colab link.
Integrating with Diffusers
Würstchen is fully integrated with the diffusers library from Hugging Face. This integration makes it easy to use the model for text-to-image generation:
# Install the necessary packages first:
# pip install -U transformers accelerate diffusers

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

# Load the full Würstchen pipeline in half precision on the GPU
pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,  # default denoising schedule for Stage C (the prior)
    prior_guidance_scale=4.0,                   # classifier-free guidance strength for the prior
    num_images_per_prompt=2,
).images
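The pipeline returns standard PIL images, so the results can be saved directly (hypothetical filenames, shown only for illustration):

# Save each generated image to disk
for i, image in enumerate(images):
    image.save(f"firefighter_cat_{i}.png")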
For additional details, the official documentation is an excellent resource.
Training Your Own Würstchen Model
Würstchen trains more efficiently than other text-to-image frameworks because Stage C operates on a much smaller 12x12 latent space. Training scripts are available for both Stage B and Stage C.
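To see why the small latent matters, note that self-attention cost grows quadratically with the number of latent positions. A rough comparison (illustrative arithmetic; the 64x64 figure assumes a typical 8x-compressed latent diffusion model at 512x512 resolution):

# Latent positions: conventional latent diffusion vs. Würstchen Stage C
baseline_positions = 64 * 64          # 512x512 image at 8x compression -> 4096 positions
wuerstchen_positions = 12 * 12        # Würstchen's 12x12 Stage C latent -> 144 positions
ratio = baseline_positions / wuerstchen_positions
print(ratio)                          # ~28x fewer positions
print(ratio ** 2)                     # ~810x cheaper self-attention, all else being equal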
Accessing Würstchen Models
Users can download pre-trained Würstchen models with varying specifications:
- Würstchen v1: Available on Hugging Face, this version uses CLIP-H text conditioning, was trained for 800,000 steps, and generates at a resolution of 512x512.
- Würstchen v2: Also on Hugging Face, it uses CLIP-bigG text conditioning, was trained for 918,000 steps, and supports a resolution of 1024x1024.
Acknowledgments and Citation
The Würstchen project thanks Stability AI for their computational support. For researchers and developers inspired by or utilizing Würstchen, the project provides a suggested citation:
@inproceedings{pernias2024wrstchen,
  title={W\"urstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
  author={Pablo Pernias and Dominic Rampas and Mats Leon Richter and Christopher Pal and Marc Aubreville},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=gU58d5QeGv}
}
In summary, Würstchen offers a powerful, efficient, and cost-effective framework for text-conditional image generation, setting a new benchmark for speed and resource efficiency.