Introduction to Custom Diffusion
Custom Diffusion is an approach for efficiently fine-tuning text-to-image diffusion models such as Stable Diffusion. The method modifies only a small portion of the model parameters, specifically the key and value projection matrices in the cross-attention layers, to incorporate new concepts into the model's vocabulary. As a result, the additional storage required per new concept is only about 75MB. Custom Diffusion is also integrated into popular libraries such as diffusers, making it easy for developers to use.
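To make concrete which parameters such an approach touches, here is a minimal sketch in PyTorch, assuming a diffusers-style UNet in which the cross-attention key and value projections are named attn2.to_k and attn2.to_v; the checkpoint path and learning rate are illustrative, not prescribed by the project.

    import torch
    from diffusers import UNet2DConditionModel

    # Load a pretrained Stable Diffusion UNet (illustrative checkpoint path).
    unet = UNet2DConditionModel.from_pretrained(
        "CompVis/stable-diffusion-v1-4", subfolder="unet"
    )

    # Freeze everything, then re-enable only the cross-attention key/value
    # projection matrices (the "attn2" modules in diffusers naming).
    unet.requires_grad_(False)
    trainable_params = []
    for name, param in unet.named_parameters():
        if "attn2.to_k" in name or "attn2.to_v" in name:
            param.requires_grad = True
            trainable_params.append(param)

    # Only these matrices are updated; the rest of the model stays fixed,
    # which is why the per-concept storage footprint stays small.
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)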
Background and Functionality
Custom Diffusion uses a compact training process that requires only 4 to 20 images of a new concept, and training takes about 6 minutes on two A100 GPUs. Because only a small subset of parameters is refined, model updates are fast and computational requirements stay modest. This makes the method well suited to generating images built around a single new concept or several at once, such as a novel object combined with an artistic style.
Expanded Possibilities with Multi-Concept Customization
Through Custom Diffusion, multiple concepts can be blended seamlessly, enabling creative combinations like a new object with an artistic style or multiple novel objects. This potential opens up a myriad of possibilities for unique image creations, enriching fields like digital art and graphic design.
Methodology
The method operates by fine-tuning select weights in a pre-trained diffusion model. A set of about 200 regularization images is used alongside the target images to prevent overfitting, so the model remains versatile. Personalized categories are written by prefixing the category name with a new modifier token, "V*", as in "V* dog", and the embedding of this token is optimized together with the selected weights.
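As a rough sketch of how the modifier token and the regularization prompts fit together, assuming Stable Diffusion's CLIP tokenizer and text encoder loaded via the transformers library (the model path, token string, and prompts are illustrative):

    from transformers import CLIPTokenizer, CLIPTextModel

    # Illustrative Stable Diffusion checkpoint layout.
    model_id = "CompVis/stable-diffusion-v1-4"
    tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

    # Register the modifier token so it receives its own trainable embedding.
    modifier_token = "V*"
    tokenizer.add_tokens(modifier_token)
    text_encoder.resize_token_embeddings(len(tokenizer))

    # Target images are captioned with the modifier token in front of the
    # category name; regularization images keep the plain category prompt.
    target_prompt = f"photo of a {modifier_token} dog"   # the 4-20 concept images
    reg_prompt = "photo of a dog"                        # the ~200 regularization images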
Custom Diffusion also supports merging separately fine-tuned models through an optimization step, allowing even more complex multi-concept image generation. The details of this procedure are documented in the accompanying paper.
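The paper gives the exact constrained formulation; as a rough sketch of the underlying idea only, the following ridge-regularized least-squares variant (a simplification, not the paper's solver) merges per-concept key/value matrices so that the combined weight approximates each fine-tuned model on its own text features while staying close to the pretrained weight:

    import torch

    def merge_kv_weights(w0, concept_weights, concept_feats, lam=1.0):
        """Simplified least-squares merge of cross-attention K/V matrices.

        w0:              pretrained weight, shape (out_dim, d)
        concept_weights: list of fine-tuned weights, each (out_dim, d)
        concept_feats:   list of text-feature matrices C_i, each (d, n_i)
        lam:             ridge term pulling the merged weight toward w0
        """
        d = w0.shape[1]
        lhs = lam * torch.eye(d)
        rhs = lam * w0
        for w_i, c_i in zip(concept_weights, concept_feats):
            gram = c_i @ c_i.T
            lhs = lhs + gram
            rhs = rhs + w_i @ gram
        # Normal equations: W (sum_i C_i C_i^T + lam I) = sum_i W_i C_i C_i^T + lam W0
        return rhs @ torch.linalg.inv(lhs)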
Practical Application and Results
The approach yields impressive results across various image categories, such as scenes, pets, personal toys, and artistic styles. Results and additional comparisons are showcased on their webpage and gallery.
Technical Implementation
To start using Custom Diffusion, users can clone the official GitHub repository and set up the environment following the provided instructions, then download the Stable Diffusion model checkpoint it builds on. The repository supports both single-concept and multi-concept training configurations. Regularization images can be either real photographs or samples generated by the base model.
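If generated samples are used for regularization, they can be produced with the base model itself. A minimal sketch using the diffusers StableDiffusionPipeline is shown below; the checkpoint, prompt, output directory, and sampling settings are illustrative rather than the repository's exact defaults.

    import os
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # Generate ~200 plain-category images to serve as regularization data.
    os.makedirs("reg_images", exist_ok=True)
    prompt = "photo of a dog"
    for i in range(200):
        image = pipe(prompt, num_inference_steps=50, guidance_scale=6.0).images[0]
        image.save(f"reg_images/dog_{i:03d}.png")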
Integration with Diffusers
Custom Diffusion's recent integration with the diffusers library further expands its accessibility. Users can find training and inference details in the diffusers GitHub repository.
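As a rough illustration of how inference might look after training with the diffusers Custom Diffusion example, the sketch below loads the saved cross-attention weights and the learned token embedding; the output directory, weight file names, and the <new1> token string follow the diffusers example's conventions at the time of writing and may differ between versions.

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # Load the fine-tuned cross-attention weights and the learned modifier
    # token embedding saved by the training example (paths are illustrative).
    pipe.unet.load_attn_procs(
        "path/to/custom-diffusion-output",
        weight_name="pytorch_custom_diffusion_weights.bin",
    )
    pipe.load_textual_inversion(
        "path/to/custom-diffusion-output", weight_name="<new1>.bin"
    )

    image = pipe(
        "<new1> cat sitting on a bench",
        num_inference_steps=50,
        guidance_scale=6.0,
    ).images[0]
    image.save("custom_concept.png")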
Fine-Tuning Notes
For fine-tuning on human faces in particular, adjusting the learning rate and the number of training steps (typically a lower learning rate and more steps) is recommended to achieve good results. Model compression is also available and substantially reduces the size of the stored update while maintaining output quality.
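The compression relies on the fact that the fine-tuning update to each key/value matrix can be stored in low-rank form. The following is a generic sketch of that idea using a truncated SVD of the weight difference, not necessarily the repository's exact implementation; the rank is an illustrative parameter.

    import torch

    def compress_update(w_finetuned, w_pretrained, rank=16):
        """Store the fine-tuning update as a low-rank factorization."""
        delta = w_finetuned - w_pretrained
        u, s, vh = torch.linalg.svd(delta, full_matrices=False)
        # Keep only the top singular directions; these factors are what get saved.
        u_r = u[:, :rank] * s[:rank]
        v_r = vh[:rank, :]
        return u_r, v_r

    def decompress_update(w_pretrained, u_r, v_r):
        """Reconstruct an approximate fine-tuned weight at load time."""
        return w_pretrained + u_r @ v_r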
References and Attribution
Custom Diffusion is a collaboration among researchers from esteemed institutions and is partially supported by Adobe Inc. The project owes its comprehensive development to valuable feedback from colleagues and the use of resources like Unsplash for datasets. For more technical references and acknowledgments, consult the publication in CVPR 2023 and the dedicated project webpage.
The project stands as a testament to the possibilities that emerge when leveraging minimal data for maximal creative output in text-to-image transformations.