FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
FastComposer tackles the inefficiencies and limitations of existing text-to-image generation models, particularly for creating personalized and multi-subject images. Its tuning-free approach makes the generation process efficient and versatile.
Abstract
Diffusion models have shown great potential in generating images from text prompts, especially for creating images tailored to specific individuals or subjects. However, existing methods require subject-specific fine-tuning, which is resource-intensive and slows deployment. These methods also struggle to generate images with multiple subjects, often blending features across different subjects.
FastComposer addresses these issues by enabling efficient, personalized image creation without fine-tuning. It uses subject embeddings extracted by an image encoder and injects them into the diffusion model's conditioning, so personalized generation requires only forward passes while combining reference subject images with textual instructions.
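As a rough illustration of the idea (the module, dimensions, and names below are assumptions for the sketch, not the project's actual implementation), the subject's image embedding can be fused into the prompt's token embedding at the position that names the subject:

import torch
import torch.nn as nn

class SubjectAugmenter(nn.Module):
    """Illustrative sketch: fuse an image-encoder embedding of each
    reference subject into the text embedding at that subject's token
    position, so conditioning needs only forward passes."""

    def __init__(self, text_dim=768, image_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_embs, subject_img_embs, subject_positions):
        # text_embs: (seq_len, text_dim) prompt token embeddings
        # subject_img_embs: (num_subjects, image_dim) image-encoder outputs
        # subject_positions: token indices naming each subject in the prompt
        out = text_embs.clone()
        for pos, img_emb in zip(subject_positions, subject_img_embs):
            fused = torch.cat([text_embs[pos], img_emb], dim=-1)
            out[pos] = self.mlp(fused)
        return out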
To solve the problem of identity blending in multi-subject generation, FastComposer adds cross-attention localization supervision during training, which directs each subject token's attention to the correct subject region in the generated image. To prevent overfitting to subject features, it also employs delayed subject conditioning during denoising, preserving both subject identity and the editability of the generated image.
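Both techniques can be sketched in a few lines of Python (shapes, names, and thresholds here are illustrative assumptions, not the repository's actual code): the localization term rewards attention inside each subject's segmentation mask and penalizes attention outside it, while delayed conditioning switches from the plain text embedding to the subject-augmented one only after the early denoising steps have fixed the image layout.

import torch

def localization_loss(attn_maps, seg_masks):
    # attn_maps: (N, H, W) cross-attention maps, one per subject token
    # seg_masks: (N, H, W) binary masks marking each subject's region
    # Balanced objective: attention mass should sit inside the mask.
    inside = (attn_maps * seg_masks).sum(dim=(1, 2)) \
        / seg_masks.sum(dim=(1, 2)).clamp(min=1)
    outside = (attn_maps * (1 - seg_masks)).sum(dim=(1, 2)) \
        / (1 - seg_masks).sum(dim=(1, 2)).clamp(min=1)
    return (outside - inside).mean()

def pick_conditioning(step, num_steps, alpha, text_only_emb, augmented_emb):
    # Delayed subject conditioning: use the plain text embedding for the
    # first alpha fraction of denoising steps (layout), then switch to the
    # subject-augmented embedding (identity). alpha is a tunable fraction.
    return text_only_emb if step < int(alpha * num_steps) else augmented_emb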
The results are impressive: FastComposer generates images of unseen individuals in a variety of styles, actions, and contexts, achieves a 300x-2500x speedup over fine-tuning-based methods, and requires no additional storage for new subjects. FastComposer represents a leap forward in the field of image generation, paving the way for more efficient, high-quality personalized image creation.
How to Use FastComposer
Environment Setup
To get started with FastComposer, you need to set up your environment. First, create a new conda environment and activate it:
conda create -n fastcomposer python
conda activate fastcomposer
Next, install the necessary packages:
pip install torch torchvision torchaudio
pip install transformers==4.25.1 accelerate datasets evaluate diffusers==0.16.1 xformers triton scipy clip gradio facenet-pytorch
Finally, install FastComposer:
python setup.py install
Download Pre-trained Models
You can download the pre-trained models by executing the following:
mkdir -p model/fastcomposer ; cd model/fastcomposer
wget https://huggingface.co/mit-han-lab/fastcomposer/resolve/main/pytorch_model.bin
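If you prefer Python to wget, the same checkpoint can be fetched with the huggingface_hub client (installed as a dependency of diffusers):

from huggingface_hub import hf_hub_download

# Equivalent to the wget command above: download pytorch_model.bin from
# the mit-han-lab/fastcomposer repository into model/fastcomposer/.
path = hf_hub_download(
    repo_id="mit-han-lab/fastcomposer",
    filename="pytorch_model.bin",
    local_dir="model/fastcomposer",
)
print(path)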
Run the Gradio Demo
A hosted demo is available online. To run the demo locally, use the following command:
python demo/run_gradio.py --finetuned_model_path model/fastcomposer/pytorch_model.bin --mixed_precision "fp16"
Inference
To perform inference using FastComposer, execute:
bash scripts/run_inference.sh
Evaluation
For evaluation purposes, run the following commands:
python evaluation/single_object/run.py --finetuned_model_path model/fastcomposer/pytorch_model.bin --mixed_precision "fp16" --dataset_name data/celeba_test_single/ --seed 42 --num_images_per_prompt 4 --object_resolution 224 --output_dir OUTPUT_DIR
python evaluation/single_object/single_object_evaluation.py --prediction_folder OUTPUT_DIR --reference_folder data/celeba_test_single/
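For a quick sanity check of identity preservation outside the provided scripts, you can compare FaceNet embeddings of a generated face and a reference face using facenet-pytorch, which the environment setup above already installs (the file paths below are placeholders, not actual output filenames):

import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image

mtcnn = MTCNN(image_size=160)                              # face detector + cropper
resnet = InceptionResnetV1(pretrained="vggface2").eval()   # FaceNet encoder

def face_embedding(path):
    face = mtcnn(Image.open(path).convert("RGB"))
    if face is None:
        raise ValueError(f"no face detected in {path}")
    with torch.no_grad():
        return resnet(face.unsqueeze(0))[0]

gen = face_embedding("OUTPUT_DIR/generated_0.png")         # placeholder path
ref = face_embedding("data/celeba_test_single/ref.png")    # placeholder path
sim = torch.nn.functional.cosine_similarity(gen, ref, dim=0)
print(f"identity similarity: {sim.item():.3f}")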
Training
To prepare for training, download and extract the FFHQ training data:
cd data
wget https://huggingface.co/datasets/mit-han-lab/ffhq-fastcomposer/resolve/main/ffhq_fastcomposer.tgz
tar -xvzf ffhq_fastcomposer.tgz
Then, run the training script:
bash scripts/run_training.sh
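Conceptually, training combines the standard diffusion denoising loss with the localization term sketched earlier. A hedged outline of one step (the weight and all names are assumptions for illustration, not the repository's actual values):

import torch.nn.functional as F

def training_step(unet, noisy_latents, timesteps, cond_embs,
                  target_noise, attn_maps, seg_masks, loc_weight=0.001):
    # Standard denoising objective: predict the added noise.
    pred = unet(noisy_latents, timesteps,
                encoder_hidden_states=cond_embs).sample
    denoise_loss = F.mse_loss(pred, target_noise)
    # Add the cross-attention localization term from the sketch above.
    loc_loss = localization_loss(attn_maps, seg_masks)
    return denoise_loss + loc_weight * loc_loss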
Future Work and Citation
The team behind FastComposer continues to work on releasing additional evaluation code and data. If you find this project beneficial or use it in your research, please cite their paper:
@article{xiao2023fastcomposer,
  title={FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention},
  author={Xiao, Guangxuan and Yin, Tianwei and Freeman, William T. and Durand, Frédo and Han, Song},
  journal={International Journal of Computer Vision},
  year={2024}
}
FastComposer is a notable step forward in personalized and multi-subject image generation. With its efficient approach and high-quality results, it sets a new benchmark for future work in this field.