# Prompt-to-Prompt: An Introduction
Latent Diffusion and Stable Diffusion are powerful image generation models. The Prompt-to-Prompt project builds on them to offer a distinctive approach to image editing: instead of requiring masks or retraining, it lets users edit images purely through text prompts by manipulating the attention maps inside the diffusion model.
## Setup
The project is implemented in Python 3.8 with PyTorch 1.11, and pre-trained models are loaded through huggingface/diffusers. Two families of diffusion models are supported: Latent Diffusion and Stable Diffusion. The packages needed to run the project are listed in the requirements file. The code was tested on a Tesla V100 GPU with 16GB of VRAM, but it should also run on other GPUs with at least 12GB of VRAM.
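As a quick sanity check that the environment is set up correctly, the Stable Diffusion weights can be loaded through diffusers. This is a minimal sketch, assuming the `CompVis/stable-diffusion-v1-4` checkpoint and a CUDA device; the notebooks below handle model loading themselves:

```python
# Minimal environment check. The model id and device are assumptions;
# adjust them to match your setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")

image = pipe("A painting of a squirrel eating a burger").images[0]
image.save("squirrel_burger.png")
```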
## Quickstart
The fastest way to understand how Prompt-to-Prompt works is to start with the two provided Jupyter notebooks, `prompt-to-prompt_ldm` and `prompt-to-prompt_stable`. They demonstrate end-to-end prompt-to-prompt examples over Latent Diffusion and Stable Diffusion, respectively, and offer a practical introduction to prompt edits and the API used to apply them.
## Prompt Edits
At the heart of the project is the abstract class `AttentionControl`, which is used to apply the different prompt edits. Each attention layer in the diffusion model routes its attention weights through this class, allowing them to be altered during image generation:
```python
import abc

class AttentionControl(abc.ABC):

    @abc.abstractmethod
    def forward(self, attn, is_cross: bool, place_in_unet: str):
        # Receives one layer's attention map and returns the (possibly
        # edited) map to be used in its place.
        raise NotImplementedError
```
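For instance, a controller that returns every attention map unchanged reproduces ordinary, unedited generation. Here is a minimal pass-through sketch of such a subclass (the repository includes a comparable no-op controller):

```python
# Pass-through controller: leaves all attention maps untouched, so
# generation proceeds exactly as it would without any edit.
class EmptyControl(AttentionControl):

    def forward(self, attn, is_cross: bool, place_in_unet: str):
        return attn
```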
The key idea is that different styles of image editing correspond to different ways of modifying the attention weights. The supported types of prompt edits, each influencing the attention weights in its own way, are outlined below:
- **Replacement:** swapping tokens of the original prompt for new ones. For example, changing "A painting of a squirrel eating a burger" to "A painting of a squirrel eating a lasagna". This is managed by the `AttentionReplace` class (see the usage sketch after this list).
- **Refinement:** adding new descriptive tokens to enhance the prompt, such as "A watercolor painting of a squirrel eating a burger". This is managed by the `AttentionRefine` class.
- **Re-weight:** adjusting the influence of certain tokens within the prompt. For instance, in "A photo of a poppy field at night", changing how prominently "night" impacts the final image. This is managed by the `AttentionReweight` class.
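As a rough illustration, a controller is constructed for a pair of prompts and then drives the joint generation of both images. The sketch below follows the style of the notebooks; the exact constructor signature (e.g., the `num_steps` argument name) is an assumption and may differ slightly from the code:

```python
# Illustrative sketch; argument names are assumptions based on the
# notebooks and may differ from the actual constructor signature.
prompts = ["A painting of a squirrel eating a burger",
           "A painting of a squirrel eating a lasagna"]

controller = AttentionReplace(prompts, num_steps=50,
                              cross_replace_steps=0.8,
                              self_replace_steps=0.4)
# The controller is hooked into the model's attention layers, and both
# prompts are denoised jointly while sharing the edited attention maps.
```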
### Attention Control Options
Several options are available for controlling the attention mechanism:
- `cross_replace_steps`: the fraction of diffusion steps during which the cross-attention maps are modified; this can optionally be specified per word in the prompt.
- `self_replace_steps`: the fraction of diffusion steps during which the self-attention maps are replaced.
- `local_blend` (optional): a `LocalBlend` object for localized edits. It is initialized with words from the prompt that correspond to specific image areas, confining the edit to those regions.
- `equalizer`: used exclusively for attention re-weighting; a vector that scales the cross-attention weightings (see the sketch after this list).
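For example, a re-weighting edit combines an equalizer with the step options above. A hypothetical sketch, assuming a `get_equalizer` helper along the lines of the one used in the notebooks:

```python
# Hypothetical sketch: `get_equalizer` and the argument names mirror the
# notebooks but are assumptions, not the exact API.
prompts = ["A photo of a poppy field at night"] * 2

# Build a per-token scaling vector that amplifies "night" by 5x.
equalizer = get_equalizer(prompts[1], word_select=("night",), values=(5.0,))

controller = AttentionReweight(prompts, num_steps=50,
                               cross_replace_steps=0.8,
                               self_replace_steps=0.4,
                               equalizer=equalizer)
```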
## Additional Tools and Options
The project also includes Null-Text Inversion, a tool for text-based editing of real images with the Stable Diffusion model. It first applies DDIM inversion to obtain a diffusion trajectory for the input image, then fine-tunes the null-text (unconditional) embedding used by classifier-free guidance so that guided sampling faithfully reconstructs that trajectory; prompt-to-prompt edits can then be applied on top of the inverted image.
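In outline, one null-text embedding is optimized per diffusion step so that the guided denoising path matches the DDIM-inversion path. The following is a conceptual sketch only, not the repository's code: `unet`, `ddim_step`, and the argument layout are hypothetical placeholders standing in for the real model and scheduler calls:

```python
# Conceptual sketch of null-text inversion; `unet(z, t, emb)` (noise
# prediction) and `ddim_step(eps, t, z)` (one deterministic DDIM
# denoising step) are hypothetical placeholders, not real APIs.
import torch
import torch.nn.functional as F

def null_text_inversion(trajectory, timesteps, text_emb, null_emb,
                        guidance_scale=7.5, inner_steps=10, lr=1e-2):
    """Tune one null-text embedding per step so that guided sampling
    reproduces the DDIM-inversion trajectory z_0 ... z_T."""
    null_embs = []
    z = trajectory[-1]  # start from the inverted z_T
    for i, t in enumerate(reversed(timesteps)):
        target = trajectory[-(i + 2)]  # next latent on the inversion path
        null = null_emb.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([null], lr=lr)
        with torch.no_grad():
            eps_text = unet(z, t, text_emb)  # fixed conditional prediction
        for _ in range(inner_steps):
            eps_uncond = unet(z, t, null)
            eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
            loss = F.mse_loss(ddim_step(eps, t, z), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        null_embs.append(null.detach())
        with torch.no_grad():  # advance using the tuned embedding
            eps_uncond = unet(z, t, null_embs[-1])
            eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
            z = ddim_step(eps, t, z)
    return null_embs  # one tuned null-text embedding per step
```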
## Disclaimer
Please note that this is not an officially supported Google product.
Exploring these notebooks and experimenting with the controllers above is the best way to get a feel for the image edits Prompt-to-Prompt makes possible through text alone.