Rich-Text-to-Image: A Comprehensive Guide
Rich-Text-to-Image is an innovative project that enhances the text-to-image generation process by incorporating various formatting features of rich text. Developed by Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang, this project aims to bring a higher level of detail and control to image generation by leveraging formatting attributes such as font size, color, style, and footnotes.
Overview of the Project
The core idea behind Rich-Text-to-Image is to use the diverse formatting options available in rich text to exert finer control over the image generation process. By employing these formatting details, the method enables explicit token reweighting, precise color rendering, local style control, and detailed region synthesis. This allows users to generate more nuanced and precise images from textual descriptions.
Key Features
- Font Color Control: Users can specify exact colors for the elements described in the text. For example, a prompt can ask for a Gothic church rendered in a specific shade such as #b26b00.
- Footnotes for Detail: Adding a footnote to a text element supplies extra descriptive detail for the corresponding region. This is particularly useful for generating complex images, such as a cat wearing sunglasses and a bandana.
- Font Style for Artistic Flair: The font style defines the artistic style of a specific image region. For instance, a garden can be depicted in the style of Claude Monet while other parts of the image follow different styles.
- Font Size for Emphasis: A larger font size places more weight on the corresponding tokens, increasing their prominence in the final image. This is useful for adjusting the relative importance of elements, such as adding more pineapple to a pizza. A sketch of how these attributes appear in a prompt follows this list.
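To make these attributes concrete, here is a minimal sketch of a rich-text prompt expressed as JSON. It assumes the Delta format of the Quill editor the project builds on; the attribute keys (color, link, font, size) and the font name are illustrative assumptions and may differ from the repository's exact schema.

```python
import json

# A hypothetical rich-text prompt in Quill's Delta format. Each "insert" is a
# text span; "attributes" carry the formatting that steers generation. All
# attribute keys below are assumptions for illustration.
rich_text_prompt = {
    "ops": [
        # Font color -> precise color rendering for this span.
        {"insert": "a Gothic church", "attributes": {"color": "#b26b00"}},
        {"insert": " next to "},
        # Footnote (stored here under "link") -> extra detail for the region.
        {"insert": "a cat", "attributes": {"link": "wearing sunglasses and a bandana"}},
        {"insert": " in "},
        # Font style -> artistic style; font size -> extra token weight.
        # "mirza" is a placeholder font name standing in for a style choice.
        {"insert": "a garden", "attributes": {"font": "mirza", "size": "20px"}},
    ]
}

# The serialized JSON string is what the generation scripts consume.
print(json.dumps(rich_text_prompt, indent=2))
```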
Getting Started
To explore Rich-Text-to-Image, users can integrate the system into their applications using Python 3.8 and PyTorch 1.11. The project supports several diffusion models, including Stable Diffusion and ANIMAGINE-XL. Setup involves cloning the repository, creating an environment, and installing the required packages.
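Once the environment is ready, a quick way to confirm that a diffusion backbone loads is to run the plain Stable Diffusion pipeline from the HuggingFace diffusers library, which the project's model code builds on. This is only a sanity-check sketch, not the project's own entry point, and the model identifier is just one common choice.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a plain Stable Diffusion pipeline via diffusers. This verifies the
# environment; it does not use the project's rich-text controls.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate a baseline image from an ordinary plain-text prompt.
image = pipe("a Gothic church next to a garden").images[0]
image.save("plain_text_baseline.png")
```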
Usage
The image generation process involves two main steps:
- First, the plain-text version of the prompt is fed to the diffusion model to compute cross-attention maps, which associate each token with the image regions it influences.
- Second, the attributes of the rich-text prompt, stored in a JSON format, are applied to their corresponding text spans to control those regions.
The project offers several ways to generate images from rich-text JSON inputs, either through a local demo app built with Gradio or directly from the command line, as sketched below.
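For the command-line route, the invocation below is a hypothetical sketch wrapped in Python's subprocess module: the script name (sample.py) and flag (--rich_text_json) are assumptions, so check the repository README for the exact interface.

```python
import json
import subprocess

# A rich-text prompt with a single styled span; the "font" attribute and its
# value are placeholders standing in for a style choice.
prompt = {
    "ops": [
        {"insert": "a garden by a lake", "attributes": {"font": "mirza"}},
    ]
}

# Hypothetical CLI invocation; script name and flag name are assumptions.
subprocess.run(
    ["python", "sample.py", "--rich_text_json", json.dumps(prompt)],
    check=True,
)
```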
Evaluation and Visualization
To assess the performance of the Rich-Text-to-Image system, users can run evaluation scripts that compare it against other methods on local style generation and precise color reproduction. Additionally, the system includes tools for visualizing token maps, which help show how parts of the text description map onto image regions.
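The project ships its own visualization tools; the snippet below only sketches the general idea of rendering a token map as a heatmap, assuming you have already extracted a per-token attention map as a NumPy array (the array here is random placeholder data).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for a real per-token cross-attention map: an H x W array of
# scores indicating how strongly one token attends to each spatial location.
token_map = np.random.rand(64, 64)

# Render the map as a heatmap so high-attention regions stand out.
plt.imshow(token_map, cmap="viridis")
plt.colorbar(label="attention weight")
plt.title('token map for "church"')
plt.axis("off")
plt.savefig("token_map.png", bbox_inches="tight")
```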
Contributions and Support
The project has gained recognition and support from various collaborators, including individuals from Adobe and Carnegie Mellon University. Contributions from the HuggingFace team have been crucial in developing the online demo, while the underlying rich-text editor is based on Quill, and the model code relies on the HuggingFace diffusers library.
In conclusion, Rich-Text-to-Image represents a significant advance in text-to-image generation, offering fine-grained control over color, style, detail, and emphasis through the use of rich text formatting.