Ranni - Optimizing Text-to-Image Diffusion for Improved Prompt Precision

Introduction to Ranni

Ranni is a project that improves how accurately text-to-image diffusion models follow text instructions. It is described in the CVPR 2024 paper "Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following," written by researchers from Alibaba Group and Ant Group.

Components of the Ranni Project

Ranni comprises two main components:

LLM-based Planning Model: Interprets the text instruction and converts it into the visual elements to be depicted in the image. It relies on a Large Language Model (LLM) to capture the semantics of the prompt.
Diffusion-based Painting Model: Takes the visual elements produced by the planning stage and generates the final image from them.

Together, these components enable Ranni to achieve improved semantic understanding and image generation.
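
The hand-off between the two stages can be pictured with a small data-flow sketch. Everything in it is hypothetical: the Element structure, the plan function, and the equal-width box layout are illustrative stand-ins for the LLM planner, not Ranni's actual API or panel format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One planned visual element (hypothetical structure, not Ranni's format)."""
    description: str
    bbox: tuple  # (x0, y0, x1, y1), normalized to [0, 1]


def plan(prompt):
    """Toy stand-in for the LLM planner.

    Splits the prompt on ' and ' and places one element per phrase
    in equal-width columns from left to right.
    """
    phrases = [p.strip() for p in prompt.rstrip(".").split(" and ") if p.strip()]
    if not phrases:
        return []
    width = 1.0 / len(phrases)
    return [
        Element(phrase, (i * width, 0.0, (i + 1) * width, 1.0))
        for i, phrase in enumerate(phrases)
    ]


# The resulting list is what a painting stage would consume.
panel = plan("A black dog and a white cat")
for element in panel:
    print(element.description, element.bbox)
```

In the real system the planning is done by the finetuned LLM rather than by string splitting; the sketch only shows that the painting stage works from structured elements instead of the raw prompt.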

Achievements and Features

CVPR 2024 Recognition: Ranni was accepted as an oral paper at CVPR 2024.
Model Releases: The project has released the weights of its models, including a LoRA-finetuned LLaMa-2-7B and a fully finetuned Stable Diffusion v2.1 model.

Getting Started with Ranni

Installation

To set up the Ranni environment, install the required dependencies with Conda:

conda env create -f environment.yaml
conda activate ranni

Downloading Checkpoints

Download the model checkpoints and place them in the models directory:

models/
  llama2_7b_lora_bbox.pth
  llama2_7b_lora_element.pth
  ranni_sdv21_v1.pth
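
Before launching anything, it can help to confirm that all three files are in place. The following is a minimal sketch: the file names come from the listing above, while the missing_checkpoints helper is a hypothetical convenience function and not part of the Ranni codebase.

```python
from pathlib import Path

# File names as listed above; the helper below is hypothetical.
EXPECTED_CHECKPOINTS = (
    "llama2_7b_lora_bbox.pth",
    "llama2_7b_lora_element.pth",
    "ranni_sdv21_v1.pth",
)


def missing_checkpoints(models_dir="models"):
    """Return the expected checkpoint file names not found in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```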

Interactive Image Generation Demo

Ranni provides an interactive demo built with Gradio. Generation takes two steps: first generate a semantic panel from the text prompt, then generate the corresponding image from that panel. An example prompt is "A black dog and a white cat."

Continuous Editing Features

The system also supports continuous editing, where users can modify elements in the generated image by adjusting prompts or visual components. For example, users can alter a prompt from "black dog" to "white dog," and Ranni will update the image accordingly while preserving contextual consistency.
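
One way to picture such an edit is as a change to a single planned element while the others stay untouched. The sketch below assumes a hypothetical Element structure and edit_description helper; it illustrates the idea of a localized edit and is not Ranni's actual editing code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Element:
    """One planned visual element (hypothetical structure)."""
    description: str
    bbox: tuple  # (x0, y0, x1, y1), normalized to [0, 1]


def edit_description(panel, index, new_description):
    """Return a new panel with one element's description changed.

    All other elements, and the edited element's box, are left as they were.
    """
    return [
        replace(element, description=new_description) if i == index else element
        for i, element in enumerate(panel)
    ]


panel = [
    Element("black dog", (0.0, 0.0, 0.5, 1.0)),
    Element("white cat", (0.5, 0.0, 1.0, 1.0)),
]
edited = edit_description(panel, 0, "white dog")
print(edited)
```

Because only the targeted element changes, a painting stage that works from the panel has the same layout and the same remaining elements to draw from, which is the intuition behind keeping the rest of the image consistent.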

Acknowledgements

Ranni builds on the codebases of Stability AI's Stable Diffusion and lllyasviel's ControlNet. These resources have been instrumental in the project's development.

In summary, Ranni pairs an LLM-based planning model with a diffusion-based painting model so that generated images follow text instructions more accurately, and it provides a demo for interactive image generation and editing.