TokenFlow: Consistent Diffusion Features for Consistent Video Editing
TokenFlow is a framework for text-driven video editing built on a pre-trained text-to-image diffusion model. It requires no additional training or fine-tuning of the model, yet delivers high-quality, temporally consistent video edits.
Introduction to TokenFlow
Generative AI has recently expanded into video editing, but existing video models still lag behind image models in visual quality and in the degree of user control over the generated content. TokenFlow addresses this gap by harnessing the strengths of text-to-image diffusion models for text-driven video editing. Given a source video and a target text prompt, TokenFlow generates a high-quality video that preserves the spatial layout and dynamics of the input while conforming to the content described in the text.
Key Features
TokenFlow stands out by keeping the edited video temporally consistent through consistency in the diffusion feature space: diffusion features are propagated across frames according to inter-frame correspondences that are naturally available in the model. As a result, TokenFlow needs no extra training or fine-tuning and works seamlessly with various off-the-shelf text-to-image editing methods, as sketched below.
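To make the propagation idea concrete, here is a minimal conceptual sketch, not the authors' implementation: it copies edited keyframe tokens to every frame using nearest-neighbour correspondences computed on the source video's features. The function name, tensor shapes, and the plain nearest-neighbour matching are illustrative assumptions.

import torch

def propagate_features(src_frame_feats, src_key_feats, edited_key_feats):
    # src_frame_feats:  (T, N, C) source-video tokens for every frame
    # src_key_feats:    (K, N, C) source-video tokens of the sampled keyframes
    # edited_key_feats: (K, N, C) tokens of the edited keyframes
    # returns:          (T, N, C) edited tokens propagated to every frame
    K, N, C = src_key_feats.shape
    key_flat = src_key_feats.reshape(K * N, C)        # all source keyframe tokens
    edited_flat = edited_key_feats.reshape(K * N, C)  # their edited counterparts
    out = []
    for frame in src_frame_feats:                     # frame: (N, C)
        # correspondence: nearest keyframe token in the source feature space
        nn_idx = torch.cdist(frame, key_flat).argmin(dim=1)  # (N,)
        # propagation: copy the corresponding edited tokens into this frame
        out.append(edited_flat[nn_idx])
    return torch.stack(out)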
Sample Results
The project showcases impressive editing results on a variety of real-world videos, demonstrating the framework's effectiveness in preserving video structure while incorporating new content based on textual prompts.
Environment Setup
To use TokenFlow, users need to set up the environment by running the following commands:
conda create -n tokenflow python=3.9
conda activate tokenflow
pip install -r requirements.txt
Preprocessing
Users preprocess their videos with the provided preprocessing script, which exposes parameters such as the video height and width and the Stable Diffusion version. A key requirement for successful editing with TokenFlow is a good reconstruction of the input video, which is saved as inverted.mp4.
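A representative invocation is shown below; the flag names follow the repository's preprocess.py at the time of writing and the file path and prompt are placeholders, so verify the exact interface against the repo:

python preprocess.py --data_path data/my_video.mp4 --H 512 --W 512 --sd_version 2.1 --inversion_prompt "a short description of the video"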
Editing Process
TokenFlow is designed for structure-preserving video edits. It operates atop existing image editing techniques such as Plug-and-Play or ControlNet. Therefore, it’s essential to ensure compatibility with the chosen base editing technique.
For video editing, users should create a YAML configuration similar to configs/config_pnp.yaml and execute the corresponding script, adapting both for other techniques (e.g., ControlNet or SDEdit) as needed; an example invocation is given below.
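As a sketch of the workflow, editing with Plug-and-Play features might look like the following; the script name mirrors the repository, while the --config_path flag is an assumption to be checked against the current README:

python run_tokenflow_pnp.py --config_path configs/config_pnp.yaml

For ControlNet or SDEdit, point the analogous scripts (run_tokenflow_controlnet.py, run_tokenflow_SDEdit.py) at their corresponding config files.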
Conclusion
TokenFlow is a groundbreaking framework for text-driven video editing, providing consistent and high-quality edits without requiring additional training. By focusing on diffusion feature consistency, it offers an effective solution to enhance video editing capabilities, making it a valuable tool for creators looking to explore the intersection of AI and video content creation.
For further information and access to sample results, visit the project webpage.