Introducing StableVideo
StableVideo is a video-editing project presented at ICCV 2023. It offers text-driven, consistency-aware diffusion video editing, bridging the gap between state-of-the-art diffusion models and user-friendly video modification. Developed by Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu, StableVideo applies cutting-edge AI methods to turn video editing into an intuitive yet powerful tool.
Key Features
- Text-Driven Video Editing: At the heart of StableVideo's innovation is the ability to edit videos using simple text commands. This feature harnesses natural-language prompts to drive modifications such as filtering, enhancement, or significant alterations, without complicated manual editing (see the first sketch after this list).
- Consistency-Aware Diffusion: The project emphasizes maintaining visual consistency across frames. By modeling the spatial and temporal coherence of a video, StableVideo ensures that edits carry smoothly from frame to frame, producing seamless output (see the second sketch after this list).
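To make the text-driven side concrete, here is a minimal illustrative sketch of editing a single keyframe from a text prompt with a ControlNet-conditioned diffusion pipeline from the diffusers library. This is not StableVideo's own API; the model IDs, frame path, and prompt are assumptions chosen for the example.

# Illustrative only: text-driven editing of one keyframe with a
# ControlNet-conditioned diffusion pipeline (diffusers), not StableVideo's API.
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

keyframe = load_image("data/car-turn/frame_000.png")  # hypothetical frame path

# Build a Canny edge map to guide the edit, as in the diffusers ControlNet docs.
edges = cv2.Canny(np.array(keyframe), 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

edited = pipe(
    prompt="a jeep driving on a snowy road",  # the text instruction
    image=keyframe,                           # the frame being edited
    control_image=edge_image,                 # structure to preserve
    strength=0.8,                             # edit intensity
).images[0]
edited.save("edited_keyframe.png")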
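On the consistency side, StableVideo builds on layered neural atlas representations (in the spirit of Text2LIVE): every frame stores coordinates into a shared atlas, so an edit applied once to the atlas propagates coherently to all frames. Below is a minimal sketch of that propagation step; the tensor shapes and random inputs are invented for illustration.

# Minimal sketch of atlas-based propagation: each frame stores UV coordinates
# into a shared atlas, so editing the atlas once keeps all frames consistent.
import torch
import torch.nn.functional as F

def propagate_edit(edited_atlas: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Map one edited atlas back into every frame.

    edited_atlas: (1, 3, H_a, W_a) image edited once by the diffusion model.
    uv: (T, H, W, 2) per-frame atlas coordinates in [-1, 1].
    """
    atlas = edited_atlas.expand(uv.shape[0], -1, -1, -1)
    # Every frame samples its pixels from the same edited atlas, so the
    # edit stays consistent over time by construction.
    return F.grid_sample(atlas, uv, align_corners=True)

frames = propagate_edit(
    torch.rand(1, 3, 512, 512),          # stand-in for an edited atlas
    torch.rand(8, 256, 256, 2) * 2 - 1,  # stand-in for learned UV maps
)
print(frames.shape)  # torch.Size([8, 3, 256, 256])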
Installation and Requirements
To get started with StableVideo, you will need hardware that meets its considerable computational demands. It can run under several configurations depending on the available resources:
- VRAM Requirements: Depending on the precision and computational strategy chosen (such as float32 or mixed precision), the VRAM requirement ranges from 14,185 MiB to 29,145 MiB. Choosing the right configuration helps balance quality and performance for your hardware (see the helper sketch below).
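As an aside, here is a small hypothetical helper, not part of StableVideo, showing one way to pick a precision automatically from the free VRAM; the threshold mirrors the upper figure quoted above.

# Hypothetical helper, not part of StableVideo: choose float32 when there is
# enough free VRAM for the full-precision configuration, else fall back to fp16.
import torch

def pick_dtype(full_precision_mib: int = 29145) -> torch.dtype:
    if not torch.cuda.is_available():
        return torch.float32  # CPU fallback
    free_bytes, _ = torch.cuda.mem_get_info()  # (free, total) in bytes
    return torch.float32 if free_bytes / 2**20 >= full_precision_mib else torch.float16

print(pick_dtype())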
Installing StableVideo involves cloning its repository, setting up a Python environment, and installing the required libraries:
git clone https://github.com/rese1f/StableVideo.git
cd StableVideo
conda create -n stablevideo python=3.11
conda activate stablevideo
pip install -r requirements.txt
For flexibility, a CPU-only version is also available, with a demo hosted on Hugging Face.
Pre-trained Models and Example Videos
Users can download pre-trained models and detectors necessary for video editing from the ControlNet page on Hugging Face. Additionally, example videos are available to experiment with the software's capabilities, including scenarios like car-turn and sports actions, downloadable from a shared Dropbox archive.
These resources enable users to quickly set up environments and test the capabilities without initial training or data preparation.
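For illustration, one way to fetch the ControlNet checkpoints programmatically is with the huggingface_hub client. The filenames follow the lllyasviel/ControlNet repository, and the ckpt/ destination directory is an assumption about the expected local layout.

# Hedged sketch: download ControlNet checkpoints with huggingface_hub.
# The ckpt/ destination directory is an assumed local layout, not confirmed.
from huggingface_hub import hf_hub_download

for name in ("control_sd15_canny.pth", "control_sd15_depth.pth"):
    hf_hub_download(
        repo_id="lllyasviel/ControlNet",  # upstream ControlNet weights
        filename=f"models/{name}",
        local_dir="ckpt",
    )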
How to Run
Once everything is set up, StableVideo can be initiated by executing:
python app.py
After processing, the resulting video and its keyframes are saved to the designated output directory for easy access. Users can also edit the mask region of the foreground atlas, although minor interface glitches in this feature are still being refined.
Acknowledgments and Citation
StableVideo is built on foundational work from projects like Text2LIVE and ControlNet, demonstrating the collaborative nature of advancements in AI. Researchers and developers who build on it can cite StableVideo as follows:
@article{chai2023stablevideo,
  title={StableVideo: Text-driven Consistency-aware Diffusion Video Editing},
  author={Chai, Wenhao and Guo, Xun and Wang, Gaoang and Lu, Yan},
  journal={arXiv preprint arXiv:2308.09592},
  year={2023}
}
StableVideo combines power with simplicity, making advanced video editing accessible to a wide range of users, from amateur editors to seasoned video production professionals.