VideoElevator: Enhancing Video Quality Through Advanced Diffusion Models
VideoElevator is an innovative project that focuses on improving the quality of generated videos by utilizing sophisticated text-to-image diffusion models. This project is notable for being both "training-free" and "plug-and-play," making it highly accessible and versatile for various applications involving text-to-video (T2V) and text-to-image (T2I) models.
Project Highlights
- Release Date: On July 4, 2024, the full code for VideoElevator was made publicly available, accompanied by three illustrative scripts to demonstrate its capabilities.
Methodology
At its core, VideoElevator tackles the challenge of video generation by introducing a novel approach that separates the process into two distinct phases: temporal motion refinement and spatial quality enhancement.
- Traditional T2V approach: Conventional T2V methods handle the temporal (motion over time) and spatial (image quality and detail) aspects simultaneously, which often yields low-quality frames throughout the generation process.
- VideoElevator approach: VideoElevator explicitly decouples these two elements. It first refines temporal motion with a T2V model to ensure smooth, consistent transitions over time, then elevates spatial quality with a T2I model to add finer detail and more faithful visuals (for example, rendering a character correctly wearing a suit). This two-step process substantially improves the quality of the final video.
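The alternating two-phase loop described above can be sketched as follows. This is a minimal toy illustration, not VideoElevator's actual API: `t2v_denoise`, `t2i_enhance`, and the scalar "latents" are placeholders standing in for full diffusion denoisers operating on frame latents.

```python
import random

# Placeholder "models": each nudges a latent toward a cleaner estimate.
# In VideoElevator these would be real T2V / T2I diffusion denoisers.
def t2v_denoise(latents, t):
    """Phase 1 (temporal): smooth each frame toward its neighbours,
    mimicking motion refinement across time."""
    smoothed = []
    for i, x in enumerate(latents):
        left = latents[max(i - 1, 0)]
        right = latents[min(i + 1, len(latents) - 1)]
        smoothed.append(0.5 * x + 0.25 * (left + right))
    return smoothed

def t2i_enhance(latents, t):
    """Phase 2 (spatial): refine each frame independently,
    mimicking per-frame quality elevation by a T2I model."""
    return [x * 0.9 for x in latents]

def sample(num_frames=8, num_steps=10):
    """Run the decoupled sampling loop: temporal motion refining,
    then spatial quality elevating, at every denoising step."""
    random.seed(0)
    latents = [random.gauss(0.0, 1.0) for _ in range(num_frames)]
    for t in reversed(range(num_steps)):
        latents = t2v_denoise(latents, t)   # temporal consistency
        latents = t2i_enhance(latents, t)   # spatial detail
    return latents
```

The key design point mirrored here is the ordering: temporal refinement runs first so that the per-frame T2I enhancement does not break cross-frame consistency.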
Getting Started
Download Required Weights
VideoElevator requires several pre-trained weights for both T2V and T2I diffusion models. These can be downloaded into a specified directory (`checkpoints/`), and users can select the weights appropriate to their specific needs.
- Text-to-Video Models: Examples include LaVie, ZeroScope, and AnimateLCM.
- Text-to-Image Models: Popular versions such as StableDiffusion v1.5 and v2.1-base are also supported.
- Additionally, optional models from Civitai such as RCNZ Cartoon and RealisticVision can be used.
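One way to organize the downloads is to map each chosen model to a Hugging Face repository and a target folder under `checkpoints/`. The repo IDs below are illustrative assumptions, not taken from the VideoElevator README; verify them against the project's own instructions before use.

```python
from pathlib import Path

# Illustrative repo IDs (assumptions) -- check the project README
# for the exact weights VideoElevator expects.
MODEL_REPOS = {
    "zeroscope": "cerspense/zeroscope_v2_576w",              # T2V
    "animatelcm": "wangfuyun/AnimateLCM",                    # T2V
    "sd-2.1-base": "stabilityai/stable-diffusion-2-1-base",  # T2I
}

def download_plan(models, root="checkpoints"):
    """Map chosen model names to (repo_id, target_dir) pairs."""
    plan = []
    for name in models:
        repo_id = MODEL_REPOS[name]
        plan.append((repo_id, Path(root) / name))
    return plan

# To actually fetch the weights one might run (requires huggingface_hub):
#   from huggingface_hub import snapshot_download
#   for repo_id, target in download_plan(["zeroscope", "sd-2.1-base"]):
#       snapshot_download(repo_id=repo_id, local_dir=str(target))
```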
Installation Requirements
To set up VideoElevator, users should create a dedicated Python environment and install the packages listed in the provided requirements file.
Running Inference
VideoElevator provides a set of example scripts in the `example_scripts/` directory. A recommended starting point for enhanced text-to-video generation is `sd_animatelcm.py`. The script has modest hardware requirements, needing less than 11 GB of VRAM, so it runs even on older GPUs such as the 2080Ti.
Optional Customization
Users have the flexibility to adjust several hyper-parameters to tailor the video generation process:
- stable_steps: Controls which timesteps are used for motion refinement.
- stable_num: Sets the number of denoising steps performed by the T2V model.
These parameters can be fine-tuned to achieve different results, with guidance available through ablation studies on the project’s webpage.
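To make the two hyper-parameters concrete, the sketch below builds an illustrative per-timestep schedule: at each timestep listed in `stable_steps`, `stable_num` T2V denoising steps refine motion, while every other timestep uses the T2I model for spatial quality. The scheduling logic is an assumption for illustration, not VideoElevator's actual implementation.

```python
def build_schedule(total_steps=50, stable_steps=(40, 30, 20), stable_num=3):
    """Illustrative schedule: ("t2v", t) entries are motion-refinement
    steps, ("t2i", t) entries are spatial-enhancement steps. The names
    mirror VideoElevator's hyper-parameters, but the exact scheduling
    here is a guess for demonstration purposes."""
    schedule = []
    for t in range(total_steps - 1, -1, -1):  # denoise from noisy to clean
        if t in stable_steps:
            # Run several consecutive T2V steps at this timestep.
            schedule.extend([("t2v", t)] * stable_num)
        else:
            schedule.append(("t2i", t))
    return schedule
```

Raising `stable_num` or adding more entries to `stable_steps` trades per-frame detail for stronger temporal consistency, which matches the kind of trade-off the project's ablation studies explore.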
Acknowledgements
VideoElevator builds upon and integrates efforts from several other projects such as Diffusers, LaVie, AnimateLCM, and FreeInit. These contributions have been invaluable in the development and success of VideoElevator.
Through these capabilities, VideoElevator stands as a robust and efficient tool for enhancing video quality with precise and detailed visual representation, transforming how videos are generated from text prompts.