MOFA-Video: A New Approach to Image Animation
Overview
MOFA-Video is a project that tackles the challenge of animating static images with a generative approach built on video diffusion models. Presented at the European Conference on Computer Vision (ECCV) 2024, it was developed through a collaboration between researchers from The University of Tokyo and Tencent AI Lab. MOFA-Video turns a single image into an animated video by adapting generative motion fields within a frozen image-to-video diffusion model, enabling controllable animation from diverse motion control signals such as trajectories and keypoint sequences.
Key Features and Updates
- Keypoint-based Facial Animation: The team released an inference script for animating facial images from keypoint sequences, making facial animation more realistic and dynamic.
- Training Code Availability: The project has released training code for trajectory-based image animation, enabling researchers and developers to train the models on their own data and hardware.
- Hybrid Controls Release: MOFA-Video supports animation with hybrid controls, combining different types of signals such as trajectories and keypoint sequences; Gradio inference code is provided for this mode, giving considerable flexibility in creating realistic animations (a toy sketch of combining two control signals follows this list).
- Coming Soon - Online Demo: An online demo on HuggingFace Spaces is in the works, allowing users to try MOFA-Video directly in the browser.
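To make the idea of hybrid control a little more concrete, here is a minimal, purely illustrative sketch that blends two dense motion fields, one from a trajectory branch and one from a keypoint branch, with a simple spatial mask. MOFA-Video actually combines its control branches inside the frozen diffusion model, so this is only a toy stand-in; the function name, the mask, and the placeholder flows are all assumptions.

```python
import numpy as np

def merge_control_flows(traj_flow, keypoint_flow, face_mask):
    """Toy blend of two dense motion fields: keypoint-driven motion inside the
    face region, trajectory-driven motion everywhere else.

    traj_flow, keypoint_flow: (H, W, 2) dense flow fields from two control branches.
    face_mask: (H, W) array in [0, 1] marking the facial region.
    """
    mask = face_mask[..., None]
    return mask * keypoint_flow + (1.0 - mask) * traj_flow

# Example with placeholder flows for a 256x256 frame.
h, w = 256, 256
traj = np.tile(np.array([5.0, 0.0], dtype=np.float32), (h, w, 1))  # uniform pan to the right
kpts = 0.1 * np.random.randn(h, w, 2).astype(np.float32)           # stand-in facial motion
mask = np.zeros((h, w), dtype=np.float32)
mask[64:192, 64:192] = 1.0                                          # assumed face bounding box
hybrid = merge_control_flows(traj, kpts, mask)
print(hybrid.shape)  # (256, 256, 2)
```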
How MOFA-Video Works
MOFA-Video adapts motion from various domains into a frozen video diffusion model using two primary techniques:
- Sparse-to-Dense (S2D) Motion Generation: generates dense motion fields from sparse input control signals.
- Flow-based Motion Adaptation: uses the generated dense motion fields to guide the frozen image-to-video diffusion model so that the intended motion carries through to the output video.
During training, sparse control signals are generated automatically and used to learn the motion adaptation; at inference time, user-provided sparse signals control the frozen diffusion model.
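To give a feel for the sparse-to-dense step, the sketch below splats a few user-specified motion vectors into a dense flow field using Gaussian weighting. In MOFA-Video the S2D generation is learned by a network, so this is only a hand-rolled analogue; the function name and its parameters are illustrative assumptions.

```python
import numpy as np

def sparse_to_dense_flow(points, motions, height, width, sigma=20.0):
    """Hand-rolled stand-in for sparse-to-dense (S2D) motion generation.

    points:  (N, 2) array of (x, y) pixel positions of the sparse control points.
    motions: (N, 2) array of (dx, dy) displacements at those points.
    Returns a dense (height, width, 2) flow field built by Gaussian-weighted
    splatting of the sparse vectors.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    flow = np.zeros((height, width, 2), dtype=np.float32)
    weight = np.zeros((height, width), dtype=np.float32)
    for (px, py), (dx, dy) in zip(points, motions):
        w = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        flow[..., 0] += w * dx
        flow[..., 1] += w * dy
        weight += w
    return flow / np.clip(weight, 1e-6, None)[..., None]

# Example: two user-drawn trajectory hints on a 256x256 image.
dense = sparse_to_dense_flow(
    points=np.array([[64.0, 64.0], [192.0, 128.0]]),
    motions=np.array([[10.0, 0.0], [5.0, 3.0]]),
    height=256, width=256,
)
print(dense.shape)  # (256, 256, 2)
```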
Getting Started
To use MOFA-Video, users can follow these primary steps:
- Clone the Repository: Begin by cloning the GitHub repository to obtain the source code.
- Environment Setup: Create a Python virtual environment and install all required packages; a CUDA-capable GPU is needed.
- Downloading Checkpoints: Download the pretrained model checkpoints from the HuggingFace repository (see the download sketch after this list).
- Run Demos: Run the Gradio demos to test the animation capabilities using either audio-driven or video-driven facial animation.
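For the checkpoint step, a common way to fetch pretrained weights is the huggingface_hub client, as sketched below. The repository IDs and the local directory layout are assumptions; check the project README for the exact names and destinations it expects.

```python
# Sketch of the checkpoint download step using the huggingface_hub client.
# Repository IDs and the target directory are assumptions; verify them
# against the official MOFA-Video README before running.
from huggingface_hub import snapshot_download

ASSUMED_REPOS = ["MyNiuuu/MOFA-Video-Traj", "MyNiuuu/MOFA-Video-Hybrid"]

for repo_id in ASSUMED_REPOS:
    local_path = snapshot_download(
        repo_id=repo_id,
        local_dir=f"./ckpts/{repo_id.split('/')[-1]}",  # assumed local layout
    )
    print(f"Downloaded {repo_id} to {local_path}")
```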
Future Developments
MOFA-Video is continuously evolving, with plans to release training scripts for keypoint-based facial image animation in the near future. The released training code also lets collaborators train their own MOFA-Adapters to achieve custom animation effects.
Acknowledgements
MOFA-Video builds on the code and ideas of projects such as DragNUWA, SadTalker, AniPortrait, and many others; these projects laid the groundwork that MOFA-Video extends, showcasing the power of collaborative innovation in the AI community.
For those interested, a more detailed walkthrough and visual gallery can be found on the project page. This resource provides additional insights and practical examples of the work in action.
Citation
Please refer to the project's preprint on arXiv for academic citation:
```bibtex
@article{niu2024mofa,
  title={MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model},
  author={Niu, Muyao and Cun, Xiaodong and Wang, Xintao and Zhang, Yong and Shan, Ying and Zheng, Yinqiang},
  journal={arXiv preprint arXiv:2405.20222},
  year={2024}
}
```
MOFA-Video stands as a testament to the innovative strides being made in image animation, bridging the gap between static and dynamic visual experiences through advanced machine learning techniques.