Introduction to MimicMotion
MimicMotion is a groundbreaking project in the field of video generation technology, focusing on creating high-quality human motion videos. Designed by a team of researchers from Tencent and Shanghai Jiao Tong University, this project aims to overcome some of the most significant challenges in video generation today, such as enhancing controllability, extending video length, and enriching detail.
Key Features
MimicMotion stands out with several innovative aspects:
- Confidence-aware Pose Guidance: This feature ensures smooth temporal transitions in the video. By integrating confidence levels into pose estimations, MimicMotion achieves greater stability and robustness, particularly when trained on extensive datasets.
- Regional Loss Amplification: This technique targets areas prone to distortion, using pose confidence to weight the training loss in those regions and mitigate the distortion, resulting in clearer and more visually pleasing images.
- Progressive Latent Fusion Strategy: This allows longer videos to be generated without a significant increase in resource consumption by blending overlapping video segments in latent space, yielding a seamless viewing experience (see the sketch after this list).
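To make the overlap-blending idea behind progressive latent fusion concrete, here is a minimal sketch that merges consecutive latent segments with linearly ramped weights. It illustrates the general principle only; the function name, tensor layout, and blending schedule are assumptions, not the project's exact implementation.

import torch

def fuse_segments(segments, overlap):
    # Blend consecutive latent segments (each of shape [frames, C, H, W]) whose
    # trailing/leading `overlap` frames cover the same timestamps.
    fused = segments[0]
    # Linear ramp: frames nearer the boundary lean more on the incoming segment.
    ramp = torch.linspace(0, 1, overlap).view(-1, 1, 1, 1)
    for seg in segments[1:]:
        tail = fused[-overlap:]   # end of the video assembled so far
        head = seg[:overlap]      # start of the incoming segment
        blended = (1 - ramp) * tail + ramp * head
        fused = torch.cat([fused[:-overlap], blended, seg[overlap:]], dim=0)
    return fused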
Achievements
MimicMotion has demonstrated marked improvements over previous video generation methods through extensive tests and user evaluations. These enhancements are visible in the overall quality of the video, including richer details, smoother transitions, and the ability to produce longer videos.
Recent Updates
As of July 2024, an improved version of the model, labeled 1.1, has been released. This update raises the number of frames the model handles per generation pass from 16 to 72, significantly boosting video quality.
Getting Started
Environment Setup
For users interested in experimenting with MimicMotion, the project supports Python 3 and Torch 2.x and has been validated on Nvidia V100 GPUs. All dependencies can be installed with the following commands:
conda env create -f environment.yaml
conda activate mimicmotion
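After activating the environment, a quick sanity check with plain PyTorch (nothing project-specific) confirms that Torch 2.x is installed and a CUDA GPU is visible:

import torch

# Quick environment check: Torch version and CUDA visibility.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())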
Model Weights and Inference
Download the necessary model weights from Hugging Face. Once downloaded, you can test the model using the provided sample configuration file:
python inference.py --inference_config configs/test.yaml
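If you prefer to fetch the weights programmatically, the huggingface_hub library can download them. The repository ids and local paths below are assumptions for illustration; verify them against the project page before running.

from huggingface_hub import snapshot_download

# Assumed repository ids and local paths; check the project page for the exact ones.
snapshot_download(repo_id="tencent/MimicMotion", local_dir="models/MimicMotion")
snapshot_download(repo_id="stabilityai/stable-video-diffusion-img2vid-xt-1-1",
                  local_dir="models/SVD/stable-video-diffusion-img2vid-xt-1-1")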
Requirements
Generating the full 35-second demo video with the 72-frame model requires about 16GB of GPU VRAM and roughly 20 minutes on an RTX 4090. The 16-frame model needs only 8GB of VRAM for denoising, although the VAE decoder still demands 16GB; the decoder can alternatively be run on the CPU.
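For GPUs with only 8GB of VRAM, moving the memory-heavy VAE decode step to the CPU follows a simple pattern in plain PyTorch. The sketch below is a generic illustration; the attribute and method names follow the diffusers convention and are assumptions, not MimicMotion's documented interface.

import torch

def decode_on_cpu(vae, latents):
    # Move the decoder and latents to CPU for the memory-heavy decode step,
    # leaving the denoising network on the GPU.
    vae = vae.to("cpu")
    latents = latents.to("cpu", dtype=torch.float32)
    with torch.no_grad():
        frames = vae.decode(latents).sample  # DecoderOutput.sample, diffusers-style
    return frames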
Conclusion
MimicMotion represents a significant leap forward in video generation technology, providing a robust framework for creating high-quality human motion videos with greater ease. Its innovative approach addresses and overcomes the longstanding challenges in the field, paving the way for more immersive and realistic video content creation. For researchers and developers, MimicMotion offers a comprehensive, flexible tool to explore the frontiers of generative video technology.