GeneFace: Revolutionizing Audio-Driven 3D Talking Face Synthesis
GeneFace is a project from a team at Zhejiang University and ByteDance for audio-driven 3D talking face synthesis, presented in an ICLR 2023 paper. It proposes methods for creating lifelike, expressive talking faces with accurate lip synchronization to the input audio, even for out-of-domain audio, meaning it can handle speech and sounds that were not part of its training data.
Core Features and Advancements
GeneFace sets itself apart by producing high-fidelity and generalized results, making it suitable for a wide range of applications. The project provides a straightforward inference pipeline and reports improved lip-sync quality over previous methods based on NeRF (Neural Radiance Fields). A demonstration video illustrates these improvements through side-by-side comparisons.
MimicTalk and GeneFace++
In addition to the primary GeneFace framework, two enhanced versions, MimicTalk and GeneFace++, have been released:
- MimicTalk builds on NeRF-based techniques to model a specific person's talking face, significantly improving visual quality and additionally supporting talking-style customization.
- GeneFace++ extends the original GeneFace with better lip-sync accuracy, higher video quality, and improved system efficiency.
Recent Updates
- March 2023: A substantial update included:
- A RAD-NeRF-based renderer that supports real-time inference and reduces training time to 10 hours.
- A PyTorch-based deep 3D face reconstruction module that is easier to install and runs much faster than the previous version.
- A pitch-aware audio-to-motion module for improved lip-sync accuracy (see the sketch after this list).
- Numerous bug fixes that reduce memory usage.
- February 2023: Released a demonstration video driven by a Chinese song generated by the DiffSinger system, showing synchronization on out-of-domain audio.
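To make the pitch-aware audio-to-motion idea more concrete, below is a minimal sketch of how quantized pitch (f0) features might be fused with per-frame audio features before predicting facial motion. The class name, layer sizes, feature dimensions, and GRU backbone are illustrative assumptions and do not reproduce GeneFace's actual module.

```python
# Illustrative sketch only: fusing pitch with audio features for motion prediction.
# Dimensions, names, and the GRU backbone are assumptions, not GeneFace's implementation.
import torch
import torch.nn as nn

class PitchAwareAudio2Motion(nn.Module):
    def __init__(self, audio_dim=1024, pitch_bins=300, hidden=256, motion_dim=204):
        super().__init__()
        self.pitch_emb = nn.Embedding(pitch_bins, hidden)   # quantized f0 -> embedding
        self.audio_proj = nn.Linear(audio_dim, hidden)      # per-frame audio features
        self.backbone = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, motion_dim)           # e.g. 68 x 3 landmark offsets

    def forward(self, audio_feats, pitch_ids):
        # audio_feats: [B, T, audio_dim]; pitch_ids: [B, T] integer f0 bins
        x = torch.cat([self.audio_proj(audio_feats), self.pitch_emb(pitch_ids)], dim=-1)
        h, _ = self.backbone(x)          # temporal modeling over the frame sequence
        return self.head(h)              # [B, T, motion_dim] predicted facial motion

# Usage with dummy inputs (batch of 1, 100 frames):
model = PitchAwareAudio2Motion()
motion = model(torch.randn(1, 100, 1024), torch.randint(0, 300, (1, 100)))
```

The intuition is that explicit pitch cues carry prosody information that generic audio features can blur, which is one plausible way such a module could help lip-sync accuracy.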
Getting Started with GeneFace
GeneFace offers pre-trained models and processed datasets for users eager to experiment. The project provides a simple four-step process to start using these models:
- Environment Setup: Create a Python environment following the project's guide.
- Data Setup: Download and extract necessary datasets into designated directories.
- Data Processing: Follow the instructions to process the target video data, producing the output files required for inference.
- Inference Execution: Run the provided scripts to generate output videos that demonstrate the model's capabilities (see the sketch after this list).
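As a rough illustration of how these four steps fit together, here is a hypothetical driver script. The identity name, script paths, and flags are placeholders; the repository's own documentation gives the authoritative commands.

```python
# Hypothetical end-to-end driver for the steps above. Script names, paths, and
# flags are placeholders, not the repository's actual commands.
import subprocess
from pathlib import Path

VIDEO_ID = "May"                    # placeholder target-person identity
DATA_DIR = Path("data/processed")   # assumed location of the extracted datasets
CKPT_DIR = Path("checkpoints")      # assumed location of the pre-trained models

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Steps 1-2 (environment and data setup) are done once, outside this script.
    assert DATA_DIR.exists() and CKPT_DIR.exists(), "set up data and checkpoints first"
    # Step 3: process the target video into the intermediate files inference needs.
    run(["python", "data_gen/process_video.py", "--video_id", VIDEO_ID])
    # Step 4: drive the pre-trained model with an audio clip and render the result.
    run(["python", "inference/generate_video.py",
         "--video_id", VIDEO_ID,
         "--audio", "data/raw/val_wavs/demo.wav",
         "--out", f"outputs/{VIDEO_ID}_demo.mp4"])
```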
Custom Model Training
Beyond the example dataset, GeneFace supports training custom models for personalized videos. Instructions guide users through preparing a new video input, so they can record their own footage and train a model specific to that person (an illustrative outline follows).
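For orientation, the outline below sketches what custom training typically looks like; the identity name, preprocessing script, and config path are illustrative assumptions rather than GeneFace's exact commands.

```python
# Illustrative outline of training on a self-recorded video; names and paths are
# placeholders, not the repository's actual scripts or configs.
import subprocess

IDENTITY = "my_face"  # e.g. a new recording placed under the raw video directory

steps = [
    # Preprocess the recording: crop frames, fit 3D landmarks, estimate head poses.
    ["python", "data_gen/process_video.py", "--video_id", IDENTITY],
    # Train the person-specific model using an identity-specific config file.
    ["python", "tasks/train.py", "--config", f"egs/videos/{IDENTITY}/config.yaml"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)
```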
Acknowledgements and Contributions
GeneFace builds on resources from existing projects, such as NATSpeech, AD-NeRF, RAD-NeRF, and Deep3DFaceRecon, for several essential components. These upstream projects enrich GeneFace's capabilities and help ensure robust, high-quality output.
GeneFace represents a significant step forward in audio-driven 3D facial animation, with notable gains in synchronization and expressiveness. It is a valuable tool for anyone interested in high-fidelity digital face synthesis, from researchers to multimedia developers.