Project Overview: Text2Video
Text2Video is an innovative project presented at the 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). It stands out for its approach to generating videos directly from text input, leveraging advances in deep learning to bridge the gap between static text and dynamic video.
Introduction to Text2Video
The Text2Video project aims to synthesize videos that correspond to text input, focusing in particular on talking-head videos. Unlike traditional methods that rely heavily on audio data, this approach drives the video from text through a phoneme-pose dictionary: the method builds a dictionary mapping phonemes to key poses, interpolates between those poses, and trains a generative adversarial network (GAN) to render videos from the interpolated pose sequences.
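To make the core idea concrete, here is a minimal, hypothetical sketch of a phoneme-pose dictionary and of interpolating between two key poses; the phoneme names, keypoint shapes, and values are illustrative only and are not the project's actual data format.

```python
# Hypothetical illustration: each phoneme maps to a small array of 2D keypoints
# (e.g., mouth landmarks), and in-between frames are linearly interpolated.
import numpy as np

# Toy dictionary: phoneme -> (N, 2) keypoint array. Real entries would come
# from recorded video frames, not hand-written numbers.
phoneme_pose = {
    "AA": np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]]),
    "M":  np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]]),
}

def interpolate_poses(start, end, num_frames):
    """Linearly interpolate keypoint arrays between two key poses."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return [(1 - a) * start + a * end for a in alphas]

frames = interpolate_poses(phoneme_pose["M"], phoneme_pose["AA"], num_frames=5)
print(len(frames), frames[2])  # the middle frame lies halfway between the two poses
```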
Advantages over Traditional Methods
One of the standout features of Text2Video is its efficiency and adaptability:
- Reduced Data Requirements: Unlike audio-driven video generation techniques, Text2Video requires significantly less training data, making it more efficient.
- Enhanced Flexibility: It is resilient to speaker variations, which can often disrupt audio-driven models.
- Time Efficiency: The method reduces the time needed for preprocessing, training, and inference processes, allowing for faster video generation.
Data Handling and Preprocessing
Text2Video employs an organized setup process to manage its data and dependencies effectively. Here's a snapshot of the procedure:
- Clone the project repository.
- Install necessary packages and tools, such as the modified vid2vid repository, phonetic data, and certain dependencies for handling media files.
- Set up the environment by installing essential libraries and tools such as Sox, Zhon, MoviePy, FFmpeg, and Pydub (a quick dependency check is sketched after this list).
- For Chinese text input, additional tools are required, such as VOSK for speech recognition and libraries for processing Chinese text.
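As an aid to setup, the following is a hedged sketch of a dependency check for the tools listed above; it only verifies that the Python packages can be imported and that the command-line tools are on the PATH, and it is not part of the project's own scripts.

```python
# Minimal dependency check, assuming the package and tool names listed above.
import importlib.util
import shutil

PYTHON_PACKAGES = ["zhon", "moviepy", "pydub"]  # vosk is only needed for Chinese input
CLI_TOOLS = ["sox", "ffmpeg"]

def missing_dependencies():
    """Return a list of packages/tools that could not be found."""
    missing = []
    for pkg in PYTHON_PACKAGES:
        if importlib.util.find_spec(pkg) is None:
            missing.append(f"python package: {pkg}")
    for tool in CLI_TOOLS:
        if shutil.which(tool) is None:
            missing.append(f"command-line tool: {tool}")
    return missing

if __name__ == "__main__":
    problems = missing_dependencies()
    print("Missing:", problems if problems else "nothing, all dependencies found")
```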
Testing the Application
To test the application, users activate the virtual environment set up for the vid2vid tool and then run the provided scripts to generate videos. These scripts support generation with either real recorded audio or text-to-speech (TTS) synthesized audio, in both English and Chinese. The generation process involves specifying the input text, selecting a person, and choosing the gender of the voice; a sketch of how rendered frames and an audio track can be combined is shown below.
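For illustration, here is a minimal sketch of combining a rendered frame sequence with an audio track using MoviePy, one of the listed dependencies (assuming its 1.x API); the file paths, frame count, and frame rate are placeholders, and the project's own scripts may assemble the final video differently.

```python
# Hypothetical muxing step: attach real or TTS audio to rendered frames.
from moviepy.editor import AudioFileClip, ImageSequenceClip

frame_paths = [f"frames/{i:05d}.png" for i in range(250)]  # placeholder vid2vid output
clip = ImageSequenceClip(frame_paths, fps=25)              # assumed frame rate
clip = clip.set_audio(AudioFileClip("speech.wav"))         # real or TTS audio track
clip.write_videofile("talking_head.mp4", codec="libx264", audio_codec="aac")
```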
Training on Custom Data
The project also facilitates training with custom data, enabling users to tailor the model to specific phoneme or pinyin dictionaries. This process involves:
- Recording video with prompts designed to cover the full set of phonemes or pinyin.
- Creating a phoneme-to-mouth-shape or pinyin-to-mouth-shape dictionary (a rough sketch of this step follows the list).
- Running OpenPose on each video frame to extract 2D skeletal keypoints.
- Training the vid2vid model on these skeleton representations to produce realistic videos.
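As a rough illustration of the dictionary-building step, the sketch below assumes that per-frame mouth keypoints (e.g., exported from OpenPose) and phoneme time intervals from a forced aligner are already available; the data layout and helper names are hypothetical, not the project's actual code.

```python
# Hypothetical sketch: build a phoneme -> mouth-shape dictionary by picking
# the keypoints of the frame at the middle of each phoneme's interval.
import numpy as np

FPS = 25  # assumed video frame rate

def build_phoneme_dict(frame_keypoints, phoneme_intervals, fps=FPS):
    phoneme_dict = {}
    for phoneme, start, end in phoneme_intervals:
        mid_frame = int(round((start + end) / 2.0 * fps))
        mid_frame = min(mid_frame, len(frame_keypoints) - 1)
        # Keep the first occurrence; a real pipeline might average several takes.
        phoneme_dict.setdefault(phoneme, frame_keypoints[mid_frame])
    return phoneme_dict

# Toy usage with fabricated shapes, just to show the data flow.
frame_keypoints = [np.random.rand(20, 2) for _ in range(100)]      # one array per frame
phoneme_intervals = [("HH", 0.0, 0.12), ("AH", 0.12, 0.30), ("L", 0.30, 0.42)]
pose_dict = build_phoneme_dict(frame_keypoints, phoneme_intervals)
print(sorted(pose_dict.keys()))
```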
Finally, at inference time, input text is converted to a timestamped phoneme or pinyin list, matched to the corresponding 2D skeletal key poses, and rendered into video through the vid2vid framework.
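The inference-time matching can be pictured with a similar hedged sketch: given a timestamped phoneme list and a phoneme-pose dictionary (toy values below, not the project's format), each frame is assigned the key pose of the phoneme active at that time. A real pipeline would also smooth or interpolate across phoneme boundaries before rendering the skeletons with vid2vid.

```python
# Hypothetical sketch: timestamped phonemes -> one key pose per video frame.
import numpy as np

FPS = 25
pose_dict = {                                   # toy phoneme -> keypoints dictionary
    "SIL": np.zeros((3, 2)),
    "T":   np.full((3, 2), 0.5),
    "UW":  np.ones((3, 2)),
}
timed_phonemes = [("SIL", 0.0, 0.2), ("T", 0.2, 0.3), ("UW", 0.3, 0.6)]

def phonemes_to_pose_sequence(timed_phonemes, pose_dict, fps=FPS):
    total_frames = int(round(timed_phonemes[-1][2] * fps))
    poses = []
    for frame_idx in range(total_frames):
        t = frame_idx / fps
        # Hold the key pose of whichever phoneme covers this frame's timestamp.
        for phoneme, start, end in timed_phonemes:
            if start <= t < end:
                poses.append(pose_dict[phoneme])
                break
        else:  # past the last interval: hold the final phoneme's pose
            poses.append(pose_dict[timed_phonemes[-1][0]])
    return poses  # each entry would be drawn as a 2D skeleton image for vid2vid

sequence = phonemes_to_pose_sequence(timed_phonemes, pose_dict)
print(len(sequence), "frames of pose keypoints")
```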
Conclusion
Text2Video is a cutting-edge project that highlights the synergy of text processing and generative video modeling. Its efficient data handling, flexibility across different languages and speakers, and fast processing times make it a prominent tool in the realm of text-driven video synthesis. The project opens new avenues for applications where dynamic visual content is necessary from static text inputs, offering a blend of technological innovation and practicality.