Introduction to Bert-VITS2-ext Project
The Bert-VITS2-ext project aims to extend Bert-VITS2, in particular by synchronizing Text-To-Speech (TTS) with the generation of facial expression data. The project showcases its results through demos on platforms such as Bilibili and YouTube, including facial expressions generated from singing and a comparison with expressions generated using Azure TTS.
Extended to CosyVoice
The project has been extended to expression testing with CosyVoice, integrating more sophisticated facial expression generation into that TTS framework.
Extended to GPT-SoVITS
Integration attempts with GPT-SoVITS showed mixed results: retraining directly on GPT-SoVITS did not work well, so tests were instead run by combining it with models from Bert-VITS2-ext, at the cost of redundant computation and less accurate predictions.
Text-To-Speech (TTS)
The TTS system is built on the v2.3 Final Release of Bert-VITS2. Performance can differ between versions, particularly on purely Chinese datasets, so users who only need Chinese may prefer an older or mixed version. The project documents how to train TTS across the different versions and specifically recommends the 1.0 version.
Synchronizing TTS with Expressions
Concept
The concept follows the network architecture of the VITS paper: the text encoding is transformed into latent variables (z) before being decoded into audio. By tapping these latent variables, additional layers (an LSTM followed by an MLP) can predict facial expression values independently of the original network.
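As a rough illustration, the extra layers can be pictured as a small prediction head on top of the VITS latents. The following PyTorch sketch is only an assumption of what such a head could look like; the class name, layer sizes, and channel counts are illustrative, not the project's actual code:

import torch.nn as nn

class VisemeHead(nn.Module):
    # Hypothetical sketch: map per-frame VITS latents z to per-frame blendshape weights.
    def __init__(self, z_dim=192, hidden_dim=256, num_blendshapes=61):
        super().__init__()
        self.lstm = nn.LSTM(z_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_blendshapes),
            nn.Sigmoid(),  # blendshape weights typically lie in [0, 1]
        )

    def forward(self, z):
        # z: (batch, frames, z_dim), taken before the VITS decoder
        h, _ = self.lstm(z)
        return self.mlp(h)  # (batch, frames, num_blendshapes)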
Data Collection
- Configure Live Link Face targets to the local machine's IP address, using the default port 11111.
- Simultaneously collect audio and the corresponding Live Link expression values, storing them in the records directory.
Script Usage:
python ./motion/record.py
- Validate the collected data by previewing the weights curve stored in a .npy file (see the sketch after this list).
- Test data synchronization by sending the recorded data to MetaHuman and checking that audio and visual playback match.
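For the validation step, a minimal sketch for previewing a recorded weights curve might look like this (matplotlib and the assumption that the array is shaped frames x blendshapes are illustrative, not project requirements):

import sys
import numpy as np
import matplotlib.pyplot as plt

# Load a recorded blendshape file, assumed to be shaped (frames, num_blendshapes).
weights = np.load(sys.argv[1])
print("shape:", weights.shape)

# Plot the first few channels to confirm the capture looks plausible.
plt.plot(weights[:, :5])
plt.xlabel("frame")
plt.ylabel("blendshape weight")
plt.show()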
Data Preparation
- Encode the audio files in the records directory with the posterior encoder, storing the latent variables in .z.npy format.
- Prepare the training and validation file lists.
Command:
python ./motion/prepare_visemes.py
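After this step, a quick sanity check is to confirm that each latent file pairs up with its blendshape recording. The sketch below assumes a particular directory layout and naming scheme (foo.z.npy next to foo.npy), which may differ from the project's actual conventions:

import glob
import numpy as np

for z_path in glob.glob("./records/*.z.npy"):
    z = np.load(z_path)  # latent variables from the posterior encoder
    bs_path = z_path.replace(".z.npy", ".npy")  # assumed matching blendshape file
    bs = np.load(bs_path)
    print(z_path, "z shape:", z.shape, "blendshape shape:", bs.shape)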
Training and Inference
Training uses the main training script with a dedicated flag to distinguish this training from that of the main network:
python train_ms.py -m OUTPUT_MODEL --config ./configs/config.json --visemes
Inference outputs both audio and animation data; the animation frame rate is 86.1328125 fps, derived from the Bert-VITS2 audio sampling rate.
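For reference, this frame rate follows from dividing the sampling rate by the hop length; assuming Bert-VITS2's 44,100 Hz sampling rate and a 512-sample hop (the default in its config), a quick check:

# One latent frame is produced per hop of audio samples.
sample_rate = 44100  # Hz
hop_length = 512     # samples per frame (assumed from the default config)
print(sample_rate / hop_length)  # 86.1328125 frames per second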
Visualize the results using:
python ./motion/tts2ue.py --bs_npy_file ./tmp.npy --wav_file ./tmp.wav --delay_ms 700
From Sound to Expression
The approach converts sound to latent variables (z), which are then mapped to expressions. This requires audio to be in 44,100 Hz WAV format with a single channel.
Conversion and Processing Example:
ffmpeg -i input_file -ss 00:00:00 -t 00:00:10 -ar 44100 -f wav test.wav
ffmpeg -i test.wav -map_channel 0.0.0 output.wav
python ./motion/wav_to_visemes.py output.wav
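Before running wav_to_visemes.py, it can help to confirm that the file really is 44,100 Hz mono; a small check using the soundfile package (an assumed helper, not necessarily a project dependency):

import soundfile as sf

data, sr = sf.read("output.wav")
channels = 1 if data.ndim == 1 else data.shape[1]
print("sample rate:", sr, "channels:", channels)
assert sr == 44100 and channels == 1, "expected 44.1 kHz mono input"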
Body Animation
MotionGPT
MotionGPT uses LLM-generated action descriptions to produce body motion that matches the speech and expressions, enabling interactive scene choreography. The code is adapted from the MotionGPT project and currently supports motion transitions. However, because the generated skeleton positions do not match Unreal Engine (UE) directly, the output cannot be used in UE as-is and requires skeleton conversion and mapping.
audio2photoreal
This part adapts the original audio2photoreal project so that animation data can be exported locally during web-interface inference, following the instructions here.
Bert-VITS2 Acknowledgement
Bert-VITS2 uses a multilingual BERT model, and its development was heavily inspired by MassTTS. This project is an open extension intended for independent learning and further development by developers, who are expected to adhere to applicable laws and ethical standards.
Contributors are acknowledged here.