Introduction to the VITS_Chinese Project
The VITS_Chinese project advances Text-to-Speech (TTS) research by combining BERT-based prosody modeling with the VITS framework and selected natural-speech techniques from Microsoft's NaturalSpeech. It is designed for study and experimentation with TTS systems rather than for direct production use.
Key Features and Technologies
- Prosody Embedding with BERT: BERT hidden states serve as prosody embeddings, letting the model learn grammar-aware pausing that improves the fluency and expressiveness of the generated speech (a minimal sketch of the idea follows this list).
- Infer Loss from NaturalSpeech: adopting the infer loss from Microsoft's NaturalSpeech reduces synthesis errors, yielding clearer and more natural-sounding audio.
- High-Quality Audio with the VITS Framework: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) provides the end-to-end backbone for high-quality speech synthesis.
- Module-wise Distillation for Speed: module-wise distillation speeds up training and yields smaller, faster models without compromising quality.
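To make the prosody idea concrete, here is a minimal sketch of extracting character-level hidden states from a Chinese BERT using the Hugging Face transformers library. The checkpoint name and layer choice are illustrative assumptions; the project's own prosody_model.pt may differ in architecture and in how the features are fused with phoneme embeddings.

import torch
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; the project ships its own prosody_model.pt.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

text = "今天天气很好。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs, output_hidden_states=True)

# Per-character hidden states act as prosody features that the TTS
# encoder can fuse with phoneme embeddings.
prosody = outputs.hidden_states[-1][0]  # shape: (seq_len, 768)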
Recommendations for Model Training
A recommended practice within the project is to fine-tune the model with the infer loss after the initial training phase, freezing the PosteriorEncoder during fine-tuning to optimize performance. Users are also encouraged to experiment with the loss_kl_r weight to refine audio quality further. A sketch of the freezing step follows.
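This is a minimal sketch only, assuming the standard VITS attribute name enc_q for the PosteriorEncoder and an already constructed synthesizer net_g; the repository's training scripts may organize this differently.

import torch

# Freeze the posterior encoder so only the rest of the network fine-tunes.
# enc_q is the usual VITS attribute name and an assumption here.
for param in net_g.enc_q.parameters():
    param.requires_grad = False

# Optimize only the parameters that remain trainable.
optimizer = torch.optim.AdamW(
    (p for p in net_g.parameters() if p.requires_grad), lr=1e-5
)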
Notable Design Decisions
The VITS_Chinese project has not migrated to VITS2, primarily because VITS2 replaces the flow's WaveNet module with a Transformer, which conflicts with the project's aim of a streamlined, CNN-based TTS implementation.
Practical Demonstrations and Accessibility
The project hosts an online demonstration on Hugging Face Spaces, letting users experience the system firsthand and get a tangible sense of its capabilities.
Installation and Setup
To get started, install the necessary dependencies and build the monotonic alignment search (MAS) extension with the following commands:
pip install -r requirements.txt
cd monotonic_align
python setup.py build_ext --inplace
Model Inference
For inference, pretrained models are readily available from the release page. Users can follow these steps:
- Place prosody_model.pt at ./bert/prosody_model.pt
- Place vits_bert_model.pth at ./vits_bert_model.pth
Execute the following for inference:
python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth
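For orientation, the steps inside such a script typically follow the upstream VITS codebase, sketched below. The helper names (utils.get_hparams_from_file, utils.load_checkpoint, SynthesizerTrn) come from upstream VITS, and the placeholder inputs are assumptions; the real front-end also injects BERT prosody features.

import torch
import utils                          # upstream VITS helpers (assumed present)
from models import SynthesizerTrn     # upstream VITS synthesizer (assumed)

hps = utils.get_hparams_from_file("./configs/bert_vits.json")
n_symbols = 256                       # placeholder; use the front-end's symbol count
net_g = SynthesizerTrn(
    n_symbols,
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).eval()
utils.load_checkpoint("vits_bert_model.pth", net_g, None)

# Placeholder phoneme sequence; the real script builds this from text + BERT.
phoneme_ids = torch.randint(0, n_symbols, (1, 50))
phoneme_lengths = torch.LongTensor([50])
with torch.no_grad():
    audio = net_g.infer(phoneme_ids, phoneme_lengths,
                        noise_scale=0.5, length_scale=1.0)[0][0, 0]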
Advanced Inference Options
The project supports chunked wave streaming inference, allowing for real-time, low-latency applications:
python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth
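Conceptually, chunked streaming vocodes the latent frames a slice at a time and yields audio as soon as each slice is decoded, instead of waiting for the whole utterance. A minimal sketch, assuming the usual VITS attribute name dec for the waveform decoder:

import torch

def stream_chunks(net_g, z, chunk_frames=50):
    # z: latent frames of shape (1, channels, T) from the flow/prior;
    # net_g.dec is the HiFi-GAN-style decoder (assumed attribute name).
    total = z.size(2)
    for start in range(0, total, chunk_frames):
        piece = z[:, :, start:start + chunk_frames]
        with torch.no_grad():
            audio = net_g.dec(piece)
        yield audio[0, 0].cpu()  # play or buffer each chunk immediately

Real implementations usually add some overlap between adjacent chunks to avoid audible boundary artifacts.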
Enhancements and Adaptations
- Non-BERT Inference: A specialized inference module allows usage without BERT, providing versatility for devices with lower computational resources.
- ONNX Export and Streaming: The project supports exporting models to ONNX for both non-streaming and streaming use, improving device compatibility; a rough export sketch follows this list.
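For the non-streaming case, export might look like the sketch below using torch.onnx.export. The repository ships its own export scripts, so the wrapped input signature, axis names, and opset version here are illustrative assumptions (net_g is the synthesizer built as in the inference sketch above).

import torch

dummy_ids = torch.randint(0, 256, (1, 50))   # placeholder phoneme sequence
dummy_lengths = torch.LongTensor([50])

torch.onnx.export(
    net_g,                                   # synthesizer wrapped for export
    (dummy_ids, dummy_lengths),
    "bert_vits.onnx",
    input_names=["phoneme_ids", "phoneme_lengths"],
    output_names=["audio"],
    dynamic_axes={"phoneme_ids": {1: "text_len"},
                  "audio": {2: "audio_len"}},
    opset_version=13,
)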
Training and Compression Techniques
The project also extends into model compression through knowledge distillation, reducing model size while roughly tripling inference speed.
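To illustrate the idea (not the project's exact recipe), a module-wise distillation loss matches a student module's output against the corresponding frozen teacher module; enc_p, the usual VITS text-encoder attribute, is an assumption here.

import torch
import torch.nn.functional as F

def module_distill_loss(teacher, student, x, x_lengths):
    # Teacher runs frozen; only the student receives gradients.
    with torch.no_grad():
        t_out = teacher.enc_p(x, x_lengths)[0]
    s_out = student.enc_p(x, x_lengths)[0]
    return F.mse_loss(s_out, t_out)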
To explore training:
python train.py -c configs/bert_vits_student.json -m bert_vits_student
Notable Credits and Contributors
The VITS_Chinese project is built upon an amalgamation of various existing technologies and research, crediting significant contributions from:
- Microsoft's NaturalSpeech
- FastSpeech2 and related BERT prosody research
- The ONNX and Android contributions for further device compatibility
For more information and resources, visit the project's GitHub repository.
This introduction offers a comprehensive overview of the features and capabilities of the VITS_Chinese project, a rich platform for studying and experimenting with TTS algorithms.