Chinese-FastSpeech2 - Integrating Prosody for Improved Chinese Speech Synthesis with FastSpeech2

Project Overview of Chinese-FastSpeech2

Chinese-FastSpeech2 adapts the FastSpeech2 model to Chinese speech synthesis. It extends FastSpeech2 with prosody representation and prediction modules so that the synthesized Chinese speech sounds more lively and rhythmic. The model was trained on the Biaobei Standard Mandarin Female Voice dataset.

Updates as of April 2, 2023

Inclusion of Prosody Training Code: The project now includes code for training prosody models, found under the BertProsody directory.
Data Preprocessing Code for Prosody Training: A data preprocessing script tailored for the Biaobei dataset has been added at preprocessor/biaobei.py. The script is still rough, but it is usable as a starting point.

Samples

Generated audio samples are provided for reference and demonstrate the model's speech synthesis capabilities.

Model Files

The main architecture of the project consists of FastSpeech2 combined with HifiGAN, enhanced by the inclusion of Chinese text prosody vectors at the input stage. As a result, the project comprises three models:

fastspeech_model (File: 8000.pth.tar) → Place in output/ckpt/biaobei/
hifigan_model (File: generator_universal.pth.tar) → Place in hifigan/
prosody_model (File: best_model.pt) → Place in transformer/prosody_model/

The models can be downloaded from this link with the extraction code: qgpi.
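
Before running synthesis, it can help to confirm that the three files ended up where the list above says they belong. The following is a minimal sketch, not part of the project: the helper name missing_checkpoints and the EXPECTED_CHECKPOINTS table are hypothetical, and the paths are simply the ones listed above, taken relative to the repository root.

```python
from pathlib import Path

# Checkpoint locations as listed above, relative to the repository root.
# The dictionary and function names are hypothetical, for illustration only.
EXPECTED_CHECKPOINTS = {
    "fastspeech_model": Path("output/ckpt/biaobei/8000.pth.tar"),
    "hifigan_model": Path("hifigan/generator_universal.pth.tar"),
    "prosody_model": Path("transformer/prosody_model/best_model.pt"),
}


def missing_checkpoints(repo_root="."):
    """Return the names of the models whose checkpoint file is not found."""
    root = Path(repo_root)
    return [
        name
        for name, relative_path in EXPECTED_CHECKPOINTS.items()
        if not (root / relative_path).is_file()
    ]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        for name in missing:
            print(f"missing {name}: expected at {EXPECTED_CHECKPOINTS[name]}")
    else:
        print("all three checkpoints are in place")
```

Run from the repository root, the script lists any model whose file is absent, which is usually quicker to diagnose than a load error raised in the middle of synthesis.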

Prediction Methods

The project provides two methods for synthesizing speech:

Interactive Synthesis: Run python synthesize_all.py and enter the text to be converted into speech at the command line. The output is written to a file named tmp.wav in the current working directory.
API Call: Run tts_server.py to launch a text-to-speech service that is accessed through an HTTP API; TestServer.py demonstrates a client call. The resulting audio file (tmp.wav) is likewise stored in the current working directory.
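
For the API method, a client might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the URL, port, route, and the JSON field name "text" are hypothetical placeholders, and the real values should be taken from tts_server.py and TestServer.py. The sketch only uses the Python standard library.

```python
import json
import urllib.request

# Hypothetical endpoint. The actual host, port, and route are defined in
# tts_server.py; replace this value with what the server really exposes.
TTS_URL = "http://127.0.0.1:8080/tts"


def build_tts_request(text, url=TTS_URL):
    """Build (but do not send) a POST request carrying the text to synthesize.

    The JSON body with a "text" field is an assumed request format.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def synthesize(text, url=TTS_URL, timeout=60):
    """Send the request to a running server and return the raw response body."""
    request = build_tts_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the server running, calling synthesize("你好，世界") would send the text to the assumed endpoint. According to the description above, the audio itself is written to tmp.wav in the server's working directory, so the response body may carry only a status message.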

Training Process

For custom training, follow the detailed training procedure of the FastSpeech2 project. Chinese-FastSpeech2 adds several optimizations on top of the base FastSpeech2 method; they are described in the blog post: Optimization of Chinese Speech Synthesis based on FastSpeech2.

Chinese-FastSpeech2 is a personal project for exploring advances in speech synthesis. Feedback, criticism, and discussion are welcome.