Introduction to Mandarin-TTS Project
The Mandarin-TTS (MTTS) project is a comprehensive framework designed to convert Mandarin text into speech. This framework is particularly beneficial for researchers and developers looking to advance their projects swiftly. With its modular structure and a variety of features, MTTS provides flexibility and ease of use for customizing speech synthesis applications.
Key Features
- Configurable Modules: MTTS allows users to configure all of its modules through YAML files, making it easy to tailor the framework to specific needs without extensive changes to the source code (a minimal configuration workflow is sketched after this list).
- Versatile Embeddings: The framework supports speaker embeddings, prosody embeddings, and multi-stream text embeddings, making it possible to fine-tune the voice output along variables such as speaker identity and expressiveness.
- Diverse Vocoder Options: MTTS supports several vocoders, including VocGAN, HiFi-GAN, Waveglow, and MelGAN. An adaptable interface lets users compare vocoders and pick the one best suited to their purposes.
- Predictive Features: It includes duration, pitch, and energy variance predictors, and the framework is designed so that additional predictors can be incorporated to improve output quality.
- Contribution Friendly: MTTS is open to contributions, inviting developers and researchers to extend its functionality.
Audio Samples
To hear what MTTS can do, check out the project's demo on Bilibili, which showcases audio samples from the Aishell3 dataset. The GitHub page also hosts samples for both the Biaobei and Aishell3 datasets.
Getting Started
Installation
MTTS can be installed by cloning the repository and setting it up with Python's pip package manager:
git clone https://github.com/ranchlai/mandarin-tts.git
cd mandarin-tts
git submodule update --force --recursive --init --remote
pip install -e .
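A quick sanity check that the editable install succeeded (this check is our suggestion, not part of the project's documentation; the package name mtts matches the scripts referenced below):
python -c "import mtts; print('MTTS import OK')"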
Training Your Model
To train a model, MTTS provides examples such as Biaobei and Aishell3. Here's a brief walkthrough of the Aishell3 example (a consolidated sketch with concrete paths follows these steps):
- Prepare Melspectrogram Features: Navigate to the examples folder and run:
python wav2mel.py -c ./aishell3/config.yaml -w <aishell3_wav_folder> -m <mel_folder> -d cpu
- Prepare SCP Files: Generate the necessary SCP files with:
python prepare.py --wav_folder <aishell3_wav_folder> --mel_folder <mel_folder> --dst_folder ./train/
- Begin Training: Start the training process:
python ../../mtts/train.py -c config.yaml -d cuda
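Putting the three steps together, a consolidated session might look like this. The data and output directories are illustrative placeholders, and the working-directory layout is an assumption based on the relative paths above:
cd examples                        # step 1 runs from the examples folder
python wav2mel.py -c ./aishell3/config.yaml -w ~/data/aishell3/wav -m ./aishell3/mels -d cpu
python prepare.py --wav_folder ~/data/aishell3/wav --mel_folder ./aishell3/mels --dst_folder ./aishell3/train/
cd aishell3                        # assumed location of config.yaml for training
python ../../mtts/train.py -c config.yaml -d cuda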
The Biaobei example follows a similar process; it omits speaker embeddings but supports prosody embedding.
Synthesis
Pretrained Checkpoints
MTTS provides pretrained checkpoints for different datasets on Zenodo. Checkpoints and matching configuration files for the Aishell3 and Biaobei datasets are linked in the documentation.
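Fetching a release might look like the following; the URLs are placeholders that must be replaced with the actual Zenodo links from the documentation:
mkdir -p checkpoints
wget <zenodo_checkpoint_url> -O ./checkpoints/checkpoint_1240000.pth.tar   # placeholder URL
wget <zenodo_config_url> -O ./config.yaml                                  # placeholder URL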
Supported Vocoders
MTTS uses various vocoders to convert melspectrograms into audio waveforms. Vocoders are included as Git submodules, and their checkpoints must be downloaded manually. Supported vocoders include Waveglow, HiFi-GAN, VocGAN, and MelGAN, each maintained in its own GitHub repository.
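Since the vocoders live in Git submodules and their checkpoints are not bundled, setup roughly amounts to ensuring the submodules are present and downloading a checkpoint from the chosen vocoder's repository. A sketch, where the destination folder and file name are assumptions rather than required paths:
git submodule update --init --recursive      # in case submodules were skipped during installation
mkdir -p vocoder_checkpoints
wget <hifigan_checkpoint_url> -O vocoder_checkpoints/hifigan.pth   # placeholder URL from the HiFi-GAN repo
# point the vocoder section of config.yaml at the downloaded file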
Synthesizing Speech
Once the configurations and checkpoints are in place, users can synthesize speech using:
python ../../mtts/synthesize.py -d cuda -c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt
Successful synthesis will generate audio examples in the specified output folder.
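An illustrative end-to-end session follows. The exact line format of input.txt depends on the configured text streams, so the content shown is purely hypothetical (consult the repo's example inputs), and the output folder name is likewise an assumption set by config.yaml:
echo "ni3 hao3" > input.txt    # hypothetical content; follow the repo's input examples
python ../../mtts/synthesize.py -d cuda -c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt
ls ./outputs                   # illustrative folder; the actual path comes from config.yaml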
MTTS presents itself as a robust platform for Mandarin text-to-speech synthesis. It offers flexibility, a variety of options, and an easy-to-use framework that encourages both development and contribution.