SummerTTS: A Tribute to the Summer of 2023
Overview
SummerTTS is an independent, locally-run text-to-speech (TTS) program for synthesizing both Chinese and English speech. It stands out for working entirely offline and having no external dependencies: compile it once and it is ready to use.
Technical Details
At the heart of SummerTTS is Eigen, a header-only C++ template library for linear algebra. This makes SummerTTS self-sufficient in a C++ environment, with no need for external frameworks such as PyTorch or TensorFlow. The program has run successfully on Ubuntu and should generally be compatible with other Linux-based systems, such as Android and Raspberry Pi; it has not been tested on Windows, which might require some adjustments.
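As a rough illustration of why the header-only design matters, a snippet like the following compiles with nothing beyond the Eigen headers on the include path; it is illustrative only and not taken from SummerTTS:

```cpp
#include <iostream>
#include <Eigen/Dense>

int main()
{
    // A random 3x3 weight matrix and an input vector of ones.
    Eigen::MatrixXf weights = Eigen::MatrixXf::Random(3, 3);
    Eigen::VectorXf input   = Eigen::VectorXf::Ones(3);

    // Matrix-vector product, evaluated entirely by header-only templates;
    // no library needs to be linked, e.g. g++ -I/path/to/eigen example.cpp
    Eigen::VectorXf output = weights * input;
    std::cout << output.transpose() << std::endl;
    return 0;
}
```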
The project's models build on the VITS speech synthesis algorithm, with inference ported to a C++ environment for better performance.
Features and Updates
- June 16, 2023: Introduced a faster English speech synthesis model (`single_speaker_english_fast.bin`) with minimal quality compromise.
- June 15, 2023: Added support for pure English speech synthesis via updated code and the `single_speaker_english.bin` model.
- June 9, 2023: Added a medium-sized single-speaker model (`single_speaker_mid.bin`) with improved audio quality but slower synthesis than the earlier models.
- June 8, 2023: Modified `test/main.cpp` to support synthesizing text either line by line or as a whole.
- June 2, 2023: Significantly improved polyphonic pronunciation accuracy; updated models are required for these enhancements.
- May 30, 2023: Integrated WeTextProcessing for better text normalization, improving the pronunciation of numbers, currencies, temperatures, dates, etc.
- May 23, 2023: Speed enhancements for single-speaker models.
- April 21, 2023: Initial creation.
Usage Instructions
To use SummerTTS:
- Clone the project code onto a local machine, ideally one running Ubuntu Linux.
- Download the necessary models from the provided Baidu cloud link and place them in the `models` directory. The expected structure is:

```
models/
├── multi_speakers.bin
├── single_speaker_mid.bin
├── single_speaker_english.bin
├── single_speaker_english_fast.bin
└── single_speaker_fast.bin
```
- Navigate to the `Build` directory and execute:

```
cmake ..
make
```
- Upon successful compilation, execute the following to test speech synthesis:
  - For Chinese: `./tts_test ../test.txt ../models/single_speaker_fast.bin out.wav`
  - For English: `./tts_test ../test_eng.txt ../models/single_speaker_english.bin out_eng.wav`
System Details
The command parameters are:
- The path to a text file containing the text to synthesize.
- The path to the model file, which may be a single-speaker or multi-speaker model; `single_speaker_fast.bin` is recommended for balanced speed and quality.
- The path for the output synthesized audio file, playable via any media player (a minimal WAV-writing sketch follows this list).
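The output is a WAV file built from the 16-bit samples the synthesizer produces. As a rough, self-contained sketch of how such a file can be written (assuming mono 16-bit PCM on a little-endian host; the sample rate is a placeholder and must match what the model actually generates):

```cpp
#include <cstdint>
#include <cstdio>

// Minimal mono 16-bit PCM WAV writer. The sample rate is an assumption;
// pass the rate the model actually produces.
static bool writeWav(const char * path, const int16_t * samples,
                     int32_t sampleCount, int32_t sampleRate)
{
    FILE * f = fopen(path, "wb");
    if (!f) return false;

    const int32_t dataBytes  = sampleCount * 2;   // 16-bit mono
    const int32_t chunkSize  = 36 + dataBytes;    // RIFF chunk size
    const int32_t byteRate   = sampleRate * 2;    // rate * channels * bytes/sample
    const int32_t fmtSize    = 16;                // PCM fmt chunk size
    const int16_t pcm = 1, channels = 1, blockAlign = 2, bitsPerSample = 16;

    // RIFF header, fmt chunk, then the raw sample data.
    fwrite("RIFF", 1, 4, f); fwrite(&chunkSize, 4, 1, f);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmtSize, 4, 1, f);
    fwrite(&pcm, 2, 1, f); fwrite(&channels, 2, 1, f);
    fwrite(&sampleRate, 4, 1, f); fwrite(&byteRate, 4, 1, f);
    fwrite(&blockAlign, 2, 1, f); fwrite(&bitsPerSample, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&dataBytes, 4, 1, f);
    fwrite(samples, 2, sampleCount, f);
    fclose(f);
    return true;
}
```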
Developer Details
The synthesis program is implemented in `test/main.cpp`, whereas the main interface is defined in `include/SynthesizerTrn.h`. The key function is:

```cpp
int16_t * infer(const string & line, int32_t sid, float lengthScale, int32_t & dataLen)
```

- `line`: the string of text to synthesize.
- `sid`: the speaker ID; only relevant for multi-speaker models.
- `lengthScale`: controls speech speed; higher values slow the speech down.
- `dataLen`: an output parameter that receives the length of the synthesized audio data.

The text normalization module handles Arabic numerals and punctuation; English characters are currently not processed there.
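As a hedged usage sketch: the `infer` signature comes from the interface documented above, but how a `SynthesizerTrn` instance is constructed, and who owns the returned buffer, depend on `include/SynthesizerTrn.h` and are not shown here.

```cpp
#include <cstdint>
#include <string>
#include "SynthesizerTrn.h"  // assumed to be on the include path

// Sketch: synthesize one line with an already-constructed synthesizer.
// speakerId is ignored by single-speaker models; lengthScale > 1.0 slows
// the speech down. sampleCount receives the length of the audio data.
int16_t * synthesizeLine(SynthesizerTrn & synth, const std::string & text,
                         int32_t & sampleCount)
{
    const int32_t speakerId   = 0;
    const float   lengthScale = 1.0f;
    return synth.infer(text, speakerId, lengthScale, sampleCount);
}
```

The returned buffer and its length can then be written out, for example with a WAV writer like the sketch in the previous section.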
Future Development
Plans include open-sourcing the model training and conversion scripts and developing models with further improved audio quality.
Contact
For further inquiries, reach out via email at [email protected] or add on WeChat: hwang_2011.
Acknowledgments
Thanks to these projects and datasets for their contributions:
- Eigen
- vits, vits_chinese, MB-iSTFT-VITS
- WeTextProcessing
- Auxiliary projects: glog, gflags, openfst, hanz2piny, cppjieba, g2p_en, English-to-IPA
- Training data from open datasets: AISHELL-3 for the multi-speaker models and LJ Speech for the English single-speaker models.