SummerTTS: A Tribute to the Summer of 2023
Overview
SummerTTS is an independent, locally-run text-to-speech (TTS) program for synthesizing both Chinese and English speech. It stands out for working entirely offline and having no external dependencies: compile it once and it is ready to use.
Technical Details
At the heart of SummerTTS is Eigen, a header-only C++ template library for linear algebra. This makes SummerTTS self-sufficient in a C++ environment, with no need for external frameworks such as PyTorch or TensorFlow. The program has run successfully on Ubuntu and should generally be compatible with other Linux-based systems, such as Android and Raspberry Pi; it has not been tested on Windows, which might require some adjustments.
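As a rough illustration of why the header-only design matters, a snippet like the following compiles with nothing beyond the Eigen headers on the include path; it is illustrative only and not taken from SummerTTS:

```cpp
#include <iostream>
#include <Eigen/Dense>

int main()
{
    // A random 3x3 weight matrix and an input vector of ones.
    Eigen::MatrixXf weights = Eigen::MatrixXf::Random(3, 3);
    Eigen::VectorXf input   = Eigen::VectorXf::Ones(3);

    // Matrix-vector product, evaluated entirely by header-only templates;
    // no library needs to be linked, e.g. g++ -I/path/to/eigen example.cpp
    Eigen::VectorXf output = weights * input;
    std::cout << output.transpose() << std::endl;
    return 0;
}
```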
The project's models build on the VITS speech synthesis algorithm, with inference ported to a C++ environment for better performance.
Features and Updates
- June 16, 2023: Introduced a faster English speech synthesis model (`single_speaker_english_fast.bin`) with minimal quality compromise.
- June 15, 2023: Added support for pure English speech synthesis via updated code and the `single_speaker_english.bin` model.
- June 9, 2023: Added a medium-sized single-speaker model (`single_speaker_mid.bin`) with improved audio quality but slower synthesis than the earlier models.
- June 8, 2023: Modified `test/main.cpp` to support synthesizing text either line by line or as a whole.
- June 2, 2023: Significantly improved polyphonic pronunciation accuracy; updated models are required for these enhancements.
- May 30, 2023: Integrated WeTextProcessing for better text normalization, improving the pronunciation of numbers, currencies, temperatures, dates, etc.
- May 23, 2023: Speed enhancements for single-speaker models.
- April 21, 2023: Initial creation.
Usage Instructions
To use SummerTTS:
- Clone the project code onto a local machine, ideally one running Ubuntu Linux.
- Download the necessary models from the provided Baidu cloud link and place them in the `models` directory. The expected structure is:

```
models/
├── multi_speakers.bin
├── single_speaker_mid.bin
├── single_speaker_english.bin
├── single_speaker_english_fast.bin
└── single_speaker_fast.bin
```
- Navigate to the `Build` directory and execute:

```
cmake ..
make
```
- Upon successful compilation, execute the following to test speech synthesis:
  - For Chinese: `./tts_test ../test.txt ../models/single_speaker_fast.bin out.wav`
  - For English: `./tts_test ../test_eng.txt ../models/single_speaker_english.bin out_eng.wav`
System Details
The command parameters are:
- The path to a text file containing the text to synthesize.
- The path to the model file, which may be a single-speaker or multi-speaker model; `single_speaker_fast.bin` is recommended for balanced speed and quality.
- The path for the output synthesized audio file, playable via any media player (a minimal WAV-writing sketch follows this list).
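The output is a WAV file built from the 16-bit samples the synthesizer produces. As a rough, self-contained sketch of how such a file can be written (assuming mono 16-bit PCM on a little-endian host; the sample rate is a placeholder and must match what the model actually generates):

```cpp
#include <cstdint>
#include <cstdio>

// Minimal mono 16-bit PCM WAV writer. The sample rate is an assumption;
// pass the rate the model actually produces.
static bool writeWav(const char * path, const int16_t * samples,
                     int32_t sampleCount, int32_t sampleRate)
{
    FILE * f = fopen(path, "wb");
    if (!f) return false;

    const int32_t dataBytes  = sampleCount * 2;   // 16-bit mono
    const int32_t chunkSize  = 36 + dataBytes;    // RIFF chunk size
    const int32_t byteRate   = sampleRate * 2;    // rate * channels * bytes/sample
    const int32_t fmtSize    = 16;                // PCM fmt chunk size
    const int16_t pcm = 1, channels = 1, blockAlign = 2, bitsPerSample = 16;

    // RIFF header, fmt chunk, then the raw sample data.
    fwrite("RIFF", 1, 4, f); fwrite(&chunkSize, 4, 1, f);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmtSize, 4, 1, f);
    fwrite(&pcm, 2, 1, f); fwrite(&channels, 2, 1, f);
    fwrite(&sampleRate, 4, 1, f); fwrite(&byteRate, 4, 1, f);
    fwrite(&blockAlign, 2, 1, f); fwrite(&bitsPerSample, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&dataBytes, 4, 1, f);
    fwrite(samples, 2, sampleCount, f);
    fclose(f);
    return true;
}
```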
Developer Details
The synthesis program is implemented in `test/main.cpp`, whereas the main interface is defined in `include/SynthesizerTrn.h`. The key function is:

```cpp
int16_t * infer(const string & line, int32_t sid, float lengthScale, int32_t & dataLen)
```

- `line`: the string of text to synthesize.
- `sid`: the speaker ID; only relevant for multi-speaker models.
- `lengthScale`: controls speech speed; higher values slow the speech down.
- `dataLen`: an output parameter that receives the length of the synthesized audio data.

The text normalization module handles Arabic numerals and punctuation; English characters are currently not processed there.
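As a hedged usage sketch: the `infer` signature comes from the interface documented above, but how a `SynthesizerTrn` instance is constructed, and who owns the returned buffer, depend on `include/SynthesizerTrn.h` and are not shown here.

```cpp
#include <cstdint>
#include <string>
#include "SynthesizerTrn.h"  // assumed to be on the include path

// Sketch: synthesize one line with an already-constructed synthesizer.
// speakerId is ignored by single-speaker models; lengthScale > 1.0 slows
// the speech down. sampleCount receives the length of the audio data.
int16_t * synthesizeLine(SynthesizerTrn & synth, const std::string & text,
                         int32_t & sampleCount)
{
    const int32_t speakerId   = 0;
    const float   lengthScale = 1.0f;
    return synth.infer(text, speakerId, lengthScale, sampleCount);
}
```

The returned buffer and its length can then be written out, for example with a WAV writer like the sketch in the previous section.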
Future Development
Plans include open-sourcing the model training and conversion scripts and developing models with further improved audio quality.
Contact
For further inquiries, reach out via email at [email protected] or add on WeChat: hwang_2011.
Acknowledgments
Thanks to these projects and datasets for their contributions:
- Eigen
- vits, vits_chinese, MB-iSTFT-VITS
- WeTextProcessing
- Auxiliary projects: glog, gflags, openfst, hanz2piny, cppjieba, g2p_en, English-to-IPA
- Training data from open datasets: AISHELL-3 for the multi-speaker models and LJ Speech for the English single-speaker models.