StableTTS: Revolutionizing Text-To-Speech Technology
Introduction
StableTTS is an innovative next-generation text-to-speech (TTS) model, and the first open-source TTS model to combine flow matching with a Diffusion Transformer (DiT), drawing inspiration from the renowned Stable Diffusion 3. The model is both fast and lightweight, generating speech in Chinese, English, and Japanese from a compact network of only 31 million parameters, ensuring efficient and robust speech synthesis.
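For context, flow matching trains the model to predict a velocity field that transports noise to data along a simple path. The sketch below shows the conditional flow-matching objective popularized by Matcha-TTS, which this family of models builds on; the tensor shapes, the sigma_min value, and the model call signature are illustrative assumptions, not StableTTS's exact code.

```python
import torch

def conditional_flow_matching_loss(model, x1, cond, sigma_min=1e-4):
    """One training step of conditional flow matching (OT-CFM).

    x1:   target mel spectrogram, shape (batch, n_mels, frames) -- assumed layout
    cond: conditioning information (e.g. encoded text), passed through to the model
    """
    b = x1.size(0)
    t = torch.rand(b, 1, 1, device=x1.device)   # random timestep in [0, 1]
    x0 = torch.randn_like(x1)                   # noise sample
    # Straight-line probability path from noise (t=0) to data (t=1)
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1 - sigma_min) * x0               # target velocity along the path
    v = model(xt, t.squeeze(), cond)            # model(x, t, cond) is an assumed signature
    return torch.nn.functional.mse_loss(v, u)
```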
Latest Developments
October 2024 Update:
- A new autoregressive TTS model is set to debut soon.
September 2024 Release:
- StableTTS V1.1 was released, significantly enhancing audio quality. Key upgrades include:
  - Fixed critical audio quality issues.
  - Added U-Net-like long skip connections to the DiT in the flow-matching decoder.
  - Adopted a cosine timestep scheduler.
  - Added support for classifier-free guidance (CFG) and the FireflyGAN vocoder.
  - Switched to ODE solvers via torchdiffeq.
  - Improved Chinese-language support and multilingual capabilities.
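Among these upgrades, CFG and the torchdiffeq ODE solvers interact at sampling time. Below is a minimal sketch of how a CFG-guided velocity field is typically integrated with torchdiffeq's odeint; the model(x, t, cond) signature, the None-as-unconditional convention, and the guidance scale are assumptions for illustration, not the project's exact sampler.

```python
import torch
from torchdiffeq import odeint

@torch.no_grad()
def sample_with_cfg(model, cond, shape, guidance_scale=3.0, steps=10, device="cpu"):
    """Integrate the learned velocity field from noise (t=0) to data (t=1),
    mixing conditional and unconditional predictions for classifier-free guidance."""
    x0 = torch.randn(shape, device=device)

    def velocity(t, x):
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)  # None = unconditional branch (assumed convention)
        return v_uncond + guidance_scale * (v_cond - v_uncond)

    t_span = torch.linspace(0, 1, steps, device=device)
    traj = odeint(velocity, x0, t_span, rtol=1e-5, atol=1e-5)
    return traj[-1]  # final state of the trajectory: the generated mel spectrogram
```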
Pretrained Models
StableTTS offers pretrained models for both text-to-mel and mel-to-wav conversion. The text-to-mel model is available for download and supports inference, finetuning, and the web UI. For mel-to-wav conversion, users can choose between the vocos and firefly-gan vocoders to turn mel spectrograms into wav files.
- Text-to-Mel model: download and place in ./checkpoints.
- Mel-to-Wav model: choose a vocoder and place it in ./vocoders/pretrained.
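As an example of the mel-to-wav step, the vocos package exposes pretrained checkpoints on the Hugging Face Hub. The snippet below uses its documented API; whether the 24 kHz, 100-bin mel configuration of this checkpoint matches StableTTS's mel settings is an assumption here.

```python
import torch
from vocos import Vocos

# Load a pretrained Vocos vocoder from the Hugging Face Hub.
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# mel: (batch, n_mels, frames) spectrogram, e.g. from the text-to-mel model.
# This checkpoint expects 100 mel bins at 24 kHz.
mel = torch.randn(1, 100, 256)  # random placeholder spectrogram
audio = vocos.decode(mel)       # -> (batch, samples) waveform at 24 kHz
```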
Installation and Operation
To get started with StableTTS, you'll need to follow a few installation steps:
- Install PyTorch: follow the official PyTorch installation guide for your platform.
- Install dependencies: run pip install -r requirements.txt to install the remaining Python packages.
For running inference or using the web-based UI, inference.ipynb and webui.py provide detailed guidance.
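Before opening the notebook, a quick sanity check confirms that PyTorch is installed and whether a CUDA device is visible:

```python
import torch

# Verify the PyTorch build and GPU availability before running inference.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```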
Training
Training StableTTS is streamlined: it requires only text-audio pairs, with no extra feature extraction. The process involves generating text and audio file lists, preprocessing them, and setting the training configuration before launching training (see the sketch below for one way to build such a filelist).
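As a concrete illustration of the data-preparation step, the sketch below pairs each recording with a same-named transcript file and writes one audio_path|text line per pair. The '|' delimiter and the file layout are assumptions for illustration; the exact format expected is defined by the repository's preprocessing scripts.

```python
from pathlib import Path

def build_filelist(data_dir: str, out_path: str) -> None:
    """Pair each .wav with a same-named .txt transcript and write one
    'audio_path|text' line per pair (the '|' format is an assumption)."""
    lines = []
    for wav in sorted(Path(data_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            text = txt.read_text(encoding="utf-8").strip()
            lines.append(f"{wav}|{text}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines), encoding="utf-8")

# Hypothetical paths, shown only as a usage example.
build_filelist("./data/my_speaker", "./filelists/train.txt")
```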
Model Structure
StableTTS's structure is sophisticated yet efficient. It uses the Diffusion Convolution Transformer block from HierSpeech++, which combines the original DiT with the FFT (Feed-Forward Transformer from FastSpeech) for better vocal prosody. A FiLM layer conditions the flow-matching decoder on the timestep embedding.
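To make the conditioning mechanism concrete, here is a generic FiLM (feature-wise linear modulation) layer in PyTorch: the timestep embedding predicts a per-channel scale and shift applied to the decoder features. The dimensions are illustrative, and this is a textbook FiLM module rather than the project's exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector (here, the
    timestep embedding) predicts a per-channel scale and shift."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), cond: (batch, cond_dim)
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)

# Example with illustrative sizes: modulate decoder features by a timestep embedding.
film = FiLM(channels=256, cond_dim=128)
features = torch.randn(2, 256, 100)
t_emb = torch.randn(2, 128)
out = film(features, t_emb)  # same shape as features: (2, 256, 100)
```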
Acknowledgments
StableTTS stands on the shoulders of prior innovations, drawing from projects like Matcha-TTS, Grad-TTS, and the influential Stable Diffusion 3, among others. These projects have shaped StableTTS's architecture, training approach, and vocoder integrations.
Future Outlook
StableTTS continues to evolve, with ongoing efforts to enhance documentation and extend language support, promising refinement and further innovations in its TTS capabilities. The vision for StableTTS encompasses continuous improvement and adaptation, ensuring it remains a bridge between cutting-edge research and accessible, open-source technology.
Disclaimer
StableTTS prohibits using its technology to generate or alter an individual's speech without explicit consent. This includes, but is not limited to, modifying the speech of prominent figures such as government leaders and celebrities. Users are reminded to respect copyright laws and uphold individual rights in their applications.