StableTTS: Revolutionizing Text-To-Speech Technology
Introduction
StableTTS is an innovative next-generation text-to-speech (TTS) model, and the first open-source TTS model to combine flow matching with a Diffusion Transformer (DiT), drawing inspiration from the renowned Stable Diffusion 3. The model is both fast and lightweight, generating speech in Chinese, English, and Japanese from a compact network of only 31 million parameters, ensuring efficient and robust speech synthesis.
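For context, flow matching trains the model to predict a velocity field that transports noise to data along a simple path. The sketch below shows the conditional flow-matching objective popularized by Matcha-TTS, which this family of models builds on; the tensor shapes, the sigma_min value, and the model call signature are illustrative assumptions, not StableTTS's exact code.

```python
import torch

def conditional_flow_matching_loss(model, x1, cond, sigma_min=1e-4):
    """One training step of conditional flow matching (OT-CFM).

    x1:   target mel spectrogram, shape (batch, n_mels, frames) -- assumed layout
    cond: conditioning information (e.g. encoded text), passed through to the model
    """
    b = x1.size(0)
    t = torch.rand(b, 1, 1, device=x1.device)   # random timestep in [0, 1]
    x0 = torch.randn_like(x1)                   # noise sample
    # Straight-line probability path from noise (t=0) to data (t=1)
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1 - sigma_min) * x0               # target velocity along the path
    v = model(xt, t.squeeze(), cond)            # model(x, t, cond) is an assumed signature
    return torch.nn.functional.mse_loss(v, u)
```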
Latest Developments
October 2024 Update:
- A new autoregressive TTS model is set to debut soon.
September 2024 Release:
- StableTTS V1.1 was released, significantly enhancing audio quality. Key upgrades include:
  - Fixed critical audio quality issues.
  - Added U-Net-like long skip connections to the DiT in the flow-matching decoder.
  - Adopted a cosine timestep scheduler.
  - Added support for classifier-free guidance (CFG) and the FireflyGAN vocoder.
  - Switched to ODE solvers via torchdiffeq.
  - Improved Chinese-language support and multilingual capabilities.
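Among these upgrades, CFG and the torchdiffeq ODE solvers interact at sampling time. Below is a minimal sketch of how a CFG-guided velocity field is typically integrated with torchdiffeq's odeint; the model(x, t, cond) signature, the None-as-unconditional convention, and the guidance scale are assumptions for illustration, not the project's exact sampler.

```python
import torch
from torchdiffeq import odeint

@torch.no_grad()
def sample_with_cfg(model, cond, shape, guidance_scale=3.0, steps=10, device="cpu"):
    """Integrate the learned velocity field from noise (t=0) to data (t=1),
    mixing conditional and unconditional predictions for classifier-free guidance."""
    x0 = torch.randn(shape, device=device)

    def velocity(t, x):
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)  # None = unconditional branch (assumed convention)
        return v_uncond + guidance_scale * (v_cond - v_uncond)

    t_span = torch.linspace(0, 1, steps, device=device)
    traj = odeint(velocity, x0, t_span, rtol=1e-5, atol=1e-5)
    return traj[-1]  # final state of the trajectory: the generated mel spectrogram
```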
Pretrained Models
StableTTS offers pretrained models for both text-to-mel and mel-to-wav conversion. The text-to-mel model is available for download and supports inference, finetuning, and the web UI. For mel-to-wav conversion, users can choose between the vocos and firefly-gan vocoders to turn mel spectrograms into wav files.
- Text-to-Mel model: download and place in ./checkpoints.
- Mel-to-Wav model: choose a vocoder and place it in ./vocoders/pretrained.
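As an example of the mel-to-wav step, the vocos package exposes pretrained checkpoints on the Hugging Face Hub. The snippet below uses its documented API; whether the 24 kHz, 100-bin mel configuration of this checkpoint matches StableTTS's mel settings is an assumption here.

```python
import torch
from vocos import Vocos

# Load a pretrained Vocos vocoder from the Hugging Face Hub.
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# mel: (batch, n_mels, frames) spectrogram, e.g. from the text-to-mel model.
# This checkpoint expects 100 mel bins at 24 kHz.
mel = torch.randn(1, 100, 256)  # random placeholder spectrogram
audio = vocos.decode(mel)       # -> (batch, samples) waveform at 24 kHz
```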
Installation and Operation
To get started with StableTTS, you'll need to follow a few installation steps:
- Install PyTorch: follow the official PyTorch installation guide for your platform.
- Install dependencies: run pip install -r requirements.txt to install the remaining Python packages.
For running inference or using the web-based UI, inference.ipynb and webui.py provide detailed guidance.
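Before opening the notebook, a quick sanity check confirms that PyTorch is installed and whether a CUDA device is visible:

```python
import torch

# Verify the PyTorch build and GPU availability before running inference.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```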
Training
Training StableTTS is streamlined: it requires only text-audio pairs, with no extra feature extraction. The process involves generating text and audio file lists, preprocessing them, and setting the training configuration before launching training (see the sketch below for one way to build such a filelist).
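As a concrete illustration of the data-preparation step, the sketch below pairs each recording with a same-named transcript file and writes one audio_path|text line per pair. The '|' delimiter and the file layout are assumptions for illustration; the exact format expected is defined by the repository's preprocessing scripts.

```python
from pathlib import Path

def build_filelist(data_dir: str, out_path: str) -> None:
    """Pair each .wav with a same-named .txt transcript and write one
    'audio_path|text' line per pair (the '|' format is an assumption)."""
    lines = []
    for wav in sorted(Path(data_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            text = txt.read_text(encoding="utf-8").strip()
            lines.append(f"{wav}|{text}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines), encoding="utf-8")

# Hypothetical paths, shown only as a usage example.
build_filelist("./data/my_speaker", "./filelists/train.txt")
```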
Model Structure
StableTTS's structure is sophisticated yet efficient. It uses the Diffusion Convolution Transformer block from HierSpeech++, which combines the original DiT with the FFT (Feed-Forward Transformer from FastSpeech) for better vocal prosody. A FiLM layer conditions the flow-matching decoder on the timestep embedding.
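To make the conditioning mechanism concrete, here is a generic FiLM (feature-wise linear modulation) layer in PyTorch: the timestep embedding predicts a per-channel scale and shift applied to the decoder features. The dimensions are illustrative, and this is a textbook FiLM module rather than the project's exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector (here, the
    timestep embedding) predicts a per-channel scale and shift."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), cond: (batch, cond_dim)
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)

# Example with illustrative sizes: modulate decoder features by a timestep embedding.
film = FiLM(channels=256, cond_dim=128)
features = torch.randn(2, 256, 100)
t_emb = torch.randn(2, 128)
out = film(features, t_emb)  # same shape as features: (2, 256, 100)
```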
Acknowledgments
StableTTS stands on the shoulders of prior innovations, drawing from projects like Matcha-TTS, Grad-TTS, and the influential Stable Diffusion 3, among others. These projects have shaped StableTTS's architecture, training approach, and vocoder integrations.
Future Outlook
StableTTS continues to evolve, with ongoing efforts to enhance documentation and extend language support, promising refinement and further innovations in its TTS capabilities. The vision for StableTTS encompasses continuous improvement and adaptation, ensuring it remains a bridge between cutting-edge research and accessible, open-source technology.
Disclaimer
StableTTS prohibits using its technology to generate or alter an individual's speech without explicit consent. This includes, but is not limited to, modifying the speech of prominent figures such as government leaders and celebrities. Users are reminded to respect copyright laws and uphold individual rights in their applications.