Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation
Recent Updates
- December 18, 2021: A pre-trained multi-speaker MelGAN vocoder at 16 kHz, contributed by Guan-Ting Lin, is now available. See the Pre-trained 16k-MelGAN link for details and download.
- June 9, 2021: The Variance Adaptor was improved: its architecture was changed to two Conv1D layers followed by a linear layer, and layer normalization and phoneme-wise positional encoding were added for better output quality (a rough sketch of the predictor structure follows this list).
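For reference, a minimal PyTorch sketch of a predictor block with that structure is shown below; the layer sizes, kernel size, and dropout rate are illustrative assumptions rather than the repository's exact values, and the phoneme-wise positional encoding is omitted.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Illustrative sketch: two Conv1D layers with LayerNorm, then a linear projection."""
    def __init__(self, hidden=256, filter_size=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.norm2 = nn.LayerNorm(filter_size)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(filter_size, 1)  # one predicted value (e.g., duration) per phoneme

    def forward(self, x):                        # x: (batch, time, hidden)
        y = x.transpose(1, 2)                    # Conv1d expects (batch, channels, time)
        y = self.dropout(torch.relu(self.conv1(y)))
        y = self.norm1(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(torch.relu(self.conv2(y)))
        y = self.norm2(y.transpose(1, 2))
        return self.linear(y).squeeze(-1)        # (batch, time)
```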
Introduction
The Meta-StyleSpeech project presents a text-to-speech (TTS) approach for synthesizing speech in the style and tone of an arbitrary target speaker from only a short reference audio clip. Built on the core StyleSpeech model, Meta-StyleSpeech adapts to new speakers quickly and efficiently, even from short audio samples.
Abstract
Demand for personalized TTS has grown rapidly, with the goal of producing high-quality speech in a particular speaker's voice. Existing models, however, either require extensive fine-tuning on the target speaker or adapt poorly. Meta-StyleSpeech overcomes these challenges. Its core component, Style-Adaptive Layer Normalization (SALN), aligns the gain and bias of the text input according to the style extracted from the reference speech, enabling effective adaptation from even a single short speech sample without fine-tuning. Meta-StyleSpeech further improves adaptation to unseen speakers through additional training with two discriminators and style prototypes, yielding a significant performance gain over baseline models.
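Conceptually, SALN replaces the fixed gain and bias of standard layer normalization with values predicted from a style vector. Below is a minimal PyTorch sketch of that idea; the dimension names and the single linear projection are assumptions for illustration, not the repository's exact implementation.

```python
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Sketch: layer-normalize the hidden states, then scale and shift them
    with a gain and bias predicted from the speaker style vector."""
    def __init__(self, hidden_dim, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # predicts gain and bias

    def forward(self, h, style):             # h: (batch, time, hidden), style: (batch, style_dim)
        gain, bias = self.affine(style).chunk(2, dim=-1)
        return gain.unsqueeze(1) * self.norm(h) + bias.unsqueeze(1)
```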
Audio samples demonstrating the model are available on the demo page.
Accessing Pretrained Models
Pretrained model checkpoints are available for download and are required for the inference and Meta-StyleSpeech training steps described below.
Prerequisites
Ensure you have the necessary environment setup:
- Clone the project repository.
- Install the required Python packages listed in `requirements.txt`.
Inference
To generate speech with StyleSpeech, download the pretrained models and then run:
python synthesize.py --text <text to be synthesized> --ref_audio <path to reference audio> --checkpoint_path <path to pretrained model>
The resulting mel-spectrogram will be saved in the `results/` folder.
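To listen to the output, the mel-spectrogram must still be passed through a vocoder, such as the pretrained 16 kHz MelGAN mentioned above. A rough sketch of that step is shown below; the mel tensor layout and the `vocoder` module are assumptions, and the vocoder itself must be loaded according to its own instructions.

```python
import numpy as np
import torch

def mel_to_wav(mel_path, vocoder, sample_rate=16000):
    """Run a saved mel-spectrogram through a MelGAN-style vocoder.
    `vocoder` is assumed to be a torch module mapping (1, n_mels, frames) -> waveform."""
    mel = torch.from_numpy(np.load(mel_path)).float()
    if mel.dim() == 2:
        mel = mel.unsqueeze(0)               # add a batch dimension
    with torch.no_grad():
        audio = vocoder(mel).squeeze().cpu().numpy()
    return audio, sample_rate
```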
Preprocessing the Dataset
Meta-StyleSpeech uses the LibriTTS dataset for training. Follow these steps to prepare the dataset:
- Download and extract the dataset into the `dataset/` folder.
- Run `prepare_align.py` to resample the audio files to 16 kHz.
- Use the Montreal Forced Aligner for phoneme alignment:
./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v
- Run `preprocess.py` to prepare the mel-spectrograms and other data needed for training (a rough sketch of this kind of feature extraction follows the list).
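For intuition, the kind of log-mel feature that `preprocess.py` computes can be approximated as follows; the STFT and mel parameters here are common defaults for 16 kHz TTS data and are assumptions, not necessarily the repository's exact configuration.

```python
import librosa
import numpy as np

def compute_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Load a 16 kHz waveform and compute a log-mel spectrogram (illustrative parameters)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # (n_mels, frames)
```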
Training
To start training StyleSpeech from scratch, execute:
python train.py
For Meta-StyleSpeech, begin training from a pretrained StyleSpeech model using:
python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>
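As noted in the abstract, meta-training adapts the generator episodically against two discriminators together with style prototypes. The sketch below is a heavily simplified illustration of one adversarial generator step under that setup; the module names and loss form are assumptions, and the style prototypes and discriminator updates are omitted.

```python
import torch

def meta_generator_step(generator, style_encoder, style_disc, phoneme_disc,
                        support_mel, query_text, optimizer):
    """Simplified episode: extract the support speaker's style, synthesize the
    query text in that style, and score the result with both discriminators."""
    style = style_encoder(support_mel)          # style vector from the support sample
    fake_mel = generator(query_text, style)     # generated mel for the query text
    # Least-squares adversarial loss against both discriminators (illustrative).
    adv_loss = ((1 - style_disc(fake_mel, style)) ** 2).mean() \
             + ((1 - phoneme_disc(fake_mel, query_text)) ** 2).mean()
    optimizer.zero_grad()
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```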
Acknowledgements
Meta-StyleSpeech was inspired by and builds upon several foundational TTS projects, including FastSpeech2, ming024's FastSpeech implementation, Mellotron, and Tacotron.