Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation
Recent Updates
- December 18, 2021: A pre-trained multi-speaker MelGAN vocoder at 16 kHz, contributed by Guan-Ting Lin, is now available. See the Pre-trained 16k-MelGAN link for details and download.
- June 9, 2021: The Variance Adaptor was improved: its architecture was changed to two Conv1D layers followed by a linear layer, and layer normalization and phoneme-wise positional encoding were added for better output quality (a rough sketch of the predictor structure follows this list).
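For reference, a minimal PyTorch sketch of a predictor block with that structure is shown below; the layer sizes, kernel size, and dropout rate are illustrative assumptions rather than the repository's exact values, and the phoneme-wise positional encoding is omitted.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Illustrative sketch: two Conv1D layers with LayerNorm, then a linear projection."""
    def __init__(self, hidden=256, filter_size=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.norm2 = nn.LayerNorm(filter_size)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(filter_size, 1)  # one predicted value (e.g., duration) per phoneme

    def forward(self, x):                        # x: (batch, time, hidden)
        y = x.transpose(1, 2)                    # Conv1d expects (batch, channels, time)
        y = self.dropout(torch.relu(self.conv1(y)))
        y = self.norm1(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(torch.relu(self.conv2(y)))
        y = self.norm2(y.transpose(1, 2))
        return self.linear(y).squeeze(-1)        # (batch, time)
```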
Introduction
The Meta-StyleSpeech project presents a text-to-speech (TTS) approach for synthesizing speech in the style and tone of an arbitrary target speaker from only a short reference audio clip. Built on the core StyleSpeech model, Meta-StyleSpeech adapts to new speakers quickly and efficiently, even from short audio samples.
Abstract
Demand for personalized TTS has grown rapidly, with the goal of producing high-quality speech in a particular speaker's voice. Existing models, however, either require extensive fine-tuning on the target speaker or adapt poorly. Meta-StyleSpeech overcomes these challenges. Its core component, Style-Adaptive Layer Normalization (SALN), aligns the gain and bias of the text input according to the style extracted from the reference speech, enabling effective adaptation from even a single short speech sample without fine-tuning. Meta-StyleSpeech further improves adaptation to unseen speakers through additional training with two discriminators and style prototypes, yielding a significant performance gain over baseline models.
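Conceptually, SALN replaces the fixed gain and bias of standard layer normalization with values predicted from a style vector. Below is a minimal PyTorch sketch of that idea; the dimension names and the single linear projection are assumptions for illustration, not the repository's exact implementation.

```python
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Sketch: layer-normalize the hidden states, then scale and shift them
    with a gain and bias predicted from the speaker style vector."""
    def __init__(self, hidden_dim, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # predicts gain and bias

    def forward(self, h, style):             # h: (batch, time, hidden), style: (batch, style_dim)
        gain, bias = self.affine(style).chunk(2, dim=-1)
        return gain.unsqueeze(1) * self.norm(h) + bias.unsqueeze(1)
```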
Audio samples demonstrating the model are available on the demo page.
Accessing Pretrained Models
Pretrained model checkpoints are available for download and are required for the inference and Meta-StyleSpeech training steps described below.
Prerequisites
Ensure you have the necessary environment setup:
- Clone the project repository.
- Install the required Python packages listed in `requirements.txt`.
Inference
To generate speech with StyleSpeech, download the pretrained models and then run:
python synthesize.py --text <text to be synthesized> --ref_audio <path to reference audio> --checkpoint_path <path to pretrained model>
The resulting mel-spectrogram will be saved in the `results/` folder.
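To listen to the output, the mel-spectrogram must still be passed through a vocoder, such as the pretrained 16 kHz MelGAN mentioned above. A rough sketch of that step is shown below; the mel tensor layout and the `vocoder` module are assumptions, and the vocoder itself must be loaded according to its own instructions.

```python
import numpy as np
import torch

def mel_to_wav(mel_path, vocoder, sample_rate=16000):
    """Run a saved mel-spectrogram through a MelGAN-style vocoder.
    `vocoder` is assumed to be a torch module mapping (1, n_mels, frames) -> waveform."""
    mel = torch.from_numpy(np.load(mel_path)).float()
    if mel.dim() == 2:
        mel = mel.unsqueeze(0)               # add a batch dimension
    with torch.no_grad():
        audio = vocoder(mel).squeeze().cpu().numpy()
    return audio, sample_rate
```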
Preprocessing the Dataset
Meta-StyleSpeech uses the LibriTTS dataset for training. Follow these steps to prepare the dataset:
- Download and extract the dataset into the `dataset/` folder.
- Run `prepare_align.py` to resample the audio files to 16 kHz.
- Use the Montreal Forced Aligner for phoneme alignment:
./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v
- Run `preprocess.py` to prepare the mel-spectrograms and other data needed for training (a rough sketch of this kind of feature extraction follows the list).
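For intuition, the kind of log-mel feature that `preprocess.py` computes can be approximated as follows; the STFT and mel parameters here are common defaults for 16 kHz TTS data and are assumptions, not necessarily the repository's exact configuration.

```python
import librosa
import numpy as np

def compute_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Load a 16 kHz waveform and compute a log-mel spectrogram (illustrative parameters)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # (n_mels, frames)
```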
Training
To start training StyleSpeech from scratch, execute:
python train.py
For Meta-StyleSpeech, begin training from a pretrained StyleSpeech model using:
python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>
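As noted in the abstract, meta-training adapts the generator episodically against two discriminators together with style prototypes. The sketch below is a heavily simplified illustration of one adversarial generator step under that setup; the module names and loss form are assumptions, and the style prototypes and discriminator updates are omitted.

```python
import torch

def meta_generator_step(generator, style_encoder, style_disc, phoneme_disc,
                        support_mel, query_text, optimizer):
    """Simplified episode: extract the support speaker's style, synthesize the
    query text in that style, and score the result with both discriminators."""
    style = style_encoder(support_mel)          # style vector from the support sample
    fake_mel = generator(query_text, style)     # generated mel for the query text
    # Least-squares adversarial loss against both discriminators (illustrative).
    adv_loss = ((1 - style_disc(fake_mel, style)) ** 2).mean() \
             + ((1 - phoneme_disc(fake_mel, query_text)) ** 2).mean()
    optimizer.zero_grad()
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```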
Acknowledgements
Meta-StyleSpeech was inspired by and builds upon several foundational TTS projects, including FastSpeech2, ming024's FastSpeech implementation, Mellotron, and Tacotron.