Transformer-TTS: A Deep Dive into Speech Synthesis
Overview
Transformer-TTS is a PyTorch implementation of neural speech synthesis with a Transformer network. The model marks a significant advancement in speech synthesis technology, training faster while maintaining audio quality comparable to other seq2seq models such as Tacotron. Experiments show that it trains approximately three to four times faster than such models, requiring about 0.5 seconds per training step.
This implementation does not use the WaveNet vocoder. Instead, the post-processing network uses the CBHG module from Tacotron, and the spectrogram is converted to raw audio with the Griffin-Lim algorithm.
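As a rough illustration of that final step, the sketch below inverts a linear-magnitude spectrogram to a waveform with librosa's Griffin-Lim implementation; the hop and window lengths shown are placeholders, and the actual values should be taken from hyperparams.py.

```python
import librosa
import numpy as np

# Placeholder analysis settings; the real values are defined in hyperparams.py.
HOP_LENGTH, WIN_LENGTH = 256, 1024

def spectrogram_to_wav(mag: np.ndarray, n_iter: int = 60) -> np.ndarray:
    """Invert a linear-magnitude spectrogram (freq bins x frames) via Griffin-Lim."""
    return librosa.griffinlim(
        mag,
        n_iter=n_iter,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
    )
```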
System Requirements
To get started with Transformer-TTS, users must ensure the following system requirements are met:
- Installation of Python 3
- PyTorch version 0.4.0
- Additional requirements can be installed via:
pip install -r requirements.txt
Data Utilization
The LJSpeech dataset, containing 13,100 pairs of text transcripts and corresponding WAV files, was used for training. The dataset can be downloaded from here; the preprocessing scripts are adapted from the tacotron and dc_tts repositories.
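As a hedged sketch of what that preprocessing does, the snippet below computes mel and linear magnitude spectrograms for a single WAV file with librosa; the sample rate, FFT size, hop length, and mel-bin count are assumptions, and the authoritative values live in hyperparams.py and prepare_data.py.

```python
import librosa
import numpy as np

# Assumed analysis settings; the repository's hyperparams.py is authoritative.
SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

def wav_to_spectrograms(path: str):
    """Return (mel, linear) magnitude spectrograms for one LJSpeech WAV file."""
    wav, _ = librosa.load(path, sr=SR)
    linear = np.abs(librosa.stft(wav, n_fft=N_FFT, hop_length=HOP))     # (1 + N_FFT/2, T)
    mel_basis = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS)  # (N_MELS, 1 + N_FFT/2)
    mel = mel_basis @ linear                                            # (N_MELS, T)
    return mel, linear
```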
Pretrained Model
Pretrained models are available for use and can be downloaded here. It's crucial to place these models in the checkpoint directory.
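Loading a downloaded checkpoint typically looks like the sketch below; the file name is hypothetical, and the structure of the saved state depends on how the training scripts wrote it, so inspect the loaded object before use.

```python
import torch

# Hypothetical file name -- substitute the checkpoint you actually placed
# in the checkpoint directory.
CHECKPOINT = "checkpoint/checkpoint_transformer.pth.tar"

state = torch.load(CHECKPOINT, map_location="cpu")
# The file may hold a bare state_dict or a dict that wraps one (for example
# under a "model" key); print the keys to see which layout this checkpoint uses.
print(type(state), list(state.keys())[:5] if isinstance(state, dict) else None)
```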
Attention Mechanism
Transformer-TTS utilizes a sophisticated attention mechanism. After approximately 15,000 training steps, a diagonal alignment becomes visible in the attention plots, reaching full form by 160,000 steps. These plots illustrate the multi-head attention process across the layers, with 12 attention plots drawn for the encoder, decoder, and encoder-decoder layers.
- Self Attention Encoder (attention plots)
- Self Attention Decoder (attention plots)
- Attention Encoder-Decoder (attention plots)
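The attention plots can be reproduced with a short matplotlib helper along the lines of the sketch below; the assumed tensor layout (heads, output steps, input steps) is an illustration, since the exact shapes returned by module.py may differ.

```python
import matplotlib.pyplot as plt
import torch

def plot_attention(attn: torch.Tensor, title: str = "encoder-decoder attention"):
    """Draw one heatmap per head for an attention tensor of shape (heads, T_out, T_in)."""
    heads = attn.size(0)
    fig, axes = plt.subplots(1, heads, figsize=(4 * heads, 4))
    for h, ax in enumerate(axes if heads > 1 else [axes]):
        ax.imshow(attn[h].detach().cpu().numpy(), aspect="auto", origin="lower")
        ax.set_title(f"{title}, head {h}")
        ax.set_xlabel("input step")
        ax.set_ylabel("output step")
    fig.tight_layout()
    return fig
```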
Training Dynamics
Training uses a Noam-style learning-rate schedule with warmup and decay, similar to Tacotron. The alpha values that scale the positional encoding behave differently from those reported in the original paper: the encoder's alpha initially increases before steadily decreasing, while the decoder's alpha decreases throughout training.
- Learning Curves (plot)
- Alpha Values (plot)
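The alpha tracked above is the trainable scale applied to the positional encoding. A minimal sketch of that idea, assuming a standard sinusoidal table (the repository's module.py holds the actual implementation):

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding scaled by a trainable weight alpha."""

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))  # the alpha plotted above
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        return x + self.alpha * self.pe[:, : x.size(1)]
```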
Key Experimental Insights
During the experimentation phase, several critical insights were identified:
- Learning Rate: an important training parameter; a fixed initial value of 0.001 with exponential decay did not work.
- Gradient Clipping: clipping gradients to a maximum norm of 1 was essential (a training-step sketch follows this list).
- Stop Token: adding a stop-token loss had no noticeable effect on training.
- Attention: concatenating the input and context vectors in the attention mechanism proved vital.
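Putting the learning-rate and clipping findings together, a single training step might look like the hedged sketch below; the model, loss, base learning rate, and warmup length are placeholders rather than the exact values used in train_transformer.py.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

def noam_lambda(d_model: int = 256, warmup: int = 4000):
    """Noam schedule: linear warmup followed by inverse-square-root decay."""
    def fn(step: int) -> float:
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return fn

model = nn.Linear(80, 80)                                  # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)   # base lr scaled by the lambda
scheduler = LambdaLR(optimizer, lr_lambda=noam_lambda())

def train_step(batch_in, batch_target):
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(batch_in), batch_target)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the essential clipping
    optimizer.step()
    scheduler.step()
    return loss.item()
```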
Generated Samples
The model generates synthesized speech samples that show it converging toward high-quality output. Samples at 160,000 steps still struggle with longer sentences, but they demonstrate the model's potential.
File Descriptions
The project's files are organized as follows:
- hyperparams.py: contains all necessary hyperparameters.
- prepare_data.py: preprocesses WAV files and saves them as mel and linear spectrograms.
- preprocess.py: houses all data preprocessing scripts.
- module.py: includes essential methods such as attention, prenet, and postnet.
- network.py: comprises the encoder, decoder, and post-processing network architecture.
- train_transformer.py: trains the autoregressive attention network (text to mel conversion).
- train_postnet.py: trains the post network (mel to linear conversion).
- synthesis.py: generates TTS samples.
Training and Generating Speech
To train the network and produce speech outputs, follow these steps:
- Training:
  - Download the LJSpeech data to a chosen directory.
  - Adjust the hyperparameters in hyperparams.py, particularly 'data_path'.
  - Run prepare_data.py.
  - Run train_transformer.py.
  - Run train_postnet.py.
- Generating WAV Files:
  - Run synthesis.py, verifying the restore step.
Acknowledgments and Feedback
The Transformer-TTS project acknowledges the tacotron and dc_tts repositories, as well as a PyTorch implementation of "Attention Is All You Need".
For any questions or feedback regarding the codebase, the project encourages community input and discussion.