Glow-TTS: A Game-Changer in Text-to-Speech Technology
Introduction
Glow-TTS is an innovative project focused on transforming text into speech in a smooth, efficient, and natural manner. Developed by researchers Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon, this project introduces a new way of generating speech using advanced machine learning techniques.
Background
In the world of text-to-speech (TTS) systems, recent models such as FastSpeech and ParaNet have made strides by generating mel-spectrograms from text in parallel, which makes them much faster than their autoregressive predecessors. However, these parallel models share a significant limitation: they cannot learn the alignment between text and speech on their own, so they depend on external aligners, typically pretrained autoregressive models, for guidance.
The Glow-TTS Approach
Glow-TTS stands out as a generative flow-based model for TTS that requires no external aligner to train. Instead, it combines the properties of flow models with dynamic programming to search for the most probable monotonic alignment between text and the latent representation of speech on its own. This procedure is called monotonic alignment search (MAS).
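The idea behind monotonic alignment search can be illustrated with a small dynamic program: each mel frame is assigned to exactly one text token, assignments never move backward, and the highest-scoring path is recovered by backtracking. The sketch below is an illustrative NumPy implementation, not the repository's Cython code; the function name and shape conventions are assumptions made here for clarity.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Return a 0/1 alignment matrix of shape (T_text, T_mel) that assigns
    each mel frame to one text token, monotonically, maximizing the sum of
    log_probs[i, j] (log-likelihood of frame j under token i's distribution)."""
    T_text, T_mel = log_probs.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    # Forward pass: Q[i, j] is the best score of any monotonic path
    # that assigns frame j to token i.
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                               # frame j-1 on the same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # frame j-1 on the previous token
            Q[i, j] = log_probs[i, j] + max(stay, advance)
    # Backtrack from the last token and last frame to recover the path.
    A = np.zeros((T_text, T_mel), dtype=int)
    i = T_text - 1
    for j in range(T_mel - 1, 0, -1):
        A[i, j] = 1
        if i > 0 and Q[i - 1, j - 1] > Q[i, j - 1]:
            i -= 1
    A[0, 0] = 1  # a valid monotonic path always starts at the first token
    return A
```

Because the search is a simple two-choice recurrence (stay on the current token or advance to the next), it runs in O(T_text × T_mel) time, which is why it is fast enough to run inside every training iteration.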
Key Advantages
- Robustness and Generalization: By enforcing hard monotonic alignments, Glow-TTS synthesizes speech robustly and generalizes well to long utterances.
- Speed and Quality: The model synthesizes speech significantly faster than the autoregressive Tacotron 2 model while maintaining comparable speech quality.
- Flexibility: The architecture also supports a multi-speaker setting, making it adaptable for various use cases.
Demonstrations and Resources
For those interested in experiencing Glow-TTS firsthand, a demo is available with audio samples. Additionally, a pretrained model is provided for users who want to explore the technology further.
Recent Updates
Recently, the team implemented two notable improvements to enhance Glow-TTS:
- Enhanced Vocoder: By integrating the HiFi-GAN vocoder, the model now produces less noisy outputs. Details and samples of this improvement are available in the HiFi-GAN repository.
- Better Pronunciation: By adding a blank token between input tokens, pronunciation quality has improved. Users can access a config file and a pretrained model for this feature.
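The blank-token trick itself is simple to picture: the input token sequence is padded so that a designated blank symbol sits between every pair of tokens and at both ends. A minimal sketch, with the function name and the `blank_id` parameter as illustrative assumptions rather than the repository's exact implementation:

```python
def intersperse(tokens, blank_id):
    """Insert a blank token between every pair of input tokens and at both
    ends, e.g. [5, 3, 7] with blank_id=0 -> [0, 5, 0, 3, 0, 7, 0]."""
    result = [blank_id] * (2 * len(tokens) + 1)
    result[1::2] = tokens  # the original tokens land on the odd positions
    return result
```

Note that a model trained this way expects interspersed inputs at inference time as well, which is why a matching config file and pretrained model are provided together.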
Technical Environment
The development environment for Glow-TTS includes an array of tools and packages:
- Python 3.6.9
- PyTorch 1.2.0
- Cython 0.29.12
- librosa 0.7.1
- NumPy 1.16.4
- SciPy 1.3.0
Additionally, the project uses NVIDIA's apex library for mixed-precision training.
Getting Started
To work with Glow-TTS, several prerequisites are necessary:
- Dataset: Download and set up the LJ Speech dataset.
- WaveGlow Model: Initialize the WaveGlow submodule and download the pre-trained model.
- Code Building: Compile the monotonic alignment search code using Cython.
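The submodule and build steps above typically look something like the following; the exact directory name of the Cython extension is an assumption based on common repository layouts, so check the repository itself before running.

```shell
# Initialize and fetch the WaveGlow submodule
git submodule init
git submodule update

# Build the Cython monotonic alignment search extension in place
# (the directory name is assumed to match the repository layout)
cd monotonic_align
python setup.py build_ext --inplace
```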
Training and Inference
To train the model, users can run:
sh train_ddi.sh configs/base.json base
For inference, guidance is provided in the inference example.
Acknowledgements
Glow-TTS's development is heavily influenced by contributions from various research repositories, including WaveGlow, Tensor2Tensor, and Mellotron.
Glow-TTS is a significant advancement in the field of text-to-speech generation, offering speed, flexibility, and robustness that set a new benchmark for future developments.