Introduction to PortaSpeech
PortaSpeech is a high-quality, portable text-to-speech (TTS) system implemented in PyTorch. It provides an efficient and versatile solution for generating natural-sounding speech from text. Based on the research paper "PortaSpeech: Portable and High-Quality Generative Text-to-Speech," the framework delivers a compact TTS model that is easy to integrate and deploy.
Audio Samples
For those interested in listening to the audio quality produced by PortaSpeech, a variety of sample outputs can be found in the project's demo section.
Model Size Variants
PortaSpeech comes in different sizes to accommodate various computational capacities and needs:
- Normal model: the higher-quality variant, with 24M parameters.
- Small model: a more compact variant with 7.6M parameters, suited to environments with limited resources.
These size variants enable users to choose the best fit according to their specific requirements.
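How a variant is selected depends on the repository's configuration; the commands below are only a hypothetical sketch to show the shape of such a switch (the `--model` flag is an assumption, not a documented option):

```sh
# Hypothetical sketch: select the model variant at launch time.
# The --model flag is an assumption; check the repository's config/ directory
# for the actual selection mechanism.
python3 train.py --dataset LJSpeech --model small   # 7.6M-parameter variant
python3 train.py --dataset LJSpeech --model normal  # 24M-parameter variant
```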
Quickstart Guide
Dependencies
To begin using PortaSpeech, first install the necessary Python dependencies from the provided requirements.txt file. For users who prefer Docker, a Dockerfile is available for setting up the environment.
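Assuming a standard pip and Docker workflow, setup typically looks like this (the image tag is an arbitrary example):

```sh
# Install the pinned Python dependencies.
pip3 install -r requirements.txt

# Alternatively, build a container from the provided Dockerfile.
docker build -t portaspeech .
```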
Inference Process
To synthesize speech, download the pretrained models and place them in the specified checkpoint directory. PortaSpeech currently supports single-speaker TTS; speech can be generated from any desired text by executing a simple Python command, as sketched below.
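The exact flags depend on the repository's CLI; the following is a plausible sketch in the style of similar FastSpeech2-family repositories, where the flag names and checkpoint step are assumptions:

```sh
# Sketch of single-utterance synthesis; the flag names and the checkpoint step
# (600000) are assumptions based on common conventions, not verified options.
python3 synthesize.py --text "Hello world" --restore_step 600000 --mode single --dataset LJSpeech
```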
Batch Inference and Controllability
PortaSpeech offers batch inference, allowing multiple text inputs to be processed in one run. The speaking rate of the generated speech can also be adjusted by modifying duration ratios, giving users further control over the output (see the sketch below).
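A hedged sketch of both features, under the same CLI assumptions as above (the source file path and the `--duration_control` flag are illustrative):

```sh
# Batch mode: synthesize every line of a prepared source file.
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 600000 --mode batch --dataset LJSpeech

# Duration control: ratios below 1.0 speed speech up, above 1.0 slow it down.
python3 synthesize.py --text "Hello world" --restore_step 600000 --mode single --dataset LJSpeech --duration_control 0.8
```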
Training
Datasets
Currently, PortaSpeech supports the LJSpeech dataset, a public single-speaker English dataset that is widely used as a TTS benchmark. It consists of 13,100 short audio clips (roughly 24 hours of speech) of a single female speaker reading passages from non-fiction books.
Preprocessing and Training
Before training, users must preprocess the data, including forced alignment with the Montreal Forced Aligner (MFA). Once alignments are prepared, training can be started with a single command (sketched below). Options for optimizing the training process, such as mixed precision, are also available.
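A plausible end-to-end sketch; the script names follow conventions common to related TTS repositories and are assumptions rather than verified entry points:

```sh
# Assumed pipeline: prepare audio/text for MFA, build training features from
# the alignments, then train. Script names may differ in the actual repository.
python3 prepare_align.py --dataset LJSpeech
python3 preprocess.py --dataset LJSpeech
python3 train.py --dataset LJSpeech
```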
Visualization with TensorBoard
PortaSpeech integrates with TensorBoard, providing users with visual insights into the training process. Users can monitor loss curves, synthesized spectrograms, and audio outputs, facilitating a comprehensive understanding of the model's performance.
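Launching TensorBoard is standard; only the log directory below is an assumption about the repository's layout:

```sh
# Point TensorBoard at the training log directory (path is an assumption).
tensorboard --logdir output/log/LJSpeech
```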
Additional Features and Notes
- PortaSpeech supports HiFi-GAN and MelGAN vocoders for improved audio quality.
- Techniques such as subword division and sorting samples by spectrogram length are employed to accelerate model training.
- Loss functions such as Diagonal Guided Attention (DGA) and Connectionist Temporal Classification (CTC) loss are implemented to improve word-to-phoneme alignment (a minimal sketch of a DGA-style loss follows this list).
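To make the DGA idea concrete, here is a minimal, generic PyTorch sketch of a guided attention loss that penalizes attention mass far from the diagonal. The function name, tensor shapes, and the width parameter g are assumptions for illustration; this is not the repository's actual implementation:

```python
import torch

def guided_attention_loss(attn, text_lens, mel_lens, g=0.2):
    """Diagonal guided attention (DGA) style loss.

    attn: (B, T_mel, T_text) soft alignment from mel frames to text tokens.
    text_lens, mel_lens: (B,) true lengths before padding.
    """
    B, T_mel, T_text = attn.shape
    t = torch.arange(T_mel, device=attn.device).float()   # frame indices
    s = torch.arange(T_text, device=attn.device).float()  # token indices
    # Positions normalized by each utterance's true length: (B, T_mel), (B, T_text).
    t_norm = t.unsqueeze(0) / mel_lens.unsqueeze(1).float()
    s_norm = s.unsqueeze(0) / text_lens.unsqueeze(1).float()
    # Penalty is near zero on the diagonal and grows away from it: (B, T_mel, T_text).
    w = 1.0 - torch.exp(-((t_norm.unsqueeze(2) - s_norm.unsqueeze(1)) ** 2) / (2 * g ** 2))
    # Mask out padded frames/tokens so they do not contribute to the loss.
    mask = (t.unsqueeze(0) < mel_lens.unsqueeze(1).float()).unsqueeze(2) & \
           (s.unsqueeze(0) < text_lens.unsqueeze(1).float()).unsqueeze(1)
    return (attn * w)[mask].mean()

# Example usage with random soft alignments for a batch of two utterances.
attn = torch.softmax(torch.randn(2, 80, 20), dim=-1)
loss = guided_attention_loss(attn,
                             text_lens=torch.tensor([20, 15]),
                             mel_lens=torch.tensor([80, 60]))
```

Minimizing a term of this shape encourages a roughly monotonic, diagonal alignment between text tokens and spectrogram frames, which is why it helps word-to-phoneme alignment in practice.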
Ongoing Developments
PortaSpeech is continuously evolving, with plans to extend its capabilities to support multi-speaker TTS scenarios, enhancing its versatility and application scope.
Conclusion
PortaSpeech represents an accessible and high-quality solution for those interested in text-to-speech technologies. Its blend of compactness, ease of integration, and high audio fidelity makes it a compelling choice for developers and researchers in the TTS domain.