Introduction to Phoneme-Level BERT for Text-to-Speech Enhancement
Background
In recent years, large-scale pre-trained language models have significantly advanced many AI applications, including text-to-speech (TTS) systems, where they help produce more natural prosodic patterns, the rhythms and intonations that make speech sound natural. However, existing pre-trained models typically operate at the word level or sup-phoneme level (units larger than a phoneme, such as sub-words) and are jointly trained with phonemes, the basic sound units of speech. This makes them inefficient for downstream TTS tasks, where only phonemes are needed.
What is PL-BERT?
PL-BERT, short for Phoneme-Level BERT, is a pre-trained language model designed to improve the naturalness of synthesized speech. It operates directly at the phoneme level, which makes it more efficient for TTS applications. Its distinguishing feature is that, alongside the standard masked phoneme prediction, it also predicts the corresponding graphemes, the letters or letter combinations that represent those phonemes.
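To make the idea concrete, here is a minimal sketch of a phoneme-level encoder with the two prediction heads described above. It is an illustration only, not the official implementation: the class name, hyperparameters, and vocabulary sizes are placeholders.

```python
import torch
import torch.nn as nn

class PhonemeLevelBERTSketch(nn.Module):
    """Illustrative phoneme-level encoder with two heads: one recovers
    masked phonemes, the other predicts the corresponding graphemes."""

    def __init__(self, n_phonemes=178, n_graphemes=30000,
                 hidden=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.phoneme_head = nn.Linear(hidden, n_phonemes)    # masked phoneme prediction
        self.grapheme_head = nn.Linear(hidden, n_graphemes)  # grapheme prediction

    def forward(self, phoneme_ids):
        hidden_states = self.encoder(self.embed(phoneme_ids))
        return self.phoneme_head(hidden_states), self.grapheme_head(hidden_states)

# Example: a batch of 2 sequences, 16 phoneme tokens each.
model = PhonemeLevelBERTSketch()
phoneme_logits, grapheme_logits = model(torch.randint(0, 178, (2, 16)))
```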
Advantages
In subjective evaluations, PL-BERT significantly improved Mean Opinion Scores (MOS) for the naturalness of synthesized speech, outperforming the state-of-the-art StyleTTS baseline on out-of-distribution (OOD) texts.
Getting Started
To get started with PL-BERT:
- Make sure Python 3.7 or higher is installed.
- Clone the PL-BERT repository from GitHub.
- Create a new Python environment to keep dependencies isolated.
- Install the necessary Python libraries and packages (a quick import check is sketched below).
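As a sanity check after installation, a short script can confirm that the core dependencies import cleanly. The package list below is an assumption based on typical dependencies for phoneme-level TTS pre-training, not the repository's official requirements file.

```python
# Quick environment check; the package names below are assumptions,
# not the project's official requirements list.
import importlib

for pkg in ("torch", "transformers", "phonemizer", "datasets"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{pkg}: not installed")
```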
Preprocessing
The preprocessing step prepares the English Wikipedia dataset used to train the model, converting the raw text into phoneme sequences. The pipeline could be extended to other languages such as Japanese, and work in that direction is ongoing.
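For illustration, a minimal phonemization step might look like the following, assuming the phonemizer package with an espeak backend installed on the system. This is a sketch of the general idea; the repository's own preprocessing notebook (which also covers text normalization) may differ in its details.

```python
# A minimal phonemization sketch (an assumption about the pipeline, not the
# repository's exact preprocessing code): turn raw sentences into phoneme
# strings using the phonemizer package with the espeak backend.
from phonemizer import phonemize

sentences = [
    "Pre-trained language models improve text to speech.",
    "Phonemes are the basic sound units of speech.",
]

phoneme_strings = phonemize(
    sentences,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
)

for text, phones in zip(sentences, phoneme_strings):
    print(text, "->", phones)
```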
Training
PL-BERT training is driven from a Jupyter notebook, which provides an interactive way to run the training code step by step. Users can modify the configuration settings to customize their training setup.
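For example, a few settings could be overridden programmatically before launching the notebook. The config path and key names below are illustrative assumptions, not the repository's documented schema.

```python
# Adjust a couple of training settings before running the notebook.
# The path and keys are illustrative assumptions, not a documented schema.
import yaml

config_path = "Configs/config.yml"   # hypothetical location

with open(config_path) as f:
    config = yaml.safe_load(f)

config["batch_size"] = 8             # example override
config["num_steps"] = 1_000_000      # example override

with open(config_path, "w") as f:
    yaml.safe_dump(config, f)
```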
Finetuning
PL-BERT can be integrated into other TTS models such as StyleTTS. This involves:
- Loading the pre-trained PL-BERT model within the StyleTTS framework.
- Adjusting the learning rates to keep fine-tuning stable.
- Replacing the StyleTTS text encoder with the PL-BERT encoder for improved performance (a minimal sketch of these steps follows).
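Here is a minimal sketch of those three steps using stand-in modules rather than the actual StyleTTS classes; the module names, checkpoint path, and learning rates are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn as nn

# Stand-in for the pre-trained PL-BERT encoder (the real one is loaded from
# the published checkpoint; the path and key below are hypothetical).
plbert = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
# state = torch.load("plbert_checkpoint.pt", map_location="cpu")
# plbert.load_state_dict(state["model"], strict=False)

class TinyTTS(nn.Module):
    """Placeholder for the downstream TTS model (e.g. StyleTTS)."""
    def __init__(self, text_encoder):
        super().__init__()
        self.text_encoder = text_encoder   # swapped in for the original text encoder
        self.decoder = nn.Linear(512, 80)  # e.g. projection toward mel frames

tts = TinyTTS(text_encoder=plbert)

# Give the pre-trained encoder a smaller learning rate than the newly
# initialized layers, a common way to keep fine-tuning stable.
optimizer = torch.optim.AdamW([
    {"params": tts.text_encoder.parameters(), "lr": 1e-5},
    {"params": tts.decoder.parameters(), "lr": 1e-4},
])
```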
Downloads and Resources
The pre-trained PL-BERT model, trained for 1 million steps on Wikipedia, is available for download, along with a demo on the LJSpeech dataset and a pre-modified StyleTTS repository.
Conclusion
PL-BERT is a promising development for TTS systems, especially for improving the naturalness of synthetic speech. Through phoneme-level processing and grapheme prediction, it contributes to more efficient and effective TTS applications.
For more technical details, prospective users can explore the research paper and check out audio samples.
References
- The project's references include NVIDIA's NeMo text processing and the TTSTextNormalization project.
By adopting PL-BERT, developers and researchers can significantly advance their TTS applications, achieving speech output that sounds more human-like and natural.