ZMM-TTS: Multilingual and Multispeaker Speech Synthesis
Introduction
ZMM-TTS, which stands for Zero-shot Multilingual and Multispeaker Text-to-Speech, is a framework for building speech synthesis systems that operate across multiple languages and voices. It builds on self-supervised models that learn representations from audio and text data without requiring extensive human labeling.
The core innovation of ZMM-TTS lies in its use of quantized latent speech representations: rather than mapping text directly to acoustic features, the system first predicts discrete, learned units of speech and then renders them as audio. This allows it to synthesize natural-sounding speech in various languages, even with little to no training data for a new language, and to generate speech that closely mimics a specific speaker's voice, a significant advance over traditional TTS systems.
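To make the idea concrete, here is a minimal sketch of one common way to obtain discrete speech units: clustering frame-level self-supervised encoder outputs with k-means, so each frame maps to a token ID. The checkpoint, cluster count, and file path are illustrative assumptions, not the project's exact recipe.

```python
# Minimal sketch of discretizing self-supervised speech features with
# k-means. The checkpoint, cluster count, and input path are illustrative
# assumptions, not ZMM-TTS's exact recipe.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_id = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

wav, sr = torchaudio.load("sample.wav")                    # any mono speech clip
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state.squeeze(0)  # (frames, 1024)

# In practice the codebook is learned over a large corpus; fitting on one
# utterance here just illustrates the frame -> token ID mapping.
kmeans = KMeans(n_clusters=128, n_init=10).fit(hidden.numpy())
tokens = kmeans.predict(hidden.numpy())                    # one token per frame
print(tokens[:20])
```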
Furthermore, the applications of ZMM-TTS are not confined to well-researched languages. It also shows considerable promise in synthesizing speech for languages with limited available data, known as low-resource languages.
Project Information
Release Details
The ZMM-TTS project has made available its code and pre-trained models across six languages: English, French, German, Portuguese, Spanish, and Swedish. These resources let researchers and developers test and extend the capabilities of TTS systems.
Demonstrations
You can explore samples of the synthesized speech on the project's demo page. This page provides insights into how the system handles different languages and speaker characteristics.
Installation
To get started with ZMM-TTS, a user needs a Python environment (version 3.8 or later) with PyTorch installed. Additional libraries, such as transformers for model support and speechbrain for speaker embedding extraction, are also necessary for full functionality.
```bash
git clone https://github.com/nii-yamagishilab-visitors/ZMM-TTS.git
cd ZMM-TTS
pip3 install -r requirements.txt
pip3 install transformers
pip3 install speechbrain
```
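After installation, a quick import check (not part of the repository's scripts, just a convenience) confirms the main dependencies are available:

```python
# Quick sanity check that the core dependencies are importable.
import torch
import transformers
import speechbrain

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("speechbrain:", speechbrain.__version__)
print("CUDA available:", torch.cuda.is_available())
```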
Data and Models
Pre-trained Models
ZMM-TTS builds on several pre-trained models covering various languages. These models serve as building blocks for the speech synthesis pipeline. Notable among them are:
- XLSR-53: A wav2vec 2.0 speech representation model pre-trained on audio from 53 languages.
- ECAPA-TDNN: A speaker recognition model, used here to extract speaker embeddings.
- XPhoneBERT: A phoneme-level pre-trained text encoder supporting 94 languages.
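The snippet below sketches how these checkpoints can be loaded from their standard public releases. The Hugging Face and SpeechBrain model IDs are assumptions based on those releases; check the repository's configuration files for the exact checkpoints its pipeline expects.

```python
# Sketch: loading public checkpoints that correspond to the models above.
# The model IDs are the standard public releases (an assumption); consult
# the ZMM-TTS configs for the exact checkpoints the pipeline uses.
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2Model
from speechbrain.pretrained import EncoderClassifier

# XLSR-53: multilingual wav2vec 2.0 encoder for speech representations.
xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# ECAPA-TDNN: speaker encoder producing fixed-size speaker embeddings.
spk_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

# XPhoneBERT: phoneme-level text encoder covering 94 languages.
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")

# Example: extract a speaker embedding from a reference waveform.
wav, sr = torchaudio.load("reference.wav")      # path is illustrative
wav = torchaudio.functional.resample(wav, sr, 16000)
with torch.no_grad():
    spk_emb = spk_encoder.encode_batch(wav)     # shape: (1, 1, 192)
print(spk_emb.shape)
```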
MM6 Dataset
The MM6 dataset is an essential component of ZMM-TTS, offering a balanced collection of multilingual, multispeaker speech data derived from sources such as the MLS database. It supports training the models in the TTS system, although its Swedish portion is not open source.
To acquire the Swedish data, interested parties should contact The Norwegian Language Bank.
Preprocessing and Training
ZMM-TTS requires careful preprocessing of audio and text data. Here is a brief overview of the preparation steps:
- Data Download and Normalization: Scripts are available to handle the downloading and normalizing of MLS data.
- Feature Extraction: This involves extracting discrete speech representations and speaker embeddings.
- Mel Spectrogram and Alignment: These steps convert text to features that the TTS system can learn from.
- Model Training: The system trains in three stages (sketched in the code after this list):
  - txt2vec: converts text into discrete speech representation vectors,
  - vec2mel: converts those vectors into mel spectrograms,
  - vec2wav: converts mel spectrograms into waveforms.
Each model is trained under specific configurations, optimized for multilingual abilities and speaker diversity.
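As a conceptual sketch of how the three stages hand data to one another, the functions below are hypothetical stand-ins for the repository's txt2vec, vec2mel, and vec2wav models; only the shapes of the intermediate tensors are the point, not the internals.

```python
# Conceptual sketch of the staged ZMM-TTS data flow. These functions are
# hypothetical stand-ins (returning random tensors) for the repository's
# txt2vec, vec2mel, and vec2wav models; only the handoff shapes matter here.
import torch

def txt2vec(phonemes: torch.Tensor, spk_emb: torch.Tensor, lang_id: int) -> torch.Tensor:
    """Stage 1 (stand-in): phoneme IDs -> discrete speech representations."""
    frames = phonemes.shape[-1] * 4              # fake duration upsampling
    return torch.randint(0, 128, (1, frames))    # (1, frames) token IDs

def vec2mel(tokens: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
    """Stage 2 (stand-in): discrete representations -> mel spectrogram."""
    return torch.randn(1, 80, tokens.shape[-1])  # (1, n_mels, frames)

def vec2wav(mel: torch.Tensor) -> torch.Tensor:
    """Stage 3 (stand-in): mel spectrogram -> waveform (vocoder)."""
    return torch.randn(1, mel.shape[-1] * 256)   # hop size of 256 assumed

phonemes = torch.randint(0, 300, (1, 50))        # dummy phoneme sequence
spk_emb = torch.randn(1, 192)                    # ECAPA-style speaker embedding
tokens = txt2vec(phonemes, spk_emb, lang_id=0)
mel = vec2mel(tokens, spk_emb)
wav = vec2wav(mel)
print(tokens.shape, mel.shape, wav.shape)
```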
Testing and Results
For testing, users must prepare the test data, including metadata and speaker embeddings. Synthesized samples can be quickly produced using provided scripts, and results are viewable in designated directories.
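For example, speaker embeddings for the test utterances can be precomputed and saved ahead of synthesis. The directory layout and .npy format below are assumptions for illustration; the repository's test scripts define the format they actually read.

```python
# Sketch: precompute speaker embeddings for test utterances. The directory
# layout and .npy format are assumptions; the repository's test scripts
# define the exact format they expect.
from pathlib import Path

import numpy as np
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

out_dir = Path("test_data/spk_emb")              # illustrative output path
out_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(Path("test_data/wavs").glob("*.wav")):
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        emb = encoder.encode_batch(wav).squeeze().numpy()  # (192,)
    np.save(out_dir / f"{wav_path.stem}.npy", emb)
```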
Future Work and Citations
ZMM-TTS continues to evolve, with future goals including refining few-shot training capabilities and improving zero-shot synthesis.
For more information or to cite the project, refer to the article by Gong et al. on arXiv, titled: "ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations".
Licensing
The project is developed under the BSD-3-Clause license, with various components like txt2vec, vec2mel, and vec2wav under the MIT license.
ZMM-TTS offers a compelling framework for advanced speech synthesis, democratizing access to high-quality, natural-sounding multilingual and multispeaker TTS technology.