Introduction to XPhoneBERT
XPhoneBERT is the first pre-trained multilingual model for phoneme representations in text-to-speech (TTS). It uses the BERT-base architecture and follows RoBERTa's pre-training methodology, trained on 330 million phoneme-level sentences from nearly 100 languages and locales. Employing XPhoneBERT as an input phoneme encoder improves the naturalness and prosody of speech generated by neural TTS models, and it also helps produce high-quality speech when only limited training data is available.
The architecture details and experimental findings are documented in the INTERSPEECH 2023 paper "XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech".
Installation Essentials for Using XPhoneBERT
To start using XPhoneBERT with the `transformers` library, users will need the following installations:
- Install the `transformers` library via pip: `pip install transformers`. Alternatively, it can be installed from source for more customization.
- Install `text2phonemesequence` via pip: `pip install text2phonemesequence`. This package converts text into phoneme-level sequences, an essential step in constructing the multilingual phoneme-level pre-training data. It is built on the CharsiuG2P toolkit for text-to-phoneme conversion and the segments toolkit for phoneme segmentation.
Important Notes
- `text2phonemesequence` requires the appropriate ISO 639-3 language code at initialization. A list of supported languages and their corresponding codes is provided with the `text2phonemesequence` package.
- Input sequences passed to `text2phonemesequence` must already be word-segmented, and in some cases text normalization is advisable beforehand. During XPhoneBERT's data preparation, sentence and word segmentation are performed with the spaCy toolkit for most languages, while Vietnamese uses the VnCoreNLP toolkit. For text normalization, different toolkits are employed, such as NVIDIA NeMo for English, German, Spanish, and Chinese, and Vinorm for Vietnamese. A minimal segmentation sketch follows this list.
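To make the word-segmentation requirement concrete, here is a minimal sketch that tokenizes raw English text with spaCy's blank pipeline before converting it to phonemes. The choice of `spacy.blank("en")` and the 'eng' language code are assumptions for this illustration, not prescribed by XPhoneBERT:

import spacy
from text2phonemesequence import Text2PhonemeSequence

# A blank English pipeline provides rule-based tokenization without a
# downloaded model (an assumption for this sketch; any word segmenter works).
nlp = spacy.blank("en")

# Initialize the converter with an ISO 639-3 code ('eng' for English).
text2phone_model = Text2PhonemeSequence(language="eng", is_cuda=False)

raw_text = "This is a test sentence."
# Word-segment the input by joining spaCy tokens with single spaces.
segmented = " ".join(token.text for token in nlp(raw_text))

# Convert the word-segmented text into a phoneme sequence.
print(text2phone_model.infer_sentence(segmented))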
XPhoneBERT Pre-trained Model Details
The following summarizes the core attributes of the pre-trained XPhoneBERT model:
| Model | # Parameters | Architecture | Max length (tokens) | Pre-training data |
|---|---|---|---|---|
| `vinai/xphonebert-base` | 88 million | base | 512 | 330 million phoneme-level sentences from nearly 100 languages and locales |
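These attributes can be checked programmatically with standard `transformers` calls. The following is a minimal sketch; note that RoBERTa-style checkpoints store the position-embedding count as the usable maximum length plus two reserved offset slots:

from transformers import AutoConfig, AutoModel

# Fetch the configuration without downloading the full weights.
config = AutoConfig.from_pretrained("vinai/xphonebert-base")
# For RoBERTa-style models this equals the usable max length plus 2.
print(config.max_position_embeddings)

# Load the weights and count parameters (roughly 88 million expected).
model = AutoModel.from_pretrained("vinai/xphonebert-base")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")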
Getting Started: Example of XPhoneBERT Usage
Here's a simple demonstration of how XPhoneBERT can be employed in practice:
import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

# Load XPhoneBERT and its tokenizer
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Load Text2PhonemeSequence ('jpn' is the ISO 639-3 code for Japanese)
text2phone_model = Text2PhonemeSequence(language='jpn', is_cuda=True)

# Input sequence that is already WORD-SEGMENTED
sentence = "これ は 、 テスト テキスト です ."
input_phonemes = text2phone_model.infer_sentence(sentence)

# Tokenize the phoneme sequence and extract hidden-state features
input_ids = tokenizer(input_phonemes, return_tensors="pt")
with torch.no_grad():
    features = xphonebert(**input_ids)
This example loads the model, the tokenizer, and a phoneme-sequence generator, then processes an example sentence. The resulting features can be fed into downstream TTS components; a sketch of one such use follows below.
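For instance, a downstream component often wants per-phoneme vectors or a single utterance-level vector. The mean-pooling below is an illustrative assumption, not part of the XPhoneBERT release; it reuses `features` and `input_ids` from the example above:

# Per-token hidden states: shape (1, seq_len, hidden_size);
# hidden_size is 768 for the base architecture.
hidden = features.last_hidden_state
mask = input_ids["attention_mask"].unsqueeze(-1).float()

# Mean-pool over non-padding positions to obtain one utterance-level vector.
utterance_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(utterance_vector.shape)  # torch.Size([1, 768])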
Licensing Information
The software is released under the MIT License, which grants users the freedom to use, modify, and distribute it, provided the license notice is included in all copies or substantial portions of the software.
The XPhoneBERT project opens new possibilities in speech synthesis, breaking language barriers and setting a new standard in multilingual TTS applications.