Introduction to XPhoneBERT
XPhoneBERT is the first pre-trained multilingual model for phoneme representations in text-to-speech (TTS). It uses the BERT-base architecture and follows RoBERTa's pre-training methodology, trained on 330 million phoneme-level sentences from nearly 100 languages and locales. Employing XPhoneBERT as an input phoneme encoder improves the naturalness and prosody of speech generated by neural TTS models, and it also helps produce high-quality speech when only limited training data is available.
The architecture details and experimental findings are documented in the INTERSPEECH 2023 paper "XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech".
Installation Essentials for Using XPhoneBERT
To start using XPhoneBERT with the `transformers` library, users will need the following installations:
- Install the `transformers` library via pip: `pip install transformers`. Alternatively, it can be installed from source for more customization.
- Install `text2phonemesequence` via pip: `pip install text2phonemesequence`. This package converts text into phoneme-level sequences, an essential step in constructing the multilingual phoneme-level pre-training data. It is built on the CharsiuG2P toolkit for text-to-phoneme conversion and the segments toolkit for phoneme segmentation.
Important Notes
- `text2phonemesequence` requires the appropriate ISO 639-3 language code at initialization. A list of supported languages and their corresponding codes is provided with the `text2phonemesequence` package.
- Input sequences passed to `text2phonemesequence` must already be word-segmented, and in some cases text normalization is advisable beforehand. During XPhoneBERT's data preparation, sentence and word segmentation are performed with the spaCy toolkit for most languages, while Vietnamese uses the VnCoreNLP toolkit. For text normalization, different toolkits are employed, such as NVIDIA NeMo for English, German, Spanish, and Chinese, and Vinorm for Vietnamese. A minimal segmentation sketch follows this list.
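To make the word-segmentation requirement concrete, here is a minimal sketch that tokenizes raw English text with spaCy's blank pipeline before converting it to phonemes. The choice of `spacy.blank("en")` and the 'eng' language code are assumptions for this illustration, not prescribed by XPhoneBERT:

import spacy
from text2phonemesequence import Text2PhonemeSequence

# A blank English pipeline provides rule-based tokenization without a
# downloaded model (an assumption for this sketch; any word segmenter works).
nlp = spacy.blank("en")

# Initialize the converter with an ISO 639-3 code ('eng' for English).
text2phone_model = Text2PhonemeSequence(language="eng", is_cuda=False)

raw_text = "This is a test sentence."
# Word-segment the input by joining spaCy tokens with single spaces.
segmented = " ".join(token.text for token in nlp(raw_text))

# Convert the word-segmented text into a phoneme sequence.
print(text2phone_model.infer_sentence(segmented))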
XPhoneBERT Pre-trained Model Details
The following summarizes the core attributes of the pre-trained XPhoneBERT model:
| Model | # Parameters | Architecture | Max length (tokens) | Pre-training data |
|---|---|---|---|---|
| `vinai/xphonebert-base` | 88 million | base | 512 | 330 million phoneme-level sentences from nearly 100 languages and locales |
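These attributes can be checked programmatically with standard `transformers` calls. The following is a minimal sketch; note that RoBERTa-style checkpoints store the position-embedding count as the usable maximum length plus two reserved offset slots:

from transformers import AutoConfig, AutoModel

# Fetch the configuration without downloading the full weights.
config = AutoConfig.from_pretrained("vinai/xphonebert-base")
# For RoBERTa-style models this equals the usable max length plus 2.
print(config.max_position_embeddings)

# Load the weights and count parameters (roughly 88 million expected).
model = AutoModel.from_pretrained("vinai/xphonebert-base")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")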
Getting Started: Example of XPhoneBERT Usage
Here's a simple demonstration of how XPhoneBERT can be employed in practice:
import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

# Load XPhoneBERT and its tokenizer
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Load Text2PhonemeSequence ('jpn' is the ISO 639-3 code for Japanese)
text2phone_model = Text2PhonemeSequence(language='jpn', is_cuda=True)

# Input sequence that is already WORD-SEGMENTED
sentence = "これ は 、 テスト テキスト です ."
input_phonemes = text2phone_model.infer_sentence(sentence)

# Tokenize the phoneme sequence and extract hidden-state features
input_ids = tokenizer(input_phonemes, return_tensors="pt")
with torch.no_grad():
    features = xphonebert(**input_ids)
This example loads the model, the tokenizer, and a phoneme-sequence generator, then processes an example sentence. The resulting features can be fed into downstream TTS components; a sketch of one such use follows below.
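For instance, a downstream component often wants per-phoneme vectors or a single utterance-level vector. The mean-pooling below is an illustrative assumption, not part of the XPhoneBERT release; it reuses `features` and `input_ids` from the example above:

# Per-token hidden states: shape (1, seq_len, hidden_size);
# hidden_size is 768 for the base architecture.
hidden = features.last_hidden_state
mask = input_ids["attention_mask"].unsqueeze(-1).float()

# Mean-pool over non-padding positions to obtain one utterance-level vector.
utterance_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(utterance_vector.shape)  # torch.Size([1, 768])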
Licensing Information
The software is released under the MIT License, which grants users the freedom to use, modify, and distribute it, provided the license notice is included in all copies or substantial portions of the software.
The XPhoneBERT project opens new possibilities in speech synthesis, breaking language barriers and setting a new standard in multilingual TTS applications.