# USLM: Unified Speech Language Model

## Introduction
The Unified Speech Language Model (USLM) is a sophisticated framework built upon the foundations of the SpeechTokenizer. It integrates both autoregressive and non-autoregressive models to effectively process and discern information within speech. The autoregressive (AR) model focuses on understanding the core content by analyzing tokens from the initial RVQ quantizer. Meanwhile, the non-autoregressive (NAR) model enriches the AR model's capabilities by extracting and generating tokens from subsequent quantizers based on the primary layer's tokens. This hierarchical modeling allows USLM to more comprehensively interpret and synthesize speech data.
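The AR/NAR split rests on the residual structure of RVQ tokens: the first quantizer captures most of the content, and each later quantizer encodes what the layers below it missed. The sketch below is a minimal, self-contained illustration of residual vector quantization with tiny hand-made codebooks; it is not the actual SpeechTokenizer implementation, and all names are illustrative.

```python
def rvq_encode(x, codebooks):
    """Residual vector quantization: each layer quantizes the residual
    left over by the previous layers (toy version, pure Python)."""
    residual = list(x)
    tokens = []
    for cb in codebooks:
        # pick the codebook entry closest to the current residual
        idx = min(range(len(cb)),
                  key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Decoding sums the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, -0.5]],    # layer 1: coarse content
    [[0.0, 0.0], [-0.1, 0.1], [0.05, -0.05]], # layer 2: fine residual detail
]
x = [0.9, -0.4]
tokens = rvq_encode(x, codebooks)  # → [1, 2]
err = lambda approx: sum((xi - ai) ** 2 for xi, ai in zip(x, approx))
full = err(rvq_decode(tokens, codebooks))          # both layers
coarse = err(rvq_decode(tokens[:1], codebooks[:1]))  # layer 1 only
# full < coarse: later layers refine the layer-1 approximation
```

In USLM's terms, the AR model predicts the layer-1 token sequence, while the NAR model fills in the refinement layers conditioned on it.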
## Installation

Getting started with USLM requires installing a few dependencies. Below is a streamlined guide:
- **PyTorch**: Install the required PyTorch packages.

  ```bash
  pip install torch==1.13.1 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
  pip install torchmetrics==0.11.1
  ```
- **Librosa** (for fbank features):

  ```bash
  pip install librosa==0.8.1
  ```
- **Phonemizer and Pypinyin**: Install `espeak-ng` and the required Python packages.

  ```bash
  apt-get install espeak-ng
  pip install phonemizer==3.2.1 pypinyin==0.48.0
  ```
- **Lhotse**: Remove any existing version and install the latest from the repository.

  ```bash
  pip uninstall lhotse
  pip install git+https://github.com/lhotse-speech/lhotse
  ```
- **K2**: Install the wheel from Hugging Face that matches your CUDA, PyTorch, and Python versions; the example below is for CUDA 11.6, PyTorch 1.13.1, and Python 3.10.

  ```bash
  pip install https://huggingface.co/csukuangfj/k2/resolve/main/cuda/k2-1.23.4.dev20230224+cuda11.6.torch1.13.1-cp310-cp310-linux_x86_64.whl
  ```
- **Icefall**: Clone the project, install its requirements, and add it to `PYTHONPATH`.

  ```bash
  git clone https://github.com/k2-fsa/icefall
  cd icefall
  pip install -r requirements.txt
  export PYTHONPATH=`pwd`/../icefall:$PYTHONPATH
  echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.zshrc
  echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.bashrc
  cd -
  source ~/.zshrc
  ```
- **SpeechTokenizer**:

  ```bash
  pip install -U speechtokenizer
  ```
- **USLM**: Clone the USLM repository and install it in editable mode.

  ```bash
  git clone https://github.com/0nutation/USLM
  cd USLM
  pip install -e .
  ```
## USLM Models

USLM is currently trained on the LibriTTS dataset. Because the training data is limited, its performance may not yet be fully optimized.

| Model | Dataset | Description |
|---|---|---|
| USLM_libri | LibriTTS | Trained on the LibriTTS dataset |
## Zero-shot TTS Using USLM

Zero-shot text-to-speech (TTS) synthesis with USLM involves the following steps:
- **Download the pre-trained SpeechTokenizer model:**

  ```bash
  st_dir="ckpt/speechtokenizer/"
  mkdir -p ${st_dir}
  cd ${st_dir}
  wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt"
  wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json"
  cd -
  ```
- **Download the pre-trained USLM model:**

  ```bash
  uslm_dir="ckpt/uslm/"
  mkdir -p ${uslm_dir}
  cd ${uslm_dir}
  wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/USLM.pt"
  wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols"
  cd -
  ```
- **Run inference:**

  ```bash
  out_dir="output/"
  mkdir -p ${out_dir}
  python3 bin/infer.py --output-dir ${out_dir}/ \
      --model-name uslm --norm-first true --add-prenet false \
      --share-embedding true \
      --audio-extractor SpeechTokenizer \
      --speechtokenizer-dir "${st_dir}" \
      --checkpoint=${uslm_dir}/USLM.pt \
      --text-tokens "${uslm_dir}/unique_text_tokens.k2symbols" \
      --text-prompts "mr Soames was a tall, spare man, of a nervous and excitable temperament." \
      --audio-prompts prompts/1580_141083_000002_000002.wav \
      --text "Begin with the fundamental steps of the process. This will give you a solid foundation to build upon and boost your confidence."
  ```
Alternatively, the whole process can be run with the provided script:

```bash
bash inference.sh
```
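Conceptually, the inference command wires together the pipeline from the introduction: the audio prompt is encoded into RVQ tokens, the AR model generates layer-1 tokens conditioned on the text and prompt, the NAR model fills in the remaining layers, and the result is decoded to a waveform. The toy sketch below illustrates only this data flow; the function names, layer count, and codebook size are illustrative stand-ins, not the repository's actual API.

```python
import random

N_QUANTIZERS = 8     # assumed number of RVQ layers
CODEBOOK_SIZE = 1024 # assumed codebook size

def ar_generate(text_tokens, prompt_layer1, n_steps, rng):
    """Stand-in for the AR model: emit layer-1 tokens one step at a time,
    conditioned on the text and the layer-1 tokens of the audio prompt."""
    out = list(prompt_layer1)
    for _ in range(n_steps):
        out.append(rng.randrange(CODEBOOK_SIZE))  # real model: sample from logits
    return out

def nar_generate(layer1, rng):
    """Stand-in for the NAR model: fill in layers 2..N in parallel,
    each conditioned on the layers below it."""
    layers = [layer1]
    for _ in range(N_QUANTIZERS - 1):
        layers.append([rng.randrange(CODEBOOK_SIZE) for _ in layer1])
    return layers

rng = random.Random(0)
# in the real pipeline these come from SpeechTokenizer and the k2 symbol table
prompt_layer1 = [rng.randrange(CODEBOOK_SIZE) for _ in range(50)]
text_tokens = [3, 14, 15, 9, 2]

layer1 = ar_generate(text_tokens, prompt_layer1, n_steps=200, rng=rng)
all_layers = nar_generate(layer1, rng)
# all_layers (one token sequence per quantizer) would then be decoded
# back to a waveform by the SpeechTokenizer decoder
```

Note how the AR stage fixes the length and content of the utterance, while the NAR stage only refines acoustic detail at already-fixed positions, which is why it can run in parallel.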
## Acknowledgment

The USLM project builds on foundations provided by VALL-E, which serves as the main codebase.
## Citation

If you use the USLM framework in research or other projects, please cite:

```bibtex
@misc{zhang2023speechtokenizer,
      title={SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models},
      author={Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2308.16692},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```