# USLM: Unified Speech Language Model

## Introduction
The Unified Speech Language Model (USLM) is a sophisticated framework built upon the foundations of the SpeechTokenizer. It integrates both autoregressive and non-autoregressive models to effectively process and discern information within speech. The autoregressive (AR) model focuses on understanding the core content by analyzing tokens from the initial RVQ quantizer. Meanwhile, the non-autoregressive (NAR) model enriches the AR model's capabilities by extracting and generating tokens from subsequent quantizers based on the primary layer's tokens. This hierarchical modeling allows USLM to more comprehensively interpret and synthesize speech data.
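The AR/NAR split rests on the residual structure of RVQ tokens: the first quantizer captures most of the content, and each later quantizer encodes what the layers below it missed. The sketch below is a minimal, self-contained illustration of residual vector quantization with tiny hand-made codebooks; it is not the actual SpeechTokenizer implementation, and all names are illustrative.

```python
def rvq_encode(x, codebooks):
    """Residual vector quantization: each layer quantizes the residual
    left over by the previous layers (toy version, pure Python)."""
    residual = list(x)
    tokens = []
    for cb in codebooks:
        # pick the codebook entry closest to the current residual
        idx = min(range(len(cb)),
                  key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Decoding sums the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, -0.5]],    # layer 1: coarse content
    [[0.0, 0.0], [-0.1, 0.1], [0.05, -0.05]], # layer 2: fine residual detail
]
x = [0.9, -0.4]
tokens = rvq_encode(x, codebooks)  # → [1, 2]
err = lambda approx: sum((xi - ai) ** 2 for xi, ai in zip(x, approx))
full = err(rvq_decode(tokens, codebooks))          # both layers
coarse = err(rvq_decode(tokens[:1], codebooks[:1]))  # layer 1 only
# full < coarse: later layers refine the layer-1 approximation
```

In USLM's terms, the AR model predicts the layer-1 token sequence, while the NAR model fills in the refinement layers conditioned on it.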
## Installation

Getting started with USLM requires installing a few dependencies. Below is a streamlined guide:
- **PyTorch**: Install the required PyTorch packages.

  ```bash
  pip install torch==1.13.1 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
  pip install torchmetrics==0.11.1
  ```
- **Librosa** (for fbank features):

  ```bash
  pip install librosa==0.8.1
  ```
- **Phonemizer and Pypinyin**: Install `espeak-ng` and the required Python packages.

  ```bash
  apt-get install espeak-ng
  pip install phonemizer==3.2.1 pypinyin==0.48.0
  ```
- **Lhotse**: Remove any existing version and install the latest from the repository.

  ```bash
  pip uninstall lhotse
  pip install git+https://github.com/lhotse-speech/lhotse
  ```
- **K2**: Install the wheel from Hugging Face that matches your CUDA, PyTorch, and Python versions; the example below is for CUDA 11.6, PyTorch 1.13.1, and Python 3.10.

  ```bash
  pip install https://huggingface.co/csukuangfj/k2/resolve/main/cuda/k2-1.23.4.dev20230224+cuda11.6.torch1.13.1-cp310-cp310-linux_x86_64.whl
  ```
- **Icefall**: Clone the project, install its requirements, and add it to `PYTHONPATH`.

  ```bash
  git clone https://github.com/k2-fsa/icefall
  cd icefall
  pip install -r requirements.txt
  export PYTHONPATH=`pwd`/../icefall:$PYTHONPATH
  echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.zshrc
  echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.bashrc
  cd -
  source ~/.zshrc
  ```
- **SpeechTokenizer**:

  ```bash
  pip install -U speechtokenizer
  ```
- **USLM**: Clone the USLM repository and install it in editable mode.

  ```bash
  git clone https://github.com/0nutation/USLM
  cd USLM
  pip install -e .
  ```
## USLM Models

USLM is currently trained on the LibriTTS dataset. Because the training data is limited, its performance may not yet be fully optimized.

| Model | Dataset | Description |
|---|---|---|
| USLM_libri | LibriTTS | Trained on the LibriTTS dataset |
## Zero-shot TTS Using USLM

Zero-shot text-to-speech (TTS) synthesis with USLM involves the following steps:
- **Download the pre-trained SpeechTokenizer model:**

  ```bash
  st_dir="ckpt/speechtokenizer/"
  mkdir -p ${st_dir}
  cd ${st_dir}
  wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt"
  wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json"
  cd -
  ```
- **Download the pre-trained USLM model:**

  ```bash
  uslm_dir="ckpt/uslm/"
  mkdir -p ${uslm_dir}
  cd ${uslm_dir}
  wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/USLM.pt"
  wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols"
  cd -
  ```
- **Run inference:**

  ```bash
  out_dir="output/"
  mkdir -p ${out_dir}
  python3 bin/infer.py --output-dir ${out_dir}/ \
      --model-name uslm --norm-first true --add-prenet false \
      --share-embedding true \
      --audio-extractor SpeechTokenizer \
      --speechtokenizer-dir "${st_dir}" \
      --checkpoint=${uslm_dir}/USLM.pt \
      --text-tokens "${uslm_dir}/unique_text_tokens.k2symbols" \
      --text-prompts "mr Soames was a tall, spare man, of a nervous and excitable temperament." \
      --audio-prompts prompts/1580_141083_000002_000002.wav \
      --text "Begin with the fundamental steps of the process. This will give you a solid foundation to build upon and boost your confidence."
  ```
Alternatively, the whole process can be run with the provided script:

```bash
bash inference.sh
```
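Conceptually, the inference command wires together the pipeline from the introduction: the audio prompt is encoded into RVQ tokens, the AR model generates layer-1 tokens conditioned on the text and prompt, the NAR model fills in the remaining layers, and the result is decoded to a waveform. The toy sketch below illustrates only this data flow; the function names, layer count, and codebook size are illustrative stand-ins, not the repository's actual API.

```python
import random

N_QUANTIZERS = 8     # assumed number of RVQ layers
CODEBOOK_SIZE = 1024 # assumed codebook size

def ar_generate(text_tokens, prompt_layer1, n_steps, rng):
    """Stand-in for the AR model: emit layer-1 tokens one step at a time,
    conditioned on the text and the layer-1 tokens of the audio prompt."""
    out = list(prompt_layer1)
    for _ in range(n_steps):
        out.append(rng.randrange(CODEBOOK_SIZE))  # real model: sample from logits
    return out

def nar_generate(layer1, rng):
    """Stand-in for the NAR model: fill in layers 2..N in parallel,
    each conditioned on the layers below it."""
    layers = [layer1]
    for _ in range(N_QUANTIZERS - 1):
        layers.append([rng.randrange(CODEBOOK_SIZE) for _ in layer1])
    return layers

rng = random.Random(0)
# in the real pipeline these come from SpeechTokenizer and the k2 symbol table
prompt_layer1 = [rng.randrange(CODEBOOK_SIZE) for _ in range(50)]
text_tokens = [3, 14, 15, 9, 2]

layer1 = ar_generate(text_tokens, prompt_layer1, n_steps=200, rng=rng)
all_layers = nar_generate(layer1, rng)
# all_layers (one token sequence per quantizer) would then be decoded
# back to a waveform by the SpeechTokenizer decoder
```

Note how the AR stage fixes the length and content of the utterance, while the NAR stage only refines acoustic detail at already-fixed positions, which is why it can run in parallel.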
## Acknowledgment

The USLM project builds on foundations provided by VALL-E, which serves as the main codebase.
## Citation

If you use the USLM framework in research or other projects, please cite:

```bibtex
@misc{zhang2023speechtokenizer,
      title={SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models},
      author={Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2308.16692},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```