SpeechTokenizer: Unifying Speech Tokenization for Language Models
Introduction
SpeechTokenizer is a project that aims to improve speech language modeling through a unified tokenization approach. It adopts an encoder-decoder architecture built around residual vector quantization (RVQ), unifying semantic and acoustic tokens by distributing different aspects of speech information hierarchically across the RVQ layers. This hierarchical separation of speech elements makes it a valuable asset for speech language processing.
At the core of SpeechTokenizer is its hierarchical treatment of speech content. The first quantizer layer predominantly produces semantic tokens that capture the content of the speech, while the subsequent quantizers mainly capture timbre and other acoustic detail, recovering the information discarded by the first quantizer.
The released models operate on 16 kHz monophonic speech. Two principal checkpoints are highlighted:
- speechtokenizer_hubert_avg: trained on LibriSpeech, using the average representation across all HuBERT layers as semantic guidance.
- speechtokenizer_snake: built with the Snake activation function and trained on both LibriSpeech and Common Voice.
Features and Capabilities
Model Flexibility and Training:
SpeechTokenizer offers flexibility in speech model training and usage, supporting a range of configurations to meet specific research needs. With the release of the training code and several checkpoints, users can adapt the models to their own datasets or build on the pre-trained setups.
Installation & Setup:
Setting up SpeechTokenizer requires Python 3.8 or higher and a compatible version of PyTorch. The project can be installed directly via pip or cloned from the repository for local installation:
pip install -U speechtokenizer
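Alternatively, the repository can be cloned and installed from source. The commands below assume the project's upstream GitHub location:
git clone https://github.com/ZhangXInFD/SpeechTokenizer.git
cd SpeechTokenizer
pip install .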
Model List and Usage:
Users have access to models such as speechtokenizer_hubert_avg and speechtokenizer_snake, each offering distinct characteristics based on its training data and techniques. Models can be loaded and evaluated as follows:
from speechtokenizer import SpeechTokenizer
# Paths to the downloaded model configuration and checkpoint
config_path = '/path/config.json'
ckpt_path = '/path/SpeechTokenizer.pt'
# Load the pre-trained model and switch it to inference (eval) mode
model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()
Once loaded, a model encodes speech into discrete RVQ tokens whose first layer carries the semantic content and whose remaining layers carry timbre and other acoustic detail, as in the sketch below.
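The following minimal sketch continues from the model loaded above. The encode/decode calls and the (n_q, batch, time) code layout reflect the library's typical usage; the audio file path is a placeholder.
import torch
import torchaudio
# Load a waveform, keep a single channel, and resample to the model's rate (16 kHz)
wav, sr = torchaudio.load('/path/speech.wav')  # placeholder path
if wav.shape[0] > 1:
    wav = wav[:1, :]
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)  # shape: (batch, channels, time)
# Extract discrete codes with shape (n_q, batch, time): one token stream per RVQ layer
with torch.no_grad():
    codes = model.encode(wav)
semantic_tokens = codes[:1]   # first quantizer: semantic content
acoustic_tokens = codes[1:]   # remaining quantizers: timbre and residual detail
# Reconstruct the waveform from all token streams
reconstructed = model.decode(torch.cat([semantic_tokens, acoustic_tokens], dim=0))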
Training SpeechTokenizer:
To facilitate custom model training, SpeechTokenizer provides scripts and configuration files for preprocessing data, extracting the semantic representations used as training targets, and launching training. Users can follow these scripts to set up the training environment and begin developing their own models.
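As a rough outline only, a custom training run generally involves the steps below. The script names and flags are illustrative placeholders rather than the repository's actual entry points; consult the released training code and configuration files for the exact commands.
# 1. Extract semantic representations (e.g., averaged HuBERT features) as the training target
python extract_representations.py --data_dir /path/to/audio --out_dir /path/to/reps  # hypothetical script
# 2. Adjust the training configuration (sample rate, number of quantizers, codebook size, data paths)
# 3. Launch training with the adjusted configuration
python train.py --config config.json  # hypothetical script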
Contributions and Impact
SpeechTokenizer stands out by offering a comprehensive approach to speech tokenization, deftly managing the balance between semantic and acoustic dimensions of speech. Its application extends across various linguistic and computational fields, enhancing the robustness and accuracy of speech language models.
Citation
Researchers utilizing SpeechTokenizer in their work are encouraged to cite the following paper:
@misc{zhang2023speechtokenizer,
title={SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models},
author={Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu},
year={2023},
eprint={2308.16692},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
SpeechTokenizer is made available under the Apache 2.0 license, supporting open and collaborative development practices.
In summary, SpeechTokenizer presents a unique and efficient mechanism for improving speech processing in language models, offering cutting-edge technology that empowers researchers to delve deeper into the intricacies of spoken language.