Introducing VALL-E
VALL-E is an unofficial PyTorch implementation of the VALL-E zero-shot text-to-speech synthesizer. It is built on top of the EnCodec tokenizer developed by Facebook Research and treats speech synthesis as language modeling over EnCodec's discrete audio codes. The project will interest anyone working in deep learning, natural language processing, or audio synthesis.
Getting Started with VALL-E
For those eager to dive into VALL-E, a simple Google Colab example is available. Although it only covers overfitting a single utterance, it is a useful starting point for understanding the project's workflow. Note that a pretrained model has not yet been released and is still a work in progress.
Prerequisites and Installation
To begin using VALL-E, you need a GPU supported by DeepSpeed and a pre-installed CUDA or ROCm compiler: the trainer relies on DeepSpeed, which builds its custom ops against that toolchain.
Installing VALL-E is straightforward and can be done using the following command:
pip install git+https://github.com/enhuiz/vall-e
Alternatively, one can clone the repository directly:
git clone --recurse-submodules https://github.com/enhuiz/vall-e.git
It's important to note that the VALL-E code has been tested specifically with Python version 3.10.7.
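Before training, it is worth confirming that DeepSpeed can actually find a usable GPU and compiler. A minimal sanity check, assuming DeepSpeed was pulled in as a dependency of the install above, is:
python --version   # the code has been tested with Python 3.10.7
ds_report          # DeepSpeed's environment report: detected CUDA/ROCm toolchain and op compatibility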
Training the VALL-E Model
The training process involves several steps:
- Data Preparation: Gather your audio files (in .wav format) and corresponding text files (with a .normalized.txt suffix) in a designated folder; an example layout is sketched after this list.
- Quantizing Data: Convert the audio data into quantized representations using the command:
python -m vall_e.emb.qnt data/your_data
- Generating Phonemes: Transform text files into phonemes with the following command:
python -m vall_e.emb.g2p data/your_data
- Configuration Setup: Create configuration files in config/your_data for both the AR and NAR models. The example configurations in config/test and vall_e/config.py can serve as references.
- Model Training: Train your models using:
python -m vall_e.train yaml=config/your_data/ar_or_nar.yml
If needed, you can pause training by typing quit in your command-line interface, and your progress will be saved automatically.
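For reference, a data folder before and after preprocessing might look like the sketch below. The utterance name and the preprocessed file extensions (.qnt.pt and .phn.txt) are shown for illustration only; the exact names are determined by the repository's preprocessing scripts.
data/your_data/
  utt_0001.wav              # raw speech audio
  utt_0001.normalized.txt   # normalized transcript for the utterance
  utt_0001.qnt.pt           # quantized codes written by vall_e.emb.qnt (illustrative name)
  utt_0001.phn.txt          # phoneme sequence written by vall_e.emb.g2p (illustrative name)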
Exporting the Trained Model
Upon training completion, export both AR and NAR models to a specific file path:
python -m vall_e.export zoo/ar_or_nar.pt yaml=config/your_data/ar_or_nar.yml
This ensures the latest checkpoint is saved and ready for use.
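If you want to double-check an export, the resulting file can be opened like any PyTorch checkpoint. The one-liner below assumes zoo/ar.pt is an ordinary torch-saved object; adjust it to whatever structure the export actually produces:
python -c "import torch; print(type(torch.load('zoo/ar.pt', map_location='cpu')))"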
Synthesizing Audio
To synthesize speech from text using your trained models, execute:
python -m vall_e <text> <ref_path> <out_path> --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt
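As a concrete, hypothetical invocation using the checkpoints exported above and a recording from your dataset as the reference audio:
python -m vall_e "Hello, world." data/your_data/utt_0001.wav output.wav --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt
Here the quoted text, the reference wav, and output.wav are placeholders to replace with your own text, reference audio, and output path.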
Future Developments
The project continues to evolve, with a pretrained checkpoint and demos trained on the LibriTTS dataset planned for release.
Licensing and Citations
VALL-E uses EnCodec, which is released under the CC-BY-NC 4.0 license. Adhere to that license when using the code for audio quantization or decoding.
For academic purposes, relevant citations are provided:
@article{wang2023neural,
title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and others},
journal={arXiv preprint arXiv:2301.02111},
year={2023}
}
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and others},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}