Introducing VALL-E
VALL-E is an unofficial PyTorch implementation of the VALL-E zero-shot text-to-speech synthesizer. It is built on top of the EnCodec tokenizer developed by Facebook Research and treats speech synthesis as language modeling over EnCodec's discrete audio codes. The project will interest anyone working in deep learning, natural language processing, or audio synthesis.
Getting Started with VALL-E
For those eager to dive into VALL-E, a simple Google Colab example is available. Although it only covers overfitting a single utterance, it is a useful starting point for understanding the project's workflow. Note that a pretrained model has not yet been released and is still a work in progress.
Prerequisites and Installation
To begin using VALL-E, you need a GPU supported by DeepSpeed and a pre-installed CUDA or ROCm compiler: the trainer relies on DeepSpeed, which builds its custom ops against that toolchain.
Installing VALL-E is straightforward and can be done using the following command:
pip install git+https://github.com/enhuiz/vall-e
Alternatively, one can clone the repository directly:
git clone --recurse-submodules https://github.com/enhuiz/vall-e.git
It's important to note that the VALL-E code has been tested specifically with Python version 3.10.7.
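Before training, it is worth confirming that DeepSpeed can actually find a usable GPU and compiler. A minimal sanity check, assuming DeepSpeed was pulled in as a dependency of the install above, is:
python --version   # the code has been tested with Python 3.10.7
ds_report          # DeepSpeed's environment report: detected CUDA/ROCm toolchain and op compatibility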
Training the VALL-E Model
The training process involves several steps:
- Data Preparation: Gather your audio files (in .wav format) and corresponding text files (with a .normalized.txt suffix) in a designated folder; an example layout is sketched after this list.
- Quantizing Data: Convert the audio data into quantized representations using the command:
python -m vall_e.emb.qnt data/your_data
- Generating Phonemes: Transform text files into phonemes with the following command:
python -m vall_e.emb.g2p data/your_data
- Configuration Setup: Create configuration files in config/your_data for both the AR and NAR models. The example configurations in config/test and vall_e/config.py can serve as references.
- Model Training: Train your models using:
python -m vall_e.train yaml=config/your_data/ar_or_nar.yml
If needed, you can pause training by typing quit in your command-line interface, and your progress will be saved automatically.
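For reference, a data folder before and after preprocessing might look like the sketch below. The utterance name and the preprocessed file extensions (.qnt.pt and .phn.txt) are shown for illustration only; the exact names are determined by the repository's preprocessing scripts.
data/your_data/
  utt_0001.wav              # raw speech audio
  utt_0001.normalized.txt   # normalized transcript for the utterance
  utt_0001.qnt.pt           # quantized codes written by vall_e.emb.qnt (illustrative name)
  utt_0001.phn.txt          # phoneme sequence written by vall_e.emb.g2p (illustrative name)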
Exporting the Trained Model
Upon training completion, export both AR and NAR models to a specific file path:
python -m vall_e.export zoo/ar_or_nar.pt yaml=config/your_data/ar_or_nar.yml
This ensures the latest checkpoint is saved and ready for use.
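If you want to double-check an export, the resulting file can be opened like any PyTorch checkpoint. The one-liner below assumes zoo/ar.pt is an ordinary torch-saved object; adjust it to whatever structure the export actually produces:
python -c "import torch; print(type(torch.load('zoo/ar.pt', map_location='cpu')))"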
Synthesizing Audio
To synthesize speech from text using your trained models, execute:
python -m vall_e <text> <ref_path> <out_path> --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt
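As a concrete, hypothetical invocation using the checkpoints exported above and a recording from your dataset as the reference audio:
python -m vall_e "Hello, world." data/your_data/utt_0001.wav output.wav --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt
Here the quoted text, the reference wav, and output.wav are placeholders to replace with your own text, reference audio, and output path.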
Future Developments
The project continues to evolve, with a pretrained checkpoint and demos trained on the LibriTTS dataset planned for release.
Licensing and Citations
VALL-E uses EnCodec, which is released under the CC-BY-NC 4.0 license. Adhere to that license when using the code for audio quantization or decoding.
For academic purposes, relevant citations are provided:
@article{wang2023neural,
title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and others},
journal={arXiv preprint arXiv:2301.02111},
year={2023}
}
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and others},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}