VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
VoiceFlow is a text-to-speech (TTS) system built on rectified flow matching for efficient speech synthesis. Presented in an ICASSP 2024 paper, it introduces a novel approach to TTS that aims to balance generation speed and quality in producing natural-sounding speech.
Environment Setup
The project is designed to run on Python 3.9 within a Linux environment. Users can create the necessary environment using conda:
conda create -n vflow python==3.9
conda activate vflow
pip install -r requirements.txt
Make sure your PATH is configured correctly; the setup scripts also rely on bash and perl being available. The repository avoids some complex dependencies by embedding specific versions of certain packages locally, such as torchdyn.
Data Preparation
VoiceFlow organizes data much like Kaldi: all data description files live in data/ subdirectories. An example setup includes files such as wav.scp, utts.list, utt2spk, and text, which list the dataset's audio files and their associated metadata.
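For illustration, each of these Kaldi-style files pairs an utterance ID with a value on every line. The IDs, paths, and speaker label below are made up, not taken from the repository:

```
# wav.scp  — utterance ID, then the audio path
LJ001-0001 /path/to/LJSpeech-1.1/wavs/LJ001-0001.wav
# utt2spk  — utterance ID, then the speaker ID
LJ001-0001 LJSpeech
# text     — utterance ID, then the transcript
LJ001-0001 Printing, in the only sense with which we are at present concerned
# utts.list — one utterance ID per line
LJ001-0001
```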
For LJSpeech, a readily processed dataset is available for download to simplify setup. Users can also prepare custom datasets by following the same file structure.
Once data is prepared, mel-spectrogram features essential for training are extracted using:
bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16
This step converts audio into a format that the model can process effectively.
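To make the step concrete, here is a minimal numpy-only sketch of what log-mel filterbank ("fbank") extraction computes. The sample rate, FFT size, hop, and mel count are common defaults assumed for illustration; the actual pipeline in extract_fbank.sh may use different parameters and a different implementation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_fbank(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, window each frame, and take its magnitude spectrum.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(wav[start:start + n_fft] * window)))
    spec = np.stack(frames)                      # (T, n_fft // 2 + 1)

    # Build triangular mel filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for j in range(lo, c):
            fbank[i - 1, j] = (j - lo) / max(c - lo, 1)
        for j in range(c, hi):
            fbank[i - 1, j] = (hi - j) / max(hi - c, 1)

    return np.log(spec @ fbank.T + 1e-10)        # (T, n_mels)

mels = log_mel_fbank(np.random.randn(22050))     # one second of noise
print(mels.shape)
```

The output is a (frames, n_mels) matrix, which is the kind of feature sequence the acoustic model is trained to generate.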
Training
The core of VoiceFlow's TTS capability is unlocked through a well-documented training process. Configuration files in YAML format control the training parameters, specifying paths to the training and validation datasets.
Training begins with:
python train.py -c configs/${your_yaml} -m ${model_name}
This command initiates the learning process, produces logs, and manages the checkpoints.
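As a rough illustration, such a YAML file might contain entries like the following. The keys and paths here are placeholders invented for this sketch; consult the actual files under configs/ for the real schema:

```
# Hypothetical sketch only — see configs/ for the real keys.
train_datalist: data/ljspeech/train/datalist.json   # placeholder path
valid_datalist: data/ljspeech/val/datalist.json     # placeholder path
n_mels: 80            # mel-spectrogram dimensionality
batch_size: 16
learning_rate: 1.0e-4
```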
Generating Data for ReFlow and Performing ReFlow
ReFlow is a key step in VoiceFlow: it rectifies the learned flow, straightening the sampling trajectories so that high-quality speech can be generated in fewer steps. After the initial training, the trained model is used to generate synthetic data for this refinement.
A Python script facilitates this intricate process:
python generate_for_reflow.py -c configs/${your_yaml} -m ${model_name} --EMA --max-utt-num 100000000 --dataset train --solver euler -t 10 --gt-dur
The script processes the training set for ReFlow, producing data that is used to fine-tune the model further. This cycle can significantly enhance the model's performance.
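The pairing step at the heart of reflow can be sketched in a toy numpy example: each noise sample z stays coupled with the model output x1 it produced, and the velocity field is retrained on the straight path x_t = (1 - t)·z + t·x1, whose target velocity is the constant x1 - z. The "model outputs" below are faked with a linear map; in VoiceFlow they come from generate_for_reflow.py.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))                    # noise endpoints

# Stand-in for samples produced by the pretrained model from z.
x1 = z @ np.array([[2.0, 0.0], [0.0, 0.5]]) + 3.0

t = rng.uniform(size=(1000, 1))                       # random times in [0, 1]
x_t = (1.0 - t) * z + t * x1                          # points on straight paths
target_v = x1 - z                                     # constant target velocity

# A model regressed from (x_t, t) onto target_v follows nearly straight
# trajectories, which is why sampling needs far fewer steps after reflow.
print(x_t.shape, target_v.shape)
```

Note that following target_v from any x_t for the remaining time 1 - t lands exactly on x1, which is what makes the rectified paths so cheap to integrate.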
Inference
Once the model is adequately trained, inference—the stage where the model generates speech from text—is a straightforward process:
python inference_dataset.py -c configs/${your_yaml} -m ${model_name} --EMA --solver euler -t 10
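A minimal sketch of what Euler-solver sampling with a step budget (as selected by `--solver euler -t 10`) amounts to: starting from Gaussian noise, take uniform steps of the ODE dx/dt = v(x, t). The velocity field below is a simple stand-in, not VoiceFlow's trained estimator.

```python
import numpy as np

def v(x, t):
    # Stand-in velocity field that transports samples toward mean 3.0.
    return 3.0 - x

def euler_sample(z, n_steps=10):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps.
    x, dt = z, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)
    return x

rng = np.random.default_rng(0)
x = euler_sample(rng.standard_normal(1000), n_steps=10)
print(float(x.mean()))
```

The step count trades speed for accuracy; rectified (reflowed) models tolerate very small budgets because their trajectories are nearly straight.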
The resulting synthesized mel-spectrograms can be converted to waveforms using the vocoder tools provided in the hifigan/ directory.
Acknowledgement
VoiceFlow builds on existing open-source projects, including Kaldi, UniCATS-CTX-vec2wav, GradTTS, VITS, and CFM, which provided essential utilities, model architectures, and training pipelines for the development of this TTS system.
Easter Eggs and Experimental Features
VoiceFlow contains innovative experimental features, including:
- Voice Conversion: Utilizing the editing capabilities of normalizing flows.
- Likelihood Estimation: Leveraging generative models for data likelihood measurement.
- Optimal Transport: Experimenting with conditionally optimal transport paths.
- Different Estimator Architectures: Offering flexibility in choosing estimator types beyond the default.
- Better Alignment Learning: Exploring alternative methods for improved model alignment.
These features highlight VoiceFlow's potential and its experimental edge in the TTS domain. The repository also invites users to explore these functionalities, though with caution due to their experimental nature.
VoiceFlow represents an exciting step forward in TTS technology, offering users the tools to create high-quality synthetic speech efficiently. The innovations in rectified flow matching distinguish it as a cutting-edge project in digital speech processing.