Introduction to CTX-vec2wav: An Acoustic Context-Aware Vocoder
CTX-vec2wav is an acoustic context-aware vocoder introduced in the paper "UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding," published at AAAI 2024. The project aims to enhance how computers generate human-like speech by incorporating acoustic context, thereby producing more natural and contextually appropriate audio. It forms part of the larger UniCATS framework and is closely related to its companion component, CTX-txt2vec.
Environment Setup
To get started with CTX-vec2wav, developers need a Linux environment with Python 3.9. The environment can be set up with conda, a popular package and environment manager.
- Create a virtual environment and install the dependencies:
  conda create -n ctxv2w python=3.9
  conda activate ctxv2w
  pip install -r requirements.txt
- Configuration: run the provided script to set the necessary paths:
  source path.sh
- Permissions: ensure all utility scripts are executable:
  chmod +x utils/*
Developers also need the bash and perl commands installed on their Linux system.
Inference (Vocoding with Acoustic Context)
For registered utterances, inference is a streamlined process that uses a VQ (vector-quantized) index sequence together with an acoustic prompt to produce the speech output.
- Basic inference: to vocode the collected utterances, use the following command:
  bash run.sh --stage 3 --stop_stage 3
- Custom inference: for more tailored tasks, users can create a subset for testing:
  subset_data_dir.sh data/eval_all 200 data/eval_subset
  bash run.sh --stage 3 --stop_stage 3 --eval_set "eval_subset"
- Manual inference: users can prepare their own VQ index sequences and acoustic prompts by following these steps (a sketch of one way to build the .scp files appears after this list):
  - Create a feats.scp file mapping each utterance to its VQ index sequence.
  - Calculate the frame numbers using:
    feat-to-len.py scp:/path/to/feats.scp > /path/to/utt2num_frames
  - Prepare a prompt.scp file for the acoustic prompts.
  - Execute the decode process:
    decode.py \
        --sampling-rate 16000 \
        --feats-scp /path/to/feats.scp \
        --prompt-scp /path/to/prompt.scp \
        --num-frames /path/to/utt2num_frames \
        --config /path/to/config.yaml \
        --vq-codebook /path/to/codebook.npy \
        --checkpoint /path/to/checkpoint \
        --outdir /path/to/output/wav
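The .scp files above follow the Kaldi script-file convention: one "utterance-id value" pair per line, typically pointing into a companion .ark archive. Below is a minimal, hypothetical sketch of one way to write such pairs with the kaldiio package; the utterance ID, array shapes, and archive names are illustrative assumptions, and the exact feature layout that decode.py expects should be confirmed against the repository's data preparation scripts.

# Hypothetical preparation of decode.py inputs (sketch only).
# Assumes kaldiio is available; it is commonly used in ESPnet-style projects,
# but check requirements.txt to confirm.
import numpy as np
from kaldiio import WriteHelper

# Assumed layouts: VQ index sequence as a (num_frames, num_groups) array and
# acoustic prompt features as a (prompt_frames, feat_dim) array.
vq_indices = {"utt_0001": np.random.randint(0, 1024, size=(523, 2)).astype(np.float32)}
prompts = {"utt_0001": np.random.randn(150, 80).astype(np.float32)}

# Write feats.ark/feats.scp for the VQ index sequences.
with WriteHelper("ark,scp:feats.ark,feats.scp") as writer:
    for utt_id, feat in vq_indices.items():
        writer(utt_id, feat)

# Write prompt.ark/prompt.scp for the acoustic prompts.
with WriteHelper("ark,scp:prompt.ark,prompt.scp") as writer:
    for utt_id, feat in prompts.items():
        writer(utt_id, feat)

# utt2num_frames is a plain "utt_id num_frames" listing, equivalent to what
# feat-to-len.py derives from feats.scp.
with open("utt2num_frames", "w") as f:
    for utt_id, feat in vq_indices.items():
        f.write(f"{utt_id} {feat.shape[0]}\n")

In practice, the repository's own feat-to-len.py command shown above remains the canonical way to produce utt2num_frames; the snippet only illustrates the file formats involved.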
Training Process
Setting up for training involves constructing the data and feature directories. The project provides a 16 kHz model variant, even though the original LibriTTS data is sampled at 24 kHz. After preparation, training on datasets such as LibriTTS is straightforward:
- Start training:
  bash run.sh --stage 2 --stop_stage 2
Adjust configurations as necessary via command line arguments.
Pre-trained Model Parameters
The project offers pre-trained models in two versions, catering to different waveform sampling rates. A provided CMVN file ensures consistent inference across different datasets by normalizing the input features.
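CMVN stands for cepstral mean and variance normalization: each feature dimension is standardized using mean and standard deviation statistics computed over the training data. As a rough, illustrative sketch (not the project's actual loading code, and with randomly generated features standing in for real ones), the operation looks like this:

import numpy as np

def apply_cmvn(feats: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize features per dimension: subtract the mean, divide by the std."""
    return (feats - mean) / np.maximum(std, 1e-8)  # guard against zero variance

# Illustrative values only: a (num_frames, feat_dim) feature matrix.
feats = np.random.randn(100, 80).astype(np.float32)

# In practice the statistics come from the provided CMVN file computed over the
# training set, not from the utterance being normalized.
mean = feats.mean(axis=0)
std = feats.std(axis=0)

normalized = apply_cmvn(feats, mean, std)  # roughly zero mean, unit variance per dimension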
Acknowledgement and References
The development of CTX-vec2wav draws on frameworks and existing projects, including:
- ESPnet: Used for network model integration.
- Kaldi: Incorporated for utility scripting.
- ParallelWaveGAN: Adapted for its training and decoding methodologies.
Citation
For academic and research references, please cite the following paper:
@article{du2023unicats,
title={UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding},
author={Du, Chenpeng and Guo, Yiwei and Shen, Feiyu and others},
journal={arXiv preprint arXiv:2306.07547},
year={2023}
}
By integrating context-awareness in text-to-speech systems, CTX-vec2wav represents a significant step toward more expressive and natural speech synthesis technologies.