Introduction to CTX-vec2wav: An Acoustic Context-Aware Vocoder
CTX-vec2wav is an acoustic context-aware vocoder introduced in the paper "UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding," published at AAAI 2024. The project aims to enhance how computers generate human-like speech by incorporating acoustic context, thereby producing more natural and contextually appropriate audio. It forms part of the larger UniCATS framework and is closely related to its companion component, CTX-txt2vec.
Environment Setup
To get started with CTX-vec2wav, developers need a Linux environment with Python 3.9. The environment can be set up with conda, a popular package and environment manager.
- Create a virtual environment and install the dependencies:
  conda create -n ctxv2w python=3.9
  conda activate ctxv2w
  pip install -r requirements.txt
- Configuration: run the provided script to set the necessary paths:
  source path.sh
- Permissions: ensure all utility scripts are executable:
  chmod +x utils/*
Developers also need the bash and perl commands installed on their Linux system.
Inference (Vocoding with Acoustic Context)
For registered utterances, inference is a streamlined process that uses a VQ (vector-quantized) index sequence together with an acoustic prompt to produce the speech output.
- Basic inference: to vocode the collected utterances, use the following command:
  bash run.sh --stage 3 --stop_stage 3
- Custom inference: for more tailored tasks, users can create a subset for testing:
  subset_data_dir.sh data/eval_all 200 data/eval_subset
  bash run.sh --stage 3 --stop_stage 3 --eval_set "eval_subset"
- Manual inference: users can prepare their own VQ index sequences and acoustic prompts by following these steps (a sketch of one way to build the .scp files appears after this list):
  - Create a feats.scp file mapping each utterance to its VQ index sequence.
  - Calculate the frame numbers using:
    feat-to-len.py scp:/path/to/feats.scp > /path/to/utt2num_frames
  - Prepare a prompt.scp file for the acoustic prompts.
  - Execute the decode process:
    decode.py \
        --sampling-rate 16000 \
        --feats-scp /path/to/feats.scp \
        --prompt-scp /path/to/prompt.scp \
        --num-frames /path/to/utt2num_frames \
        --config /path/to/config.yaml \
        --vq-codebook /path/to/codebook.npy \
        --checkpoint /path/to/checkpoint \
        --outdir /path/to/output/wav
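The .scp files above follow the Kaldi script-file convention: one "utterance-id value" pair per line, typically pointing into a companion .ark archive. Below is a minimal, hypothetical sketch of one way to write such pairs with the kaldiio package; the utterance ID, array shapes, and archive names are illustrative assumptions, and the exact feature layout that decode.py expects should be confirmed against the repository's data preparation scripts.

# Hypothetical preparation of decode.py inputs (sketch only).
# Assumes kaldiio is available; it is commonly used in ESPnet-style projects,
# but check requirements.txt to confirm.
import numpy as np
from kaldiio import WriteHelper

# Assumed layouts: VQ index sequence as a (num_frames, num_groups) array and
# acoustic prompt features as a (prompt_frames, feat_dim) array.
vq_indices = {"utt_0001": np.random.randint(0, 1024, size=(523, 2)).astype(np.float32)}
prompts = {"utt_0001": np.random.randn(150, 80).astype(np.float32)}

# Write feats.ark/feats.scp for the VQ index sequences.
with WriteHelper("ark,scp:feats.ark,feats.scp") as writer:
    for utt_id, feat in vq_indices.items():
        writer(utt_id, feat)

# Write prompt.ark/prompt.scp for the acoustic prompts.
with WriteHelper("ark,scp:prompt.ark,prompt.scp") as writer:
    for utt_id, feat in prompts.items():
        writer(utt_id, feat)

# utt2num_frames is a plain "utt_id num_frames" listing, equivalent to what
# feat-to-len.py derives from feats.scp.
with open("utt2num_frames", "w") as f:
    for utt_id, feat in vq_indices.items():
        f.write(f"{utt_id} {feat.shape[0]}\n")

In practice, the repository's own feat-to-len.py command shown above remains the canonical way to produce utt2num_frames; the snippet only illustrates the file formats involved.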
Training Process
Setting up for training involves constructing the data and feature directories. The project provides a 16 kHz model variant, even though the original LibriTTS data is sampled at 24 kHz. After preparation, training on datasets such as LibriTTS is straightforward:
- Start training:
  bash run.sh --stage 2 --stop_stage 2
Adjust configurations as necessary via command line arguments.
Pre-trained Model Parameters
The project offers pre-trained models in two versions, catering to different waveform sampling rates. A provided CMVN file ensures consistent inference across different datasets by normalizing the input features.
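CMVN stands for cepstral mean and variance normalization: each feature dimension is standardized using mean and standard deviation statistics computed over the training data. As a rough, illustrative sketch (not the project's actual loading code, and with randomly generated features standing in for real ones), the operation looks like this:

import numpy as np

def apply_cmvn(feats: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize features per dimension: subtract the mean, divide by the std."""
    return (feats - mean) / np.maximum(std, 1e-8)  # guard against zero variance

# Illustrative values only: a (num_frames, feat_dim) feature matrix.
feats = np.random.randn(100, 80).astype(np.float32)

# In practice the statistics come from the provided CMVN file computed over the
# training set, not from the utterance being normalized.
mean = feats.mean(axis=0)
std = feats.std(axis=0)

normalized = apply_cmvn(feats, mean, std)  # roughly zero mean, unit variance per dimension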
Acknowledgement and References
The development of CTX-vec2wav draws on frameworks and existing projects, including:
- ESPnet: Used for network model integration.
- Kaldi: Incorporated for utility scripting.
- ParallelWaveGAN: Adapted for its training and decoding methodologies.
Citation
For academic and research references, please cite the following paper:
@article{du2023unicats,
title={UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding},
author={Du, Chenpeng and Guo, Yiwei and Shen, Feiyu and others},
journal={arXiv preprint arXiv:2306.07547},
year={2023}
}
By integrating context-awareness in text-to-speech systems, CTX-vec2wav represents a significant step toward more expressive and natural speech synthesis technologies.