DiffGAN-TTS Project Overview
DiffGAN-TTS is a text-to-speech (TTS) system implemented in PyTorch that achieves high fidelity and efficiency with denoising diffusion GANs: an adversarially trained denoiser lets the model generate natural-sounding speech from text in only a few denoising steps.
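To make the idea concrete, below is a minimal, self-contained PyTorch sketch of one denoising diffusion GAN training step: a generator predicts the clean mel-spectrogram from a noised one, and a discriminator judges real versus generated denoised pairs. The toy network sizes, noise schedule, and loss weighting are illustrative assumptions and do not mirror the repository's actual architecture.

# Illustrative denoising-diffusion-GAN training step (all shapes/hyperparameters are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 4                                          # DiffGAN-TTS uses only a few diffusion steps
betas = torch.linspace(1e-4, 0.5, T)           # assumed noise schedule
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    # Forward diffusion: corrupt a clean mel-spectrogram x0 up to step t.
    a = alphas_cum[t].view(-1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

mel_dim, frames, batch = 80, 100, 8
G = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim))        # toy denoiser
D = nn.Sequential(nn.Linear(2 * mel_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # toy discriminator

x0 = torch.randn(batch, frames, mel_dim)       # stand-in for a ground-truth mel-spectrogram
t = torch.randint(1, T, (batch,))
x_t = q_sample(x0, t, torch.randn_like(x0))    # noised input at step t
x0_hat = G(x_t)                                # generator predicts the clean mel from x_t
x_prev_fake = q_sample(x0_hat, t - 1, torch.randn_like(x0))
x_prev_real = q_sample(x0, t - 1, torch.randn_like(x0))

# Discriminator sees (x_{t-1}, x_t) pairs; the fake pair is detached so the D update does not touch G.
d_real = D(torch.cat([x_prev_real, x_t], dim=-1))
d_fake_for_D = D(torch.cat([x_prev_fake.detach(), x_t], dim=-1))
d_fake_for_G = D(torch.cat([x_prev_fake, x_t], dim=-1))
d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake_for_D).mean()   # discriminator objective
g_loss = F.softplus(-d_fake_for_G).mean() + F.l1_loss(x0_hat, x0)       # adversarial + reconstruction

In a real training loop the generator and discriminator are updated in alternating passes, and the denoiser is conditioned on text, the diffusion step, and (for multi-speaker models) speaker information.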
Key Features of DiffGAN-TTS
- Naive Version of DiffGAN-TTS: The basic version of the DiffGAN-TTS model has been implemented and tested.
- Active Shallow Diffusion Mechanism: A more advanced two-stage variant of DiffGAN-TTS is also provided. It uses an active shallow diffusion mechanism, in which a pre-trained auxiliary acoustic model supplies a coarse mel prediction that the diffusion decoder then refines, improving quality while keeping the number of denoising steps small.
Audio Samples
To hear samples of the audio produced by DiffGAN-TTS, visit the demo page.
Getting Started
Installation
To use DiffGAN-TTS, install the required Python dependencies:
pip3 install -r requirements.txt
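As an optional sanity check (assuming PyTorch is among the pinned requirements, which it should be for a PyTorch implementation), you can confirm the core dependency imports and whether a GPU is visible:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"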
Model Inference
You can generate speech from text using the pre-trained models. For single-speaker TTS, run:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET
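For instance, with the LJSpeech single-speaker setup and the naive model variant, an invocation might look like the following (the restore step 300000 is a placeholder, not a published checkpoint):

python3 synthesize.py --text "The quick brown fox jumps over the lazy dog" --model naive --restore_step 300000 --mode single --dataset LJSpeech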
For multi-speaker TTS, additionally specify the desired speaker identity:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
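A hypothetical multi-speaker example with VCTK (both the speaker ID and the restore step are placeholder values):

python3 synthesize.py --text "The quick brown fox jumps over the lazy dog" --model naive --speaker_id 0 --restore_step 300000 --mode single --dataset VCTK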
Batch Processing
DiffGAN-TTS supports batch inference, allowing you to synthesize speech for multiple texts in a single operation:
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET
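For example, to synthesize every entry of the LJSpeech validation list (the restore step is again a placeholder):

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --model naive --restore_step 300000 --mode batch --dataset LJSpeech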
Controllability
You can control the pitch, volume, and speaking rate of the synthesized speech by passing ratio flags to synthesize.py. For example, the following command scales the predicted durations and energy to 0.8 of their predicted values, producing faster and quieter speech:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
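Pitch can be adjusted in the same way. Assuming the pitch ratio is exposed through a --pitch_control flag analogous to the two flags above (an assumption; check synthesize.py for the exact argument name), raising pitch by 20% would look like:

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2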
Training Your Model
Datasets
DiffGAN-TTS supports two primary datasets:
- LJSpeech: A single-speaker English dataset of 13,100 short audio clips (about 24 hours) read by one female speaker.
- VCTK: A multi-speaker English dataset with recordings from 110 speakers with various accents.
Preprocessing
First, run the preparation script:
python3 prepare_align.py --dataset DATASET
Phoneme-to-audio alignments are then obtained with the Montreal Forced Aligner (MFA) and placed in the preprocessed data directory (see the directory sketch below). For multi-speaker training, speaker embeddings are also needed; a pre-trained speaker embedder can be downloaded and used to generate them. Once alignments are in place, run the preprocessing script:
python3 preprocess.py --dataset DATASET
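As a rough sketch of what preprocessing produces, the layout below follows the FastSpeech2-style convention this kind of implementation typically builds on; treat the exact folder names as assumptions and verify them against preprocess.py:

preprocessed_data/DATASET/
  TextGrid/    # MFA alignments, one .TextGrid file per utterance
  mel/         # extracted mel-spectrograms
  pitch/       # pitch targets
  energy/      # energy targets
  duration/    # phoneme durations derived from the alignments
  train.txt    # training metadata list
  val.txt      # validation metadata list (used by batch inference above)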
Training Process
You can train three model variants: naive, aux, and shallow.
- Naive model training:
python3 train.py --model naive --dataset DATASET
- Auxiliary model for the shallow version: train the FastSpeech2-style auxiliary acoustic model that the shallow version builds on:
python3 train.py --model aux --dataset DATASET
- Shallow model training: leverage the pre-trained auxiliary checkpoint (restored via --restore_step) to train the shallow version; see the example after this list:
python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
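As a concrete, hypothetical example of the shallow pipeline on LJSpeech, where 200000 is simply a placeholder for the step at which the auxiliary checkpoint was saved:

python3 train.py --model aux --dataset LJSpeech
python3 train.py --model shallow --restore_step 200000 --dataset LJSpeech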
Monitoring and Notes
TensorBoard
You can monitor training progress via TensorBoard:
tensorboard --logdir output/log/DATASET
Additional Information
- The Variance Adaptor in DiffGAN-TTS uses speaker information for conditioning.
- The unconditional and conditional outputs of the JCU discriminator are averaged when computing each loss term.
Citation and References
The repository can be cited using the details provided in the "About" section of the project's main page. Key references include other related TTS projects and scientific publications on diffusion models and GANs.
This comprehensive overview offers a starting point for users and developers to understand and utilize the DiffGAN-TTS project, whether for experimentation, research, or practical applications in speech synthesis.