DiffGAN-TTS Project Overview
DiffGAN-TTS is a text-to-speech (TTS) system implemented in PyTorch that achieves high fidelity and efficiency with denoising diffusion GANs: an adversarially trained denoiser lets the model generate natural-sounding speech from text in only a few denoising steps.
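To make the idea concrete, below is a minimal, self-contained PyTorch sketch of one denoising diffusion GAN training step: a generator predicts the clean mel-spectrogram from a noised one, and a discriminator judges real versus generated denoised pairs. The toy network sizes, noise schedule, and loss weighting are illustrative assumptions and do not mirror the repository's actual architecture.

# Illustrative denoising-diffusion-GAN training step (all shapes/hyperparameters are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 4                                          # DiffGAN-TTS uses only a few diffusion steps
betas = torch.linspace(1e-4, 0.5, T)           # assumed noise schedule
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    # Forward diffusion: corrupt a clean mel-spectrogram x0 up to step t.
    a = alphas_cum[t].view(-1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

mel_dim, frames, batch = 80, 100, 8
G = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim))        # toy denoiser
D = nn.Sequential(nn.Linear(2 * mel_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # toy discriminator

x0 = torch.randn(batch, frames, mel_dim)       # stand-in for a ground-truth mel-spectrogram
t = torch.randint(1, T, (batch,))
x_t = q_sample(x0, t, torch.randn_like(x0))    # noised input at step t
x0_hat = G(x_t)                                # generator predicts the clean mel from x_t
x_prev_fake = q_sample(x0_hat, t - 1, torch.randn_like(x0))
x_prev_real = q_sample(x0, t - 1, torch.randn_like(x0))

# Discriminator sees (x_{t-1}, x_t) pairs; the fake pair is detached so the D update does not touch G.
d_real = D(torch.cat([x_prev_real, x_t], dim=-1))
d_fake_for_D = D(torch.cat([x_prev_fake.detach(), x_t], dim=-1))
d_fake_for_G = D(torch.cat([x_prev_fake, x_t], dim=-1))
d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake_for_D).mean()   # discriminator objective
g_loss = F.softplus(-d_fake_for_G).mean() + F.l1_loss(x0_hat, x0)       # adversarial + reconstruction

In a real training loop the generator and discriminator are updated in alternating passes, and the denoiser is conditioned on text, the diffusion step, and (for multi-speaker models) speaker information.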
Key Features of DiffGAN-TTS
- Naive Version of DiffGAN-TTS: The basic version of the DiffGAN-TTS model has been implemented and tested.
- Active Shallow Diffusion Mechanism: A more advanced two-stage variant of DiffGAN-TTS is also provided. It uses an active shallow diffusion mechanism, in which a pre-trained auxiliary acoustic model supplies a coarse mel prediction that the diffusion decoder then refines, improving quality while keeping the number of denoising steps small.
Audio Samples
To hear samples of the audio produced by DiffGAN-TTS, visit the demo page.
Getting Started
Installation
To use DiffGAN-TTS, install the required Python dependencies:
pip3 install -r requirements.txt
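As an optional sanity check (assuming PyTorch is among the pinned requirements, which it should be for a PyTorch implementation), you can confirm the core dependency imports and whether a GPU is visible:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"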
Model Inference
You can generate speech from text using the pre-trained models. For single-speaker TTS, run:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET
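For instance, with the LJSpeech single-speaker setup and the naive model variant, an invocation might look like the following (the restore step 300000 is a placeholder, not a published checkpoint):

python3 synthesize.py --text "The quick brown fox jumps over the lazy dog" --model naive --restore_step 300000 --mode single --dataset LJSpeech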
For multi-speaker TTS, additionally specify the desired speaker identity:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
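A hypothetical multi-speaker example with VCTK (both the speaker ID and the restore step are placeholder values):

python3 synthesize.py --text "The quick brown fox jumps over the lazy dog" --model naive --speaker_id 0 --restore_step 300000 --mode single --dataset VCTK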
Batch Processing
DiffGAN-TTS supports batch inference, allowing you to synthesize speech for multiple texts in a single operation:
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET
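For example, to synthesize every entry of the LJSpeech validation list (the restore step is again a placeholder):

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --model naive --restore_step 300000 --mode batch --dataset LJSpeech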
Controllability
You can control the pitch, volume, and speaking rate of the synthesized speech by passing ratio flags to synthesize.py. For example, the following command scales the predicted durations and energy to 0.8 of their predicted values, producing faster and quieter speech:
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
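Pitch can be adjusted in the same way. Assuming the pitch ratio is exposed through a --pitch_control flag analogous to the two flags above (an assumption; check synthesize.py for the exact argument name), raising pitch by 20% would look like:

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2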
Training Your Model
Datasets
DiffGAN-TTS supports two primary datasets:
- LJSpeech: A single-speaker English dataset of 13,100 short audio clips (about 24 hours) read by one female speaker.
- VCTK: A multi-speaker English dataset with recordings from 110 speakers with various accents.
Preprocessing
First, run the preparation script:
python3 prepare_align.py --dataset DATASET
Phoneme-to-audio alignments are then obtained with the Montreal Forced Aligner (MFA) and placed in the preprocessed data directory (see the directory sketch below). For multi-speaker training, speaker embeddings are also needed; a pre-trained speaker embedder can be downloaded and used to generate them. Once alignments are in place, run the preprocessing script:
python3 preprocess.py --dataset DATASET
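As a rough sketch of what preprocessing produces, the layout below follows the FastSpeech2-style convention this kind of implementation typically builds on; treat the exact folder names as assumptions and verify them against preprocess.py:

preprocessed_data/DATASET/
  TextGrid/    # MFA alignments, one .TextGrid file per utterance
  mel/         # extracted mel-spectrograms
  pitch/       # pitch targets
  energy/      # energy targets
  duration/    # phoneme durations derived from the alignments
  train.txt    # training metadata list
  val.txt      # validation metadata list (used by batch inference above)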
Training Process
You can train three model variants: naive, aux, and shallow.
- Naive model training:
python3 train.py --model naive --dataset DATASET
- Auxiliary model for the shallow version: train the FastSpeech2-style auxiliary acoustic model that the shallow version builds on:
python3 train.py --model aux --dataset DATASET
- Shallow model training: leverage the pre-trained auxiliary checkpoint (restored via --restore_step) to train the shallow version; see the example after this list:
python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
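As a concrete, hypothetical example of the shallow pipeline on LJSpeech, where 200000 is simply a placeholder for the step at which the auxiliary checkpoint was saved:

python3 train.py --model aux --dataset LJSpeech
python3 train.py --model shallow --restore_step 200000 --dataset LJSpeech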
Monitoring and Notes
TensorBoard
You can monitor training progress via TensorBoard:
tensorboard --logdir output/log/DATASET
Additional Information
- The Variance Adaptor in DiffGAN-TTS uses speaker information for conditioning.
- The unconditional and conditional outputs of the JCU discriminator are averaged when computing each loss term.
Citation and References
The repository can be cited using the details provided in the "About" section of the project's main page. Key references include other related TTS projects and scientific publications on diffusion models and GANs.
This comprehensive overview offers a starting point for users and developers to understand and utilize the DiffGAN-TTS project, whether for experimentation, research, or practical applications in speech synthesis.