WaveGrad: High-Fidelity Waveform Generation
WaveGrad is a vocoder developed by the Google Brain team to generate high-fidelity audio waveforms. Unlike traditional vocoders built on Generative Adversarial Networks (GANs), normalizing flows, or classical autoregressive techniques, WaveGrad applies Denoising Diffusion Probabilistic Models (DDPMs), drawing on Langevin dynamics and score matching, and reaches high audio quality in very few refinement iterations.
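For orientation, here is a minimal sketch of the DDPM-style iterative refinement behind such a sampler: starting from Gaussian noise, the network's noise prediction is repeatedly subtracted according to a fixed noise schedule. All names, the conditioning signature, and the hop length are illustrative assumptions, not the repository's API; WaveGrad itself conditions the network on the continuous noise level rather than a step index.

```python
import torch

def ddpm_sample(model, mel, betas, hop_length=300):
    """Minimal DDPM-style iterative refinement (sketch, hypothetical API)."""
    alphas = 1.0 - betas
    alphas_cum = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise shaped like the target waveform.
    y = torch.randn(1, mel.shape[-1] * hop_length)

    for n in reversed(range(len(betas))):
        noise_scale = alphas_cum[n].sqrt()
        eps = model(y, mel, noise_scale)                 # predicted noise
        coef = (1 - alphas[n]) / (1 - alphas_cum[n]).sqrt()
        y = (y - coef * eps) / alphas[n].sqrt()
        if n > 0:                                        # add noise except at the last step
            y = y + betas[n].sqrt() * torch.randn_like(y)
    return y.clamp(-1.0, 1.0)
```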
Status and Features
WaveGrad has reached several milestones in its development:
- Documented API: Comprehensive documentation is available to guide users through the setup and usage.
- High-Fidelity Generation: Produces high-quality audio output.
- Multi-Iteration Inference Support: Particularly stable and effective for lower iteration counts.
- Efficient Training: Supports mixed-precision for stable and fast training.
- Distributed Training: Able to run on multiple GPUs simultaneously.
- Single GPU Training: Can also run on a single 12GB GPU with a batch size of 96.
- Command Line Interface (CLI) Inference: Users can perform inference through command line executions.
- Flexible Architecture: Adaptable to various datasets and use cases.
- Fast Inference: Inference with 100 iterations or fewer is faster than real time on an RTX 2080 Ti; 6-iteration inference is faster than the speed reported in the original paper.
- Parallel Grid Search: Efficiently finds the best noise schedule for different iteration counts.
- Pretrained Checkpoints: Available for quick deployment, trained on the 22 kHz LJSpeech dataset.
Real-Time Factor (RTF)
WaveGrad has 15,810,401 parameters. RTF is the ratio of synthesis time to the duration of the generated audio (values below 1.0 are faster than real time); measured values vary across hardware:
| Iterations | RTX 2080 Ti | Tesla K80 | Intel Xeon 2.3 GHz |
|---|---|---|---|
| 1000 | 9.59 | - | - |
| 100 | 0.94 | 5.85 | - |
| 50 | 0.45 | 2.92 | - |
| 25 | 0.22 | 1.45 | - |
| 12 | 0.10 | 0.69 | 4.55 |
| 6 | 0.04 | 0.33 | 2.09 |
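A minimal sketch of how such a measurement can be taken (the `synthesize` hook is hypothetical):

```python
import time

def real_time_factor(synthesize, mel, sample_rate=22050):
    """Wall-clock synthesis time divided by audio duration (RTF < 1 is faster than real time)."""
    start = time.perf_counter()
    audio = synthesize(mel)      # hypothetical vocoder call returning a 1-D waveform
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)
```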
Installation
To get started with WaveGrad, follow these steps:
- Clone the repository:

```
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
```

- Install the necessary dependencies:

```
pip install -r requirements.txt
```
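Before training, it can be useful to confirm that PyTorch is installed and sees your GPU:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```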
Training WaveGrad
Preparing Data:
- Create training and testing file lists similar to those provided in the `filelists` folder.
- Customize a configuration file in the `configs` folder. If you change the STFT hop length, make sure the model's upsampling factors still multiply to it, as checked in the sketch below.
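The reason for this constraint is that the network expands each mel frame to audio samples through a stack of upsampling blocks, so the product of the upsampling factors must equal the hop length. A quick sanity check (the specific values are illustrative assumptions, not necessarily the repository defaults):

```python
from math import prod

hop_length = 300                      # assumed STFT hop length
upsample_factors = [5, 5, 3, 2, 2]    # illustrative factors; product must equal hop_length

assert prod(upsample_factors) == hop_length, (
    "upsampling factors must multiply to the STFT hop length"
)
```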
Single and Distributed GPU Training:
- Edit `runs/train.sh` to configure GPUs and the paths to your configuration files. If multiple GPUs are set, training will run in distributed mode.
- Execute:

```
sh runs/train.sh
```
Tracking Progress:
- Use TensorBoard to monitor training with:

```
tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER
```
Tuning Noise Schedules:
For low iteration counts (such as 6 or 7), run the noise schedule grid search in `notebooks/inference.ipynb`. This step finds a schedule that preserves audio quality at a small iteration budget.
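Conceptually, the grid search synthesizes audio under many candidate schedules and keeps the one with the lowest error against reference spectrograms. A sequential sketch of that idea, with hypothetical `synthesize` and `score` hooks (the repository parallelizes the actual search; see the notebook):

```python
import itertools
import torch

def grid_search_schedule(synthesize, score, candidate_betas, n_iters=6):
    """Try every n_iters-long combination of candidate betas and keep the best.

    `synthesize(schedule)` runs inference with a given noise schedule;
    `score(audio)` returns an error, e.g. L1 distance between the mel of
    the output and a reference mel. Both hooks are hypothetical.
    """
    best_schedule, best_score = None, float("inf")
    for combo in itertools.product(candidate_betas, repeat=n_iters):
        schedule = torch.tensor(sorted(combo))   # schedules are typically increasing
        s = score(synthesize(schedule))
        if s < best_score:
            best_schedule, best_score = schedule, s
    return best_schedule, best_score
```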
Inference
Command Line Interface (CLI):
Prepare mel-spectrograms in a designated folder, create a file list of their paths, and run the following command with customized arguments:

```
sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"
```
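Whatever tool extracts the mel-spectrograms, its parameters must mirror the training configuration (sample rate, FFT size, hop length, number of mel bins), and the storage format must match what the inference script expects. A hedged example using torchaudio, with assumed parameter values and an assumed `torch.save` format; check `notebooks/inference.ipynb` for the exact expectations:

```python
import torch
import torchaudio

wav, sr = torchaudio.load("audio.wav")          # expects the training sample rate, e.g. 22050 Hz
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=300, n_mels=80,  # assumed values; mirror your config
)
mel = torch.log(mel_fn(wav).clamp(min=1e-5))    # log-compressed mel, a common convention
torch.save(mel, "audio_mel.pt")                 # saved path then goes into the mel filelist
```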
Jupyter Notebook:
Detailed inference instructions and noise schedule settings are provided in `notebooks/inference.ipynb`.
Additional Resources
Generated Audio Samples:
Examples of audio output at different iteration counts are available in the `generated_samples` folder. With a tuned noise schedule, 6-iteration output rivals the quality of 1000-iteration output.
Pretrained Model:
A pretrained model, trained on the 22 kHz LJSpeech dataset, is available for download via the link provided in the repository.
WaveGrad represents a significant advancement in waveform generation, combining efficient training and inference with high-quality audio output. With its flexible architecture and ability to utilize distributed computing, it is poised to be a valuable tool in the field of audio processing.