Introduction to WaveGrad
WaveGrad is a neural vocoder introduced by researchers at Google Brain. This project implements it, transforming log-scaled mel spectrograms into high-quality audio waveforms through iterative refinement: synthesis starts from random noise that is progressively denoised, conditioned on the spectrogram. WaveGrad is notable for combining fast synthesis with high audio quality.
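To make the idea of iterative refinement concrete, here is a minimal, self-contained sketch of a DDPM-style sampling loop in the spirit of WaveGrad. It is illustrative only: the model signature, the 300x hop length, and the update rule are assumptions for the sketch, not the package's actual implementation.

import torch

@torch.no_grad()
def refine(model, spectrogram, betas, hop=300):
    # Illustrative WaveGrad-style sampler. `model` is assumed to predict the
    # noise present in a noisy waveform, conditioned on the mel spectrogram
    # and the current noise level; `betas` is the noise schedule.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    n, _, w = spectrogram.shape                # [N, C, W] mel input
    audio = torch.randn(n, w * hop)            # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(audio, spectrogram, alpha_bars[t].sqrt())
        # Remove the predicted noise component (posterior mean update).
        audio = (audio - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                              # re-inject noise on all but the last step
            audio += betas[t].sqrt() * torch.randn_like(audio)
    return audio.clamp(-1.0, 1.0)

Each entry in betas costs one forward pass through the network, which is why a shorter, well-tuned noise schedule (discussed below) translates directly into faster synthesis.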
Project Overview
WaveGrad is based on the research paper WaveGrad: Estimating Gradients for Waveform Generation. The project handles the final stage of audio generation, turning spectrograms into waveforms, a step essential to applications such as text-to-speech, digital assistants, and music production.
Key Features
As of October 15, 2020, WaveGrad has several important features:
- Stable Training: Supports audio sampling rates of 22 kHz and 24 kHz.
- High-Quality Synthesis: Ensures superior audio output quality.
- Mixed-Precision Training: Uses 16-bit floating-point arithmetic to reduce memory use and speed up training.
- Multi-GPU Training: Comes with the capability to leverage multiple GPUs, speeding up the training process.
- Custom Noise Schedule: Enables faster inference by shortening the noise schedule used during sampling (see below).
- Inference Options: Offers both command-line and programmatic inference APIs to suit different user needs.
- Convenience: Available as a package on PyPI, with handy examples and pretrained models for easy use.
Audio Samples and Pretrained Models
Users can explore the 24 kHz audio samples to evaluate output quality. A 24 kHz pretrained model is also available, so the model can be deployed immediately without going through the full training process.
Installation and Training
Installing WaveGrad is straightforward. It can be done using pip:
pip install wavegrad
Alternatively, users can clone the project from GitHub and set it up manually:
git clone https://github.com/lmnt-com/wavegrad.git
cd wavegrad
pip install .
Before training, users need to prepare a dataset of 16-bit mono .wav files; popular choices include LJSpeech and VCTK. Training can be monitored with TensorBoard, and intelligible speech typically emerges after roughly 20,000 training steps.
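Under the assumption that the package exposes module entry points for preprocessing and training as described in the project README, the workflow looks roughly like this (paths are placeholders, and the exact flags should be verified against the docs):

# Compute spectrograms for the training .wav files.
python -m wavegrad.preprocess /path/to/dir/containing/wavs
# Train; the model directory receives checkpoints and TensorBoard logs.
python -m wavegrad /path/to/model/dir /path/to/dir/containing/wavs
# Monitor training progress.
tensorboard --logdir /path/to/model/dir --bind_all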
Inference APIs and CLI
WaveGrad provides both programmatic and command-line inference, converting spectrograms into waveforms returned as GPU tensors. Users can also supply custom noise schedules to trade off speed and quality. Basic programmatic usage looks like this:
from wavegrad.inference import predict as wavegrad_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # obtain a spectrogram tensor in [N,C,W] format (batch, mel bins, frames)
audio, sample_rate = wavegrad_predict(spectrogram, model_dir)
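If a custom noise schedule has been generated (see the noise schedule section below), it can be passed along at inference time. The sketch below assumes wavegrad_predict accepts a params dict with a 'noise_schedule' entry holding the saved schedule; treat the parameter name as an assumption to confirm against the project docs.

import numpy as np
from wavegrad.inference import predict as wavegrad_predict

# Load a schedule produced by the noise schedule search (path is a placeholder).
params = {'noise_schedule': np.load('/path/to/noise_schedule.npy')}
model_dir = '/path/to/model/dir'
spectrogram = ...  # spectrogram tensor in [N,C,W] format
audio, sample_rate = wavegrad_predict(spectrogram, model_dir, params=params)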
For command-line inference:
python -m wavegrad.inference /path/to/model /path/to/spectrogram -o output.wav
Optimizing with Noise Schedule
With a well-chosen noise schedule, WaveGrad can synthesize high-quality audio in as few as six refinement iterations, with no retraining required. The repository provides a script that searches for a schedule tailored to a given dataset.
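As a sketch, the search script would be invoked as a module on a trained model and a reference dataset; the module name and argument order here are assumptions to check against the repository:

python -m wavegrad.noise_schedule /path/to/model/dir /path/to/dir/containing/wavs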
References and Acknowledgements
WaveGrad builds on the ideas in WaveGrad: Estimating Gradients for Waveform Generation and Denoising Diffusion Probabilistic Models. Links to these papers and to related code libraries are available in the project's documentation. WaveGrad represents a step forward in vocoder technology, offering both efficiency and quality.