Introducing DiffWave: An Advanced Neural Vocoder and Waveform Synthesizer
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It synthesizes audio by iteratively refining Gaussian noise into intelligible speech, and the synthesis can be controlled with a conditioning signal such as a log-scaled Mel spectrogram. The model architecture and training procedure are described in the paper "DiffWave: A Versatile Diffusion Model for Audio Synthesis."
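To make the iterative refinement concrete, here is a minimal sketch of a generic DDPM-style reverse diffusion loop. It is not DiffWave's actual implementation; the model, betas, and length arguments are placeholders for illustration, and model is assumed to predict the noise present in the waveform at a given step.

import torch

# Minimal sketch of DDPM-style iterative refinement (illustrative only, not DiffWave's code).
# betas: 1-D tensor holding the noise schedule.
# model(x, t, spectrogram) is assumed to predict the noise present in x at diffusion step t.
def reverse_diffusion(model, spectrogram, betas, length):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, length)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]), spectrogram)
        # Remove the noise predicted for this step.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject noise except at the final step
    return x  # the refined waveform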
Recent Updates
November 9, 2021
- Added the capability for unconditional waveform synthesis, thanks to the contributions of Andrechang.
April 1, 2021
- Introduced a fast sampling algorithm based on the third version of the DiffWave paper.
October 14, 2020
- Released a new pre-trained model that has been trained over 1 million steps.
- Updated audio samples illustrating the output from the new model.
Project Status (as of November 9, 2021)
DiffWave has achieved several milestones ensuring its robustness and usability:
- Fast inference procedures
- Stable training processes
- High-quality synthesis output
- Mixed-precision and multi-GPU training support
- Command-line and programmatic inference APIs
- Availability on PyPI as an installable package
- Provision of audio samples and pre-trained models
- Unconditional waveform synthesis capability
Special thanks to Zhifeng Kong, the lead author of DiffWave, for his invaluable guidance and bug fixes.
Experience DiffWave
You can explore several audio samples at 22.05 kHz to get a feel for the quality of audio that DiffWave can produce.
Access Pretrained Models
A pre-trained model is available that synthesizes speech with a real-time factor of 0.87, meaning it takes about 0.87 seconds to generate one second of audio (lower is faster). The model was trained for 1,000,578 steps with the default parameters on 4 NVIDIA 1080Ti GPUs, using the LJSpeech dataset (excluding samples labeled LJ001* and LJ002*).
Installation
To install DiffWave, users can choose either pip or cloning from GitHub:
Installation via pip:
pip install diffwave
Installation from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
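After either installation method, an optional sanity check is to confirm that the package imports cleanly:

python -c "from diffwave.inference import predict"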
Training Guidance
Before training, prepare a dataset of 16-bit mono .wav files; the LJSpeech dataset is a common choice. The expected sample rate is 22.05 kHz by default; to use a different rate, adjust it in params.py.
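If your source audio is not already 16-bit mono at the expected sample rate, the following is a minimal sketch of one way to convert a file with torchaudio (a recent version is assumed); the file paths are placeholders, and this step is unnecessary for datasets such as LJSpeech that already match the format.

import torchaudio

# Convert an arbitrary wav file to 16-bit mono at 22.05 kHz (illustrative sketch).
waveform, orig_rate = torchaudio.load('input.wav')
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
waveform = torchaudio.transforms.Resample(orig_rate, 22050)(waveform)
torchaudio.save('output_22k.wav', waveform, 22050, encoding='PCM_S', bits_per_sample=16)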
Once the dataset is ready, execute the following commands:
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
Simultaneously, monitor the training progress using:
tensorboard --logdir /path/to/model/dir --bind_all
After approximately 8,000 steps (roughly 1.5 hours on a 2080 Ti), intelligible, albeit noisy, speech should be audible.
Multi-GPU Training
DiffWave uses all available GPUs in parallel by default. To restrict training to specific GPUs, set the CUDA_VISIBLE_DEVICES environment variable before running the training module, as shown below.
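For example, to train on only the first two GPUs (the device indices are illustrative):

export CUDA_VISIBLE_DEVICES=0,1
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs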
Using the Inference API
Here's a basic example of using the inference API to synthesize audio:
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # acquire your spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
# audio is a GPU tensor in [N,T] format.
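As a fuller sketch, the example below loads a spectrogram from a NumPy file and writes the synthesized audio to disk. The file name and the assumption that the spectrogram is stored as a [C, W] NumPy array are illustrative; adapt them to however your spectrograms are produced (e.g., by diffwave.preprocess).

import numpy as np
import torch
import torchaudio
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'

# Load a Mel spectrogram stored as a NumPy array (assumed shape [C, W]).
spectrogram = torch.from_numpy(np.load('/path/to/spectrogram.npy'))
if spectrogram.dim() == 2:
    spectrogram = spectrogram.unsqueeze(0)  # add a batch dimension -> [N, C, W]

audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N, T] format; move it to the CPU before saving.
torchaudio.save('output.wav', audio.cpu(), sample_rate)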
Command-Line Inference
To perform inference via the command line, use:
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
Academic References
DiffWave is an ongoing project with its foundations in the following references:
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Denoising Diffusion Probabilistic Models
- Code for Denoising Diffusion Probabilistic Models
DiffWave offers a powerful and flexible solution for audio synthesis tasks, appealing to both researchers and practitioners in the field.