Introducing DiffWave: An Advanced Neural Vocoder and Waveform Synthesizer
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It synthesizes audio by iteratively refining Gaussian noise into intelligible speech, and the synthesis can be controlled with a conditioning signal such as a log-scaled Mel spectrogram. The model architecture and training procedure are described in the paper "DiffWave: A Versatile Diffusion Model for Audio Synthesis."
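To make the iterative refinement concrete, here is a minimal sketch of a generic DDPM-style reverse diffusion loop. It is not DiffWave's actual implementation; the model, betas, and length arguments are placeholders for illustration, and model is assumed to predict the noise present in the waveform at a given step.

import torch

# Minimal sketch of DDPM-style iterative refinement (illustrative only, not DiffWave's code).
# betas: 1-D tensor holding the noise schedule.
# model(x, t, spectrogram) is assumed to predict the noise present in x at diffusion step t.
def reverse_diffusion(model, spectrogram, betas, length):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, length)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]), spectrogram)
        # Remove the noise predicted for this step.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject noise except at the final step
    return x  # the refined waveform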
Recent Updates
November 9, 2021
- Added the capability for unconditional waveform synthesis, thanks to the contributions of Andrechang.
April 1, 2021
- Introduced a fast sampling algorithm based on the third version of the DiffWave paper.
October 14, 2020
- Released a new pre-trained model that has been trained over 1 million steps.
- Updated audio samples illustrating the output from the new model.
Project Status (as of November 9, 2021)
DiffWave has achieved several milestones ensuring its robustness and usability:
- Fast inference procedures
- Stable training processes
- High-quality synthesis output
- Mixed-precision and multi-GPU training support
- Command-line and programmatic inference APIs
- Availability on PyPI as an installable package
- Provision of audio samples and pre-trained models
- Unconditional waveform synthesis capability
Special thanks to Zhifeng Kong, the lead author of DiffWave, for his invaluable guidance and bug fixes.
Experience DiffWave
You can explore several audio samples at 22.05 kHz to get a feel for the quality of audio that DiffWave can produce.
Access Pretrained Models
A pre-trained model is available that synthesizes speech with a real-time factor of 0.87, meaning it takes about 0.87 seconds to generate one second of audio (lower is faster). The model was trained for 1,000,578 steps with the default parameters on 4 NVIDIA 1080Ti GPUs, using the LJSpeech dataset (excluding samples labeled LJ001* and LJ002*).
Installation
To install DiffWave, users can choose either pip or cloning from GitHub:
Installation via pip:
pip install diffwave
Installation from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
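After either installation method, an optional sanity check is to confirm that the package imports cleanly:

python -c "from diffwave.inference import predict"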
Training Guidance
Before training, prepare a dataset of 16-bit mono .wav files; the LJSpeech dataset is a common choice. The expected sample rate is 22.05 kHz by default; to use a different rate, adjust it in params.py.
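If your source audio is not already 16-bit mono at the expected sample rate, the following is a minimal sketch of one way to convert a file with torchaudio (a recent version is assumed); the file paths are placeholders, and this step is unnecessary for datasets such as LJSpeech that already match the format.

import torchaudio

# Convert an arbitrary wav file to 16-bit mono at 22.05 kHz (illustrative sketch).
waveform, orig_rate = torchaudio.load('input.wav')
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
waveform = torchaudio.transforms.Resample(orig_rate, 22050)(waveform)
torchaudio.save('output_22k.wav', waveform, 22050, encoding='PCM_S', bits_per_sample=16)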
Once the dataset is ready, execute the following commands:
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
Simultaneously, monitor the training progress using:
tensorboard --logdir /path/to/model/dir --bind_all
After approximately 8,000 steps (roughly 1.5 hours on a 2080 Ti), intelligible, albeit noisy, speech should be audible.
Multi-GPU Training
DiffWave uses all available GPUs in parallel by default. To restrict training to specific GPUs, set the CUDA_VISIBLE_DEVICES environment variable before running the training module, as shown below.
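For example, to train on only the first two GPUs (the device indices are illustrative):

export CUDA_VISIBLE_DEVICES=0,1
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs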
Using the Inference API
Here's a basic example of using the inference API to synthesize audio:
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # acquire your spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
# audio is a GPU tensor in [N,T] format.
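As a fuller sketch, the example below loads a spectrogram from a NumPy file and writes the synthesized audio to disk. The file name and the assumption that the spectrogram is stored as a [C, W] NumPy array are illustrative; adapt them to however your spectrograms are produced (e.g., by diffwave.preprocess).

import numpy as np
import torch
import torchaudio
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'

# Load a Mel spectrogram stored as a NumPy array (assumed shape [C, W]).
spectrogram = torch.from_numpy(np.load('/path/to/spectrogram.npy'))
if spectrogram.dim() == 2:
    spectrogram = spectrogram.unsqueeze(0)  # add a batch dimension -> [N, C, W]

audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N, T] format; move it to the CPU before saving.
torchaudio.save('output.wav', audio.cpu(), sample_rate)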
Command-Line Inference
To perform inference via the command line, use:
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
Academic References
DiffWave is an ongoing project with its foundations in the following references:
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Denoising Diffusion Probabilistic Models
- Code for Denoising Diffusion Probabilistic Models
DiffWave offers a powerful and flexible solution for audio synthesis tasks, appealing to both researchers and practitioners in the field.