UnivNet: Enhancing Audio Experiences with Advanced Neural Vocoding
UnivNet, proposed by Jang et al. from Kakao, is a neural vocoder designed to generate high-fidelity waveforms from mel-spectrograms. It belongs to a growing family of neural networks aimed at improving both the quality and speed of audio synthesis. Built on the PyTorch framework, this implementation of UnivNet stands out for its performance and efficiency.
Key Features of UnivNet
UnivNet employs multi-resolution spectrogram discriminators that enable it to produce superior audio results. Compared to other neural vocoders, particularly GAN-based models like HiFi-GAN, UnivNet offers notable advantages:
- Objective and Subjective Excellence: It surpasses HiFi-GAN in both objective metrics and subjective evaluations, meaning it not only scores better on technical tests but also sounds better to human listeners.
- Speed Efficiency: UnivNet is about 1.5 times faster during inference than HiFi-GAN, making it a more efficient choice for real-time applications.
Moreover, UnivNet uses the same mel-spectrogram function as the official HiFi-GAN implementation and works seamlessly with NVIDIA's Tacotron2. Users can adjust its audio hyperparameters to match their acoustic model.
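As a rough illustration of the kind of log-mel extraction such vocoders expect, here is a self-contained NumPy sketch. The parameter values (24 kHz sampling rate, 1024-point FFT, 256-sample hop, 100 mel bands) and the clamped log compression are assumptions chosen for illustration, not the repository's exact function:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=24000, n_fft=1024, n_mels=100, fmin=0.0, fmax=12000.0):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)   # rising slope
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)   # falling slope
    return fb

def log_mel_spectrogram(wav, sr=24000, n_fft=1024, hop=256, win=1024, n_mels=100):
    # Reflect-pad, frame with a Hann window, take the magnitude STFT.
    window = np.hanning(win)
    pad = (n_fft - hop) // 2
    wav = np.pad(wav, (pad, pad), mode="reflect")
    n_frames = 1 + (len(wav) - win) // hop
    frames = np.stack([wav[i * hop:i * hop + win] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (frames, n_fft//2 + 1)
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T      # project onto mel bands
    return np.log(np.clip(mel, 1e-5, None))               # clamped log compression
```

The important point is that the acoustic model and vocoder must share this function and its hyperparameters exactly, or the vocoder will be fed features it was never trained on.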
Data Preparation and Training
To train UnivNet, users should prepare a dataset of audio files, preferably at a sampling rate of 24,000 Hz. The LibriTTS dataset, particularly its train-clean-360 split, is recommended. Metadata should follow the filelist format used by NVIDIA's Tacotron2.
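In the Tacotron2 filelist convention, each metadata line pairs a wav path with its transcript, separated by `|`. A minimal parser sketch (the file names below are made-up LibriTTS-style examples, not real entries):

```python
def parse_metadata(lines):
    """Parse Tacotron2-style 'wav_path|transcript' filelist lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        wav_path, text = line.split("|", 1)  # split only on the first '|'
        pairs.append((wav_path, text))
    return pairs

example = [
    "wavs/100_121669_000001_000000.wav|It befell in the month of May.",
    "wavs/100_121669_000002_000000.wav|Queen Guenever called unto her knights.",
]
print(parse_metadata(example)[0][0])  # wavs/100_121669_000001_000000.wav
```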
The training process involves setting up configuration files that define the paths to the training and validation data. Users can switch between the UnivNet-c16 and UnivNet-c32 models by adjusting the channel size in the configuration.
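A hypothetical configuration excerpt showing where such paths and the channel size would live; the key names and paths here are illustrative and may not match the repository's actual config schema:

```yaml
# Illustrative excerpt only -- key names and paths are assumptions.
data:
  train_dir: 'datasets/LibriTTS/train-clean-360'
  train_meta: 'metadata/libritts_train.txt'   # Tacotron2-style filelist
  val_meta: 'metadata/libritts_val.txt'
gen:
  channel_size: 32   # 32 for UnivNet-c32, 16 for UnivNet-c16
```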
Inference and Pre-trained Models
Once trained, UnivNet can synthesize high-quality audio from mel-spectrogram inputs. Pre-trained models are available via Google Drive links, with separate UnivNet-c16 and UnivNet-c32 checkpoints trained on LibriTTS data.
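The inference step might look roughly like the following PyTorch sketch. The `vocode` helper, the noise dimension, and the generator call signature are placeholders for illustration (a stub generator stands in for a real checkpoint); they are not the repository's actual API:

```python
import torch

def vocode(generator, mel):
    """Run a trained generator on a mel batch of shape (1, n_mels, frames)."""
    generator.eval()
    with torch.no_grad():
        # UnivNet conditions generation on a noise sequence alongside the mel.
        noise = torch.randn(1, 64, mel.size(2))
        audio = generator(mel, noise)  # -> (1, 1, frames * hop_length)
    return audio.squeeze().cpu().numpy()

class DummyGenerator(torch.nn.Module):
    """Stand-in for a real checkpoint: emits silence of the right length."""
    hop = 256
    def forward(self, mel, noise):
        b, _, t = mel.shape
        return torch.zeros(b, 1, t * self.hop)
```

In practice the generator would be restored from one of the pre-trained checkpoints before calling a helper like this.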
Results and Performance
UnivNet demonstrates impressive performance metrics. It showcases higher PESQ scores and lower RMSE values than HiFi-GAN, indicating better perceived sound quality and reduced error in waveform reproduction. The model is also compact, with UnivNet-c16 being notably lightweight.
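To make the RMSE comparison concrete, here is a minimal NumPy sketch of a spectral RMSE between a reference and a generated waveform; the paper's exact RMSE definition may differ, so treat this as an illustration of the idea rather than the evaluation code:

```python
import numpy as np

def log_spectral_rmse(ref, deg, n_fft=1024, hop=256):
    """RMSE between log-magnitude spectrograms of two equal-length waveforms."""
    def logspec(x):
        window = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] * window
                           for i in range(n_frames)])
        return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    a, b = logspec(ref), logspec(deg)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

A lower value means the generated spectrum tracks the reference more closely; PESQ, by contrast, models perceived quality and is higher-is-better.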
Contributions and Licensing
This project was implemented and improved by Kang-wook Kim and Wonbin Jung at MINDsLab Inc., with acknowledgments to several collaborators who furthered its development. The code is open source under the BSD 3-Clause License, so it can be freely used and modified.
UnivNet represents a leap forward in audio synthesis technology, offering a powerful tool for developers, researchers, and audio engineers to create immersive and high-fidelity audio content efficiently. By leveraging its advanced features and streamlined workflow, users can push the boundaries of what's possible in sound design and audio production.