Vocos: Bridging the Gap in Audio Synthesis
Vocos is a neural vocoder that uses Generative Adversarial Networks (GANs) for high-quality audio synthesis. Unlike traditional neural vocoders that generate waveform samples directly in the time domain, Vocos predicts spectral coefficients, which are converted back to audio with the inverse short-time Fourier transform (ISTFT). This makes reconstruction fast while keeping the output natural-sounding.
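Loosely speaking, the final step can be illustrated with PyTorch's built-in ISTFT: given a complex spectrogram, a single call recovers the waveform. The shapes and STFT parameters below are illustrative assumptions (with random values standing in for model predictions), not Vocos's actual configuration:
import torch
# Illustrative STFT parameters, not Vocos's actual configuration
n_fft, hop_length = 1024, 256
freq_bins, frames = n_fft // 2 + 1, 200
# Pretend a model predicted a magnitude and a phase per frequency bin and frame
magnitude = torch.rand(1, freq_bins, frames)
phase = torch.rand(1, freq_bins, frames) * 2 * torch.pi
spectrogram = magnitude * torch.exp(1j * phase)  # complex spectral coefficients
# One inverse STFT turns the coefficients back into a waveform
waveform = torch.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
)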
Installation
For those interested in using Vocos only for audio inference, installation is straightforward:
pip install vocos
If you plan to train the model, additional dependencies can be installed with:
pip install "vocos[train]"
Usage
Reconstructing Audio from Mel-Spectrogram
With Vocos, synthesizing audio from a mel-spectrogram is seamless. Here’s a simple example in Python:
import torch
from vocos import Vocos

# Load the pretrained mel-spectrogram model (24 kHz output)
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Random placeholder input: (batch, mel bins, frames)
mel = torch.randn(1, 100, 256)
audio = vocos.decode(mel)
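The decoder returns a tensor of shape (batch, samples) at 24 kHz, so with a batch of one it can be written straight to disk; the output filename below is just an example:
import torchaudio
# A (1, T) tensor is already a valid mono waveform for torchaudio
torchaudio.save("reconstructed.wav", audio, sample_rate=24000)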
Copy-Synthesis from a File
Vocos can also reconstruct audio directly from a file. Here’s how to do it:
import torchaudio
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
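The one-line call above bundles two steps, feature extraction and decoding. A sketch of the equivalent two-step form, handy if you want to inspect or modify the mel features in between:
features = vocos.feature_extractor(y)  # mel-spectrogram features
y_hat = vocos.decode(features)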
Audio Reconstruction from EnCodec Tokens
Vocos can also reconstruct audio from EnCodec tokens, with support for several target bandwidths:
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2]) # 6 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
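The same model supports copy-synthesis directly from a waveform by passing bandwidth_id to the forward call; ids 0 through 3 correspond to 1.5, 3, 6, and 12 kbps:
import torch
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
bandwidth_id = torch.tensor([2])  # 0..3 -> 1.5, 3, 6, 12 kbps
y_hat = vocos(y, bandwidth_id=bandwidth_id)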
Integration with Text-to-Audio Models
Vocos works efficiently with text-to-audio models like Bark. An illustrative example of this integration can be found in the project's example notebook.
Pre-trained Models
Vocos offers pre-trained models trained on different datasets:
- charactr/vocos-mel-24khz: Trained on the LibriTTS dataset, with 13.5 million parameters.
- charactr/vocos-encodec-24khz: A leaner model with 7.9 million parameters, trained for 2 million iterations on the DNS Challenge dataset.
Training Your Model
To train your own version of Vocos, begin by creating filelists for the training and validation sets:
find $TRAIN_DATASET_DIR -name "*.wav" > filelist.train
find $VAL_DATASET_DIR -name "*.wav" > filelist.val
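find writes one path per line, so a quick sanity check is to count the entries in each filelist:
wc -l filelist.train filelist.val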
Next, customize the configuration file (e.g., vocos.yaml), and kickstart the training process:
python train.py -c configs/vocos.yaml
For finer-grained control over training, refer to the PyTorch Lightning documentation.
License and Citation
Vocos is distributed under the MIT license. If Vocos contributes significantly to your work, please cite the corresponding research paper:
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
Vocos stands out as a promising tool for audio professionals and enthusiasts alike, simplifying high-quality audio synthesis and bridging the gap between traditional and modern vocoder techniques.