Vocos: Bridging the Gap in Audio Synthesis
Vocos is a neural vocoder that uses Generative Adversarial Networks (GANs) for high-quality audio synthesis. Unlike traditional neural vocoders that generate waveform samples directly in the time domain, Vocos predicts spectral coefficients, which are converted back to audio with the inverse short-time Fourier transform (ISTFT). This makes reconstruction fast while keeping the output natural-sounding.
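Loosely speaking, the final step can be illustrated with PyTorch's built-in ISTFT: given a complex spectrogram, a single call recovers the waveform. The shapes and STFT parameters below are illustrative assumptions (with random values standing in for model predictions), not Vocos's actual configuration:
import torch
# Illustrative STFT parameters, not Vocos's actual configuration
n_fft, hop_length = 1024, 256
freq_bins, frames = n_fft // 2 + 1, 200
# Pretend a model predicted a magnitude and a phase per frequency bin and frame
magnitude = torch.rand(1, freq_bins, frames)
phase = torch.rand(1, freq_bins, frames) * 2 * torch.pi
spectrogram = magnitude * torch.exp(1j * phase)  # complex spectral coefficients
# One inverse STFT turns the coefficients back into a waveform
waveform = torch.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
)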
Installation
For those interested in using Vocos only for audio inference, installation is straightforward:
pip install vocos
If you plan to train the model, additional dependencies can be installed with:
pip install "vocos[train]"
Usage
Reconstructing Audio from Mel-Spectrogram
With Vocos, synthesizing audio from a mel-spectrogram is seamless. Here’s a simple example in Python:
import torch
from vocos import Vocos

# Load the pretrained mel-spectrogram model (24 kHz output)
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Random placeholder input: (batch, mel bins, frames)
mel = torch.randn(1, 100, 256)
audio = vocos.decode(mel)
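The decoder returns a tensor of shape (batch, samples) at 24 kHz, so with a batch of one it can be written straight to disk; the output filename below is just an example:
import torchaudio
# A (1, T) tensor is already a valid mono waveform for torchaudio
torchaudio.save("reconstructed.wav", audio, sample_rate=24000)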
Copy-Synthesis from a File
Vocos can also reconstruct audio directly from a file. Here’s how to do it:
import torchaudio
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
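The one-line call above bundles two steps, feature extraction and decoding. A sketch of the equivalent two-step form, handy if you want to inspect or modify the mel features in between:
features = vocos.feature_extractor(y)  # mel-spectrogram features
y_hat = vocos.decode(features)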
Audio Reconstruction from EnCodec Tokens
Vocos can also reconstruct audio from EnCodec tokens, with support for several target bandwidths:
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2]) # 6 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
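The same model supports copy-synthesis directly from a waveform by passing bandwidth_id to the forward call; ids 0 through 3 correspond to 1.5, 3, 6, and 12 kbps:
import torch
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
bandwidth_id = torch.tensor([2])  # 0..3 -> 1.5, 3, 6, 12 kbps
y_hat = vocos(y, bandwidth_id=bandwidth_id)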
Integration with Text-to-Audio Models
Vocos works efficiently with text-to-audio models like Bark. An illustrative example of this integration can be found in the project's example notebook.
Pre-trained Models
Vocos offers pre-trained models trained on different datasets:
- charactr/vocos-mel-24khz: Trained on the LibriTTS dataset, with 13.5 million parameters.
- charactr/vocos-encodec-24khz: A leaner model with 7.9 million parameters, trained for 2 million iterations on the DNS Challenge dataset.
Training Your Model
To train your own version of Vocos, begin by creating filelists for the training and validation sets:
find $TRAIN_DATASET_DIR -name "*.wav" > filelist.train
find $VAL_DATASET_DIR -name "*.wav" > filelist.val
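find writes one path per line, so a quick sanity check is to count the entries in each filelist:
wc -l filelist.train filelist.val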
Next, customize the configuration file (e.g., vocos.yaml), and kickstart the training process:
python train.py -c configs/vocos.yaml
For finer-grained control over training, refer to the PyTorch Lightning documentation.
License and Citation
Vocos is distributed under the MIT license. If Vocos contributes significantly to your work, please cite the corresponding research paper:
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
Vocos stands out as a promising tool for audio professionals and enthusiasts alike, simplifying high-quality audio synthesis and bridging the gap between traditional and modern vocoder techniques.