Taming Visually Guided Sound Generation
Overview
SpecVQGAN is a novel approach to generating sounds guided by visual content, presented at BMVC 2021. The method compresses a large set of training data into a compact collection of representative vectors, known as a codebook, which can then be sampled to create new sounds.
SpecVQGAN builds on the architecture of VQGAN, itself an extension of VQVAE, to train its codebook on spectrograms. The project uses this system to generate sounds that are long, high-fidelity, and relevant to the visual content from which they are derived.
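At the core of the codebook idea is a simple vector-quantization step: each continuous feature produced by the encoder is replaced by its nearest entry in a learned embedding table. The sketch below illustrates that lookup in PyTorch with made-up tensor sizes; it is a minimal illustration of VQGAN/VQVAE-style quantization, not the project's actual module.

```python
import torch

def quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z:        (batch, n_vectors, dim) continuous encoder outputs
    codebook: (codebook_size, dim) learned embedding table
    Returns the quantized vectors and the chosen indices.
    """
    # Pairwise distances between every z vector and every codebook entry
    distances = torch.cdist(z, codebook.unsqueeze(0).expand(z.shape[0], -1, -1))
    indices = distances.argmin(dim=-1)   # (batch, n_vectors)
    z_q = codebook[indices]              # (batch, n_vectors, dim)
    # Straight-through estimator so gradients can flow back to the encoder
    z_q = z + (z_q - z).detach()
    return z_q, indices

# Toy usage with illustrative sizes (not the project's configuration)
codebook = torch.randn(1024, 256)        # 1024 codes of dimension 256
z = torch.randn(2, 53, 256)              # e.g. flattened spectrogram features
z_q, indices = quantize(z, codebook)
```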
Technical Approach
SpecVQGAN employs a two-stage training process:
- Spectrogram Codebook Training: The model is first trained on spectrograms, visual representations of sound, using a method similar to VQGAN. This stage results in a codebook that captures the essential characteristics of the training sounds.
- Transformer Training: A transformer akin to GPT-2 is then trained to select codebook entries conditioned on visual features, producing sounds autoregressively, that is, one step of the sound sequence at a time given the visual inputs (see the sampling sketch after this list).
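The sketch below shows the general shape of such autoregressive sampling: visual features act as a conditioning prefix, and the transformer repeatedly predicts a distribution over codebook indices from which the next code is drawn. The `transformer` interface here is hypothetical; the project's actual model and sampling utilities differ in detail.

```python
import torch

@torch.no_grad()
def sample_codebook_indices(transformer, visual_features, n_steps, temperature=1.0):
    """Autoregressively sample a sequence of spectrogram codebook indices.

    transformer:     a GPT-style model returning next-token logits
                     (hypothetical interface, not the project's exact API)
    visual_features: conditioning tokens extracted from the video frames
    n_steps:         number of spectrogram codes to generate
    """
    generated = torch.empty(visual_features.shape[0], 0, dtype=torch.long)
    for _ in range(n_steps):
        # Condition on the visual prefix plus everything generated so far
        logits = transformer(visual_features, generated)      # (batch, vocab)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_code = torch.multinomial(probs, num_samples=1)   # (batch, 1)
        generated = torch.cat([generated, next_code], dim=1)
    return generated  # indices later decoded to a spectrogram, then to a waveform
```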
Environment Preparation
To work with SpecVQGAN, setup begins with cloning the project repository from GitHub. The environment required for training and evaluation can be prepared with either Conda or Docker. Instructions for setting up and testing the environment are provided, ensuring compatibility with machine learning frameworks such as PyTorch and hardware acceleration via CUDA.
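After installing either environment, a quick sanity check that PyTorch can see the GPU saves time later. The snippet below is a generic check, not the repository's own test script.

```python
import torch

# Basic sanity check that the deep-learning stack is usable
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```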
Data Utilization
The project utilizes datasets such as VAS and VGGSound, which consist of videos paired with audio. After downloading the data, users can either rely on pre-extracted features or extract them manually from the video files with a BN Inception or ResNet50 backbone.
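For a feel of what frame-level feature extraction looks like, the sketch below pulls 2048-dimensional features from an ImageNet-pretrained ResNet50 via torchvision. It only approximates the idea; the repository ships its own extraction scripts with their own preprocessing.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Load an ImageNet-pretrained ResNet-50 and drop its classification head,
# keeping the 2048-dimensional pooled features.
resnet = models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """frames: list of PIL images sampled from a video (illustrative helper)."""
    batch = torch.stack([preprocess(f) for f in frames])  # (n_frames, 3, 224, 224)
    feats = feature_extractor(batch)                      # (n_frames, 2048, 1, 1)
    return feats.flatten(1)                               # (n_frames, 2048)
```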
Pretrained Models
SpecVQGAN offers pretrained models that can be used for inference or as a starting point for further training. Models are available for different configurations, depending on whether visual features come from BN Inception or ResNet50. This flexibility lets users experiment with different feature extractors and compare the resulting sound quality with metrics such as FID (Fréchet Inception Distance) and MKL (mean Kullback–Leibler divergence).
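FID compares the mean and covariance of feature embeddings computed from real and generated samples. A minimal NumPy/SciPy rendering of the standard formula is shown below; it is generic and does not reproduce the project's own evaluation pipeline or its choice of feature extractor.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between two sets of feature vectors (rows = samples)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```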
Evaluation and Tools
For evaluating the performance of the SpecVQGAN models, various benchmarks and tools are available. These include a sampling tool that visualizes the reconstruction results, helping users understand how closely the generated sounds mimic the actual audio from the datasets.
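A side-by-side plot of an input spectrogram and its codebook reconstruction is often the quickest way to judge reconstruction quality. The helper below is a hypothetical matplotlib sketch; the arrays it expects would come from the project's sampling tool or an equivalent source.

```python
import matplotlib.pyplot as plt

def show_reconstruction(original_spec, reconstructed_spec):
    """Plot an input spectrogram next to its codebook reconstruction.

    Both arguments are 2-D arrays (frequency bins x time frames); how they are
    produced depends on the sampling tool used.
    """
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    for ax, spec, title in zip(axes, [original_spec, reconstructed_spec],
                               ["Original", "Reconstruction"]):
        ax.imshow(spec, origin="lower", aspect="auto")
        ax.set_title(title)
        ax.set_xlabel("Time frames")
    axes[0].set_ylabel("Frequency bins")
    plt.tight_layout()
    plt.show()
```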
Conclusion
SpecVQGAN provides an innovative approach to sound generation that couples audio synthesis directly to visual input. By combining a learned spectrogram codebook with a visually conditioned transformer, and by leveraging the frameworks provided in the project, users can explore audiovisual synthesis and create rich, relevant soundscapes from visual cues.