SpecVQGAN
SpecVQGAN generates sound conditioned on visual input by operating on a codebook of spectrogram representations. A Spectrogram VQGAN first learns this codebook; a transformer is then trained to sample codebook entries autoregressively, guided by visual features, producing coherent, high-fidelity audio across a wide range of data classes. Because generation works on compact codebook indices rather than raw audio, the approach scales to long, high-quality sound sequences, making it useful for multimedia and audio-synthesis applications.

The repository provides step-by-step instructions for environment setup (via Conda or Docker) and data preparation, together with pretrained codebooks and transformers for sampling and evaluation, so that users can generate sounds from visual stimuli out of the box and build on this approach to conditional sound generation.
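To make the two-stage design concrete, below is a minimal, self-contained sketch of the second stage: a transformer prior over codebook indices, conditioned on a visual feature vector and sampled one index at a time. Everything here is illustrative, not the repository's actual API; the class names, dimensions (`CODEBOOK_SIZE`, `SEQ_LEN`, `VIS_DIM`), and sampling loop are assumptions chosen only to show the idea.

```python
# Hypothetical sketch of a visually conditioned transformer prior over
# spectrogram-codebook indices. All names and shapes are illustrative.
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # number of spectrogram code vectors (assumed)
SEQ_LEN = 265          # codes per generated spectrogram (assumed)
VIS_DIM = 2048         # visual feature dimension from a video backbone (assumed)
EMB_DIM = 256

class ToyConditionalTransformer(nn.Module):
    """Autoregressive prior over codebook indices, conditioned on visual features."""
    def __init__(self):
        super().__init__()
        self.vis_proj = nn.Linear(VIS_DIM, EMB_DIM)  # project condition into token space
        self.tok_emb = nn.Embedding(CODEBOOK_SIZE, EMB_DIM)
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN + 1, EMB_DIM))
        layer = nn.TransformerEncoderLayer(EMB_DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(EMB_DIM, CODEBOOK_SIZE)

    def forward(self, vis_feats, codes):
        # Prepend the visual condition as the first "token", then predict next codes
        # under a causal mask so each position only attends to its prefix.
        cond = self.vis_proj(vis_feats).unsqueeze(1)       # (B, 1, E)
        x = torch.cat([cond, self.tok_emb(codes)], dim=1)  # (B, 1+T, E)
        x = x + self.pos_emb[:, : x.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        x = self.blocks(x, mask=mask)
        return self.head(x)                                # logits over the codebook

@torch.no_grad()
def sample_codes(model, vis_feats, temperature=1.0):
    """Sample a full code sequence, one index at a time."""
    codes = torch.zeros(vis_feats.size(0), 0, dtype=torch.long)
    for _ in range(SEQ_LEN):
        logits = model(vis_feats, codes)[:, -1] / temperature
        next_code = torch.multinomial(logits.softmax(-1), 1)
        codes = torch.cat([codes, next_code], dim=1)
    return codes  # indices a VQGAN decoder would turn into a spectrogram

model = ToyConditionalTransformer()
vis = torch.randn(1, VIS_DIM)  # stand-in for extracted visual features
print(sample_codes(model, vis).shape)  # torch.Size([1, 265])
```

In the full pipeline, the sampled indices would be decoded by the Spectrogram VQGAN's decoder into a spectrogram and then converted to a waveform by a vocoder; the toy model above shows only the conditional autoregressive sampling step.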