snac - Streamline Audio with Multi-Scale Neural Codec Technology

Introduction to SNAC: Multi-Scale Neural Audio Codec

SNAC (Multi-Scale Neural Audio Codec) compresses audio into discrete codes at very low bitrates. This makes it an efficient option for audio processing in applications ranging from music production to speech recognition.

Audio Samples

SNAC handles both music and speech. To judge its quality, listen to the sample audio published on GitHub, which includes music and speech examples processed through SNAC.

How SNAC Works

Like SoundStream, EnCodec, and DAC, SNAC encodes audio into hierarchical tokens. What sets SNAC apart is that its coarse tokens are sampled less frequently, so each one covers a longer time span. This saves bitrate and also helps audio generation with language models, where the context window limits how much audio the model can see at once. For example, with coarse tokens at about 10 Hz and a context window of 2048 tokens, the model can keep track of the structure of an audio track for roughly three minutes (2048 tokens / 10 Hz ≈ 205 seconds).
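The arithmetic behind that estimate can be checked with a short sketch. The function name and the 80 Hz comparison rate are hypothetical, chosen only to illustrate how a lower token rate stretches the time covered by a fixed context window; only the 10 Hz and 2048 figures come from the text above.

```python
def context_span_seconds(context_tokens: int, token_rate_hz: float) -> float:
    """Seconds of audio covered by a window of tokens emitted at a fixed rate."""
    return context_tokens / token_rate_hz


# Figures from the text: about 10 Hz coarse tokens, context window of 2048.
coarse_span = context_span_seconds(2048, 10.0)

# Hypothetical finer token rate, used only for comparison.
fine_span = context_span_seconds(2048, 80.0)

print(f"coarse tokens: {coarse_span:.1f} s ({coarse_span / 60:.1f} min)")
print(f"fine tokens:   {fine_span:.1f} s")
```

The same window covers eight times less audio at the hypothetical 80 Hz rate, which is why the coarse level is the useful one for long-range structure.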

Pretrained Models

The pretrained models support only single-channel (mono) audio. Three models are available, each suited to a different kind of material:

hubertsiuzdak/snac_24khz: Operates at 0.98 kbps, with a sample rate of 24 kHz. It is ideal for speech processing and contains 19.8 million parameters.
hubertsiuzdak/snac_32khz: Works at 1.9 kbps and a 32 kHz sample rate. It is recommended for music or sound effect applications and features 54.5 million parameters.
hubertsiuzdak/snac_44khz: Operates at 2.6 kbps with a 44 kHz sample rate. It also targets music and sound effects and has 54.5 million parameters.
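To put these bitrates in perspective, the sketch below compares each model with uncompressed mono audio. It assumes a 16-bit PCM baseline, which is an assumption of this example and not something stated above, and it takes the listed sample rates at face value (44 kHz as 44,000 Hz). The dictionary and function names are illustrative only.

```python
# Bitrates and sample rates as listed above.
MODELS = {
    "hubertsiuzdak/snac_24khz": {"kbps": 0.98, "sample_rate_hz": 24_000},
    "hubertsiuzdak/snac_32khz": {"kbps": 1.9, "sample_rate_hz": 32_000},
    "hubertsiuzdak/snac_44khz": {"kbps": 2.6, "sample_rate_hz": 44_000},
}


def compression_ratio(sample_rate_hz: int, codec_kbps: float,
                      bits_per_sample: int = 16) -> float:
    """Ratio of mono PCM bitrate to codec bitrate (16-bit PCM is an assumption)."""
    pcm_kbps = sample_rate_hz * bits_per_sample / 1000
    return pcm_kbps / codec_kbps


for name, spec in MODELS.items():
    ratio = compression_ratio(spec["sample_rate_hz"], spec["kbps"])
    print(f"{name}: roughly {ratio:.0f}x smaller than 16-bit mono PCM")
```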

How to Use SNAC

For those looking to integrate SNAC into their projects, the installation is straightforward. It can be installed using pip:

pip install snac

To encode and decode audio using SNAC in Python, the following code snippet illustrates the process:

import torch
from snac import SNAC

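device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model = SNAC.from_pretrained("hubertsiuzdak/snac_32khz").eval().to(device)
audio = torch.randn(1, 1, 32000).to(device)  # Placeholder for actual audio: (batch, channels, samples)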

with torch.inference_mode():
    codes = model.encode(audio)
    audio_hat = model.decode(codes)

Alternatively, encode and reconstruct audio in a single call:

with torch.inference_mode():
    audio_hat, codes = model(audio)

Note that codes is a list of token sequences of different lengths, one per temporal resolution, ordered from coarsest to finest. For the 32,000-sample input above:

>>> [code.shape[1] for code in codes]
[12, 24, 48, 96]
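In this output each level is twice as long as the one before it, so every coarse token lines up with 2, 4, and 8 tokens from the finer levels. One possible way to turn such a hierarchy into a single sequence for a language model is to interleave the levels one coarse frame at a time. The sketch below is an illustration under that assumption, not necessarily how SNAC's author or any particular model does it. It uses plain Python lists with stand-in token IDs, and interleave_frames is a hypothetical helper, not part of the snac package.

```python
def interleave_frames(levels):
    """Flatten multi-scale token lists into one sequence, one coarse frame at a time.

    levels[0] is the coarsest level; every other level must be an integer
    multiple of its length.
    """
    n_frames = len(levels[0])
    tokens_per_frame = []
    for level in levels:
        per_frame, remainder = divmod(len(level), n_frames)
        if remainder:
            raise ValueError("level length is not a multiple of the coarsest level")
        tokens_per_frame.append(per_frame)

    flat = []
    for frame in range(n_frames):
        for level, per_frame in zip(levels, tokens_per_frame):
            start = frame * per_frame
            flat.extend(level[start:start + per_frame])
    return flat


# Stand-in tokens: (level_index, position) tuples with the lengths shown above.
lengths = [12, 24, 48, 96]
levels = [[(i, t) for t in range(n)] for i, n in enumerate(lengths)]

flat = interleave_frames(levels)
print(len(flat))   # total number of tokens across all levels
print(flat[:3])    # first coarse token, then its two level-1 tokens
```

Each coarse frame contributes 1 + 2 + 4 + 8 = 15 tokens here. With real model output and a batch size of one, each level could be converted to a list first, for example with code[0].tolist().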

Acknowledgements

The module definitions for SNAC are adapted from the Descript Audio Codec (DAC).

Overall, SNAC is a promising approach to audio compression where both low bitrate and high audio quality matter. Its multi-scale token sampling and pretrained models for speech, music, and sound effects make it usable across a range of audio processing tasks.