AudioDec - Efficient Real-time Neural Audio Codec for High-Fidelity Sound

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

Introduction

AudioDec is an open-source, high-fidelity neural audio codec for efficient audio streaming and compression. Its primary objective is to deliver high audio quality at a low bitrate and with minimal latency, which makes it well suited to real-time applications such as telecommunications.

Key Features

Supports high-quality 48 kHz mono speech audio.
Operates at a low bitrate of 12.8 kbps.
Offers exceptionally low decoding latency, achieving approximately 6 ms on a GPU and about 10 ms using a CPU with four threads.
Implements a two-stage training process designed for efficiency, allowing new encoder training in just a few hours with pre-trained models.
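
As a sanity check on the 12.8 kbps figure, the bitrate arithmetic can be sketched as below. The hop size (300 samples), the number of codebooks (8), and the codebook size (1024 entries) are illustrative assumptions, chosen because they reproduce 12.8 kbps; none of them is stated in this overview.

```python
import math


def codec_bitrate_bps(sample_rate_hz, hop_samples, n_codebooks, codebook_size):
    """Bitrate of a codec that emits one index per codebook for every hop."""
    frames_per_second = sample_rate_hz / hop_samples
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


def pcm_bitrate_bps(sample_rate_hz, bits_per_sample=16):
    """Bitrate of uncompressed mono PCM audio."""
    return sample_rate_hz * bits_per_sample


# Assumed configuration (hypothetical values, see the note above).
bitrate = codec_bitrate_bps(48_000, hop_samples=300, n_codebooks=8, codebook_size=1024)
ratio = pcm_bitrate_bps(48_000) / bitrate

print(f"{bitrate / 1000:.1f} kbps, {ratio:.0f}x smaller than 16-bit PCM")
```

Under these assumptions the codec emits 160 frames per second at 80 bits each, which is about 60 times less data than uncompressed 16-bit mono PCM at 48 kHz (768 kbps).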

Detailed Overview

The Essential Attributes

In applications like live communication, an effective audio codec must fulfill three critical requirements:

Compression: The codec should minimize the required bitrate for signal transmission without sacrificing quality.
Latency: The encoding and decoding processes should be swift, ensuring seamless communication with barely noticeable or no delay.
Quality: The codec should offer high reconstruction quality, producing natural and clear sound.

AudioDec addresses all three requirements: it reconstructs natural-sounding 48 kHz speech at 12.8 kbps, with a decoding latency of approximately 6 ms on a GPU and about 10 ms on a CPU with four threads.

Modes of AudioDec

AudioDec operates in two distinct modes:

AutoEncoder Mode (symAD):
- Initial training of an AutoEncoder-based codec model using metric-based losses.
- Subsequent refinement in which components such as the encoder are frozen and the decoder is trained with discriminators.
AutoEncoder + Vocoder Mode (AD v0,1,2):
- Recommended for most applications.
- Involves extracting statistics from the encoder outputs and training the vocoder using these statistics.
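
The statistics step in the AutoEncoder + Vocoder mode can be pictured with the following minimal sketch. The text does not say which statistics are extracted; this example assumes a global per-dimension mean and standard deviation, used to normalize the codes before they reach the vocoder. The function names `global_stats` and `normalize` are hypothetical and not part of AudioDec.

```python
import math
from typing import List, Sequence, Tuple


def global_stats(codes: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and standard deviation over all code vectors."""
    n = len(codes)
    dim = len(codes[0])
    mean = [sum(vec[d] for vec in codes) / n for d in range(dim)]
    std = [
        math.sqrt(sum((vec[d] - mean[d]) ** 2 for vec in codes) / n)
        for d in range(dim)
    ]
    return mean, std


def normalize(codes, mean, std, eps=1e-8):
    """Shift and scale each dimension using the precomputed statistics."""
    return [
        [(vec[d] - mean[d]) / (std[d] + eps) for d in range(len(mean))]
        for vec in codes
    ]


# Toy "encoder outputs": three frames, two dimensions each.
codes = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
mean, std = global_stats(codes)
normalized = normalize(codes, mean, std)
```

Computing the statistics once over the whole training set, rather than per utterance, keeps the normalization identical at training and inference time.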

Real-world Implementation

AudioDec supports both real-time streaming and file-based encoding/decoding, and the project includes a demo for each. Its modular architecture makes it straightforward to train, test, and deploy audio codec models.
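
The relationship between streaming and file-based processing can be sketched as follows: the file path simply feeds chunks of a complete signal through the same per-chunk encode/decode loop that a live stream would use. Everything here is hypothetical; `toy_encode` and `toy_decode` are an 8-bit uniform quantizer standing in for the neural encoder and decoder, and the chunk size of 300 samples is an assumption. A real streaming neural codec typically also carries context between chunks, which this sketch omits.

```python
import math
from typing import Callable, Iterable, Iterator, List, Sequence


def split_into_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


def stream_codec(chunks: Iterable[List[float]],
                 encode: Callable[[List[float]], List[int]],
                 decode: Callable[[List[int]], List[float]]) -> Iterator[List[float]]:
    """Encode and decode one chunk at a time, as a live stream would."""
    for chunk in chunks:
        codes = encode(chunk)  # in a real system these codes are transmitted
        yield decode(codes)


def process_file(samples, encode, decode, chunk_size):
    """File mode: run the same streaming loop over a complete signal."""
    output: List[float] = []
    for decoded in stream_codec(split_into_chunks(samples, chunk_size), encode, decode):
        output.extend(decoded)
    return output


def toy_encode(chunk):
    """Stand-in encoder: clip to [-1, 1] and quantize to 8-bit indices."""
    return [round((min(1.0, max(-1.0, x)) + 1.0) * 127.5) for x in chunk]


def toy_decode(codes):
    """Stand-in decoder: map 8-bit indices back to [-1, 1]."""
    return [q / 127.5 - 1.0 for q in codes]


signal = [math.sin(2 * math.pi * 440 * n / 48_000) for n in range(1000)]
restored = process_file(signal, toy_encode, toy_decode, chunk_size=300)
```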

Training and Testing

AudioDec follows a structured approach for training and testing:

Complete Pipeline: Users can train the full AudioDec system by preparing datasets, modifying configurations and paths, and executing the process with provided scripts.
Only AutoEncoder: For specific tasks, the AutoEncoder can be trained and tested independently.

Additional Functionality: Denoising

AudioDec also supports audio denoising. Only the encoder is updated, using pairs of noisy and clean audio samples; the decoder remains unchanged, so the existing decoder can be reused without retraining.
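
The idea of updating only the encoder while the decoder stays fixed can be illustrated with a deliberately tiny toy model. This is a sketch under strong assumptions, not AudioDec's training code: the "encoder" is a scalar affine map, the "decoder" is a fixed gain, the "noise" is a constant offset of 0.3, and training is plain gradient descent on mean squared error against the clean target.

```python
from typing import List, Tuple

DECODER_GAIN = 0.5  # frozen toy "decoder" parameter; never updated below


def decode(code: float) -> float:
    """Fixed decoder: a constant gain."""
    return DECODER_GAIN * code


def train_encoder(pairs: List[Tuple[float, float]],
                  steps: int = 500, lr: float = 0.5) -> Tuple[float, float]:
    """Fit encoder (scale, shift) so that decode(encode(noisy)) matches clean."""
    scale, shift = 1.0, 0.0  # encoder: code = scale * noisy + shift
    n = len(pairs)
    for _ in range(steps):
        grad_scale = 0.0
        grad_shift = 0.0
        for noisy, clean in pairs:
            error = decode(scale * noisy + shift) - clean
            # Gradients of the mean squared error w.r.t. encoder parameters only.
            grad_scale += 2.0 * error * DECODER_GAIN * noisy / n
            grad_shift += 2.0 * error * DECODER_GAIN / n
        scale -= lr * grad_scale
        shift -= lr * grad_shift
    return scale, shift


# (noisy, clean) training pairs; the offset stands in for real noise.
clean_samples = [-1.0, -0.5, 0.0, 0.5, 1.0]
pairs = [(clean + 0.3, clean) for clean in clean_samples]
scale, shift = train_encoder(pairs)
```

In this toy setting the encoder learns to undo both the offset and the decoder's fixed gain, while the decoder parameter is never touched.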

Pre-trained Models

AudioDec provides pre-trained models for immediate use, covering several sampling rates and bitrates and trained on datasets such as VCTK and LibriTTS.

Conclusion

AudioDec is a comprehensive tool for those seeking high-fidelity, low-latency audio codecs. Its open-source nature, coupled with an efficient and flexible architecture, makes it a powerful choice for research and practical applications in audio processing.