Introduction to the LLM-Codec Project
The LLM-Codec project presents an approach to bridging the gap between the text and audio modalities using large language models (LLMs). Traditional LLMs excel at text-based applications but struggle with tasks that require cross-modal understanding unless they are extensively fine-tuned. This project addresses that limitation by enabling LLMs to perform a variety of audio-related tasks without any updates to their parameters.
Central to this approach is a purpose-built audio codec model, named LLM-Codec, which compresses audio into sequences of discrete tokens drawn from the LLM's existing vocabulary. Audio is thereby presented to the LLM as a new "language" it can already read, narrowing the gap between the text and audio modalities: in practice, an audio clip is encoded as a text-like token sequence that the LLM processes much as it processes ordinary text.
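To make the idea concrete, here is a minimal, self-contained sketch. The toy vocabulary, the token IDs, and the prompt layout are all illustrative placeholders rather than the repository's actual code or data; only the underlying idea, that the codec's output IDs index the LLM's own vocabulary, comes from the project description.

# Toy illustration only: a stand-in vocabulary and made-up codec output,
# showing that an encoded clip is just another token sequence to the LLM.
toy_vocab = {0: "<s>", 1: "the", 2: "ing", 3: "an", 4: "ly", 5: "ow"}

# Hypothetical LLM-Codec encoder output for a short clip: every ID is a
# valid index into the LLM's existing vocabulary.
audio_token_ids = [3, 5, 2, 1, 4]

# Viewed through the codec, the clip reads like a (nonsensical) sentence.
audio_as_text = "".join(toy_vocab[i] for i in audio_token_ids)

# It can therefore be spliced into an ordinary text prompt.
prompt = f"Here is an audio clip: {audio_as_text}\nWhat emotion does it convey?"
print(prompt)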
Applying this method within the project has shown promising results across a range of audio tasks, including speech emotion classification, general audio classification, text-to-speech generation, and speech enhancement. By turning audio signals into high-quality token representations, LLM-Codec lets the LLM handle these tasks from only a few in-context examples, demonstrating the power of cross-modal in-context learning.
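As a sketch of how such a few-shot, cross-modal prompt might be assembled (the helper function, the placeholder token strings, and the prompt layout are assumptions for illustration, not the format used by the repository's evaluation scripts):

# Illustrative only: the "audio" strings stand in for codec-token sequences
# that have already been mapped into the LLM's vocabulary.
def build_few_shot_prompt(demos, query_audio, task="emotion"):
    # Interleave (audio tokens, label) demonstrations before the query clip.
    lines = [f"Task: classify the {task} of each audio clip."]
    for audio_tokens, label in demos:
        lines.append(f"Audio: {audio_tokens}\nLabel: {label}")
    lines.append(f"Audio: {query_audio}\nLabel:")
    return "\n\n".join(lines)

demos = [
    ("anowingthely", "happy"),   # placeholder codec-token string + label
    ("thelowanning", "sad"),
]
print(build_few_shot_prompt(demos, "owlyantheing"))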
How to Use LLM-Codec?
Using the LLM-Codec involves a few straightforward steps:
- Download the Model: Begin by obtaining the model weights with the following command:
  wget https://huggingface.co/Dongchao/2024/resolve/main/semantic_acoustic.pth
- Download LLAMA 2: Next, obtain the LLAMA 2 model weights (7B configuration) by following the instructions in the official LLAMA GitHub repository.
- Run Inference: Finally, execute the inference script for testing and evaluation (a minimal sanity check of the downloaded checkpoint is sketched just after this list):
  python infer.py
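Before running the full pipeline, you may want to confirm that the downloaded codec weights are intact. The snippet below only assumes that semantic_acoustic.pth is a standard PyTorch checkpoint; the structure of its contents is not documented here, so the printed keys are simply whatever the file holds.

# Quick sanity check that the codec checkpoint downloaded correctly and is
# readable as a standard PyTorch file.
import torch

ckpt = torch.load("semantic_acoustic.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    # Print a few top-level keys to confirm the file is not truncated.
    print(list(ckpt.keys())[:5])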
Using LLM-Codec with LLAMA 2 (UniAudio 1.5)
For practical application using UniAudio 1.5, execute the following command, adjusting the paths and parameters to suit your data and environment:
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 --master_port=10645 infer_code/eval_accent_understanding_v2.py \
--batch_size 1 \
--max_seq_len 2048 \
--num_workers 0 \
--output_type "next_token_prediction" \
--audio_path "the path of audio folder" \
--file_path tsv/acc_9way_1_shot.scp \
--vq_config_path config.yaml \
--output_dir log_eval_few_shot/7B_output \
--llama_model_path llama_inference/llama-2-7b \
--induction 1 \
--codec_ckpt "llm-codec.pth"
Demos
For a hands-on experience and to hear the generated audio results, you may explore the demo folder provided in the repository.
Acknowledgements
The LLM-Codec project builds on, and draws inspiration from, several open-source contributions; these resources have played a pivotal role in shaping the LLM-Codec model and its capabilities.