Listen, Think, and Understand (LTU)
Introduction
Listen, Think, and Understand (LTU) is a project that has developed a large language model (LLM) designed to bridge the gap between audio/speech perception and understanding. Developed by researchers at MIT and the MIT-IBM Watson AI Lab, LTU represents a significant step forward in how machines comprehend auditory information.
The LTU project includes two main components: LTU, which supports audio-only inputs, and LTU-AS, which expands support to both audio and speech inputs. These models not only excel in closed-ended tasks—like transcribing or identifying audio—but can also handle open-ended questions, offering insightful and contextual responses based on a given audio clip. The interactive demos for both LTU and LTU-AS showcase their impressive capabilities.
OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset
The datasets underpinning LTU and LTU-AS are OpenAQA and OpenASQA, respectively. These datasets consist of (question, answer, audio_id) tuples drawn from existing public audio datasets. Users must download the actual audio files themselves, but the project provides structured access to the question-answer data, covering both closed-ended and open-ended questions, through Dropbox links for easy download and integration.
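For orientation, here is a minimal sketch of how such tuples might be joined with locally downloaded audio in Python. The file name, field names, and directory layout are assumptions for illustration, not the dataset's exact schema.

```python
import json
from pathlib import Path

# Hypothetical paths: the actual file names in the released data may differ.
QA_FILE = Path("openaqa_tuples.json")   # question-answer tuples (assumed name)
AUDIO_ROOT = Path("/data/audio")        # where the raw audio was downloaded

with open(QA_FILE) as f:
    samples = json.load(f)              # list of dicts, one per (question, answer, audio_id) tuple

# Keep only tuples whose audio file is actually present on disk.
resolved = []
for s in samples:
    wav_path = AUDIO_ROOT / f"{s['audio_id']}.wav"   # field names assumed from the tuple structure above
    if wav_path.exists():
        resolved.append({"question": s["question"], "answer": s["answer"], "audio": str(wav_path)})

print(f"{len(resolved)} / {len(samples)} tuples have local audio available")
```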
Setting Up the Virtual Environment
To use the LTU or LTU-AS models, users need to set up a dedicated virtual environment. This ensures compatibility and avoids conflicts with dependencies. Separate environments must be established for LTU and LTU-AS, and the process includes creating a conda environment and installing custom versions of key libraries like Hugging Face Transformers, PEFT, and OpenAI Whisper.
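As a quick sanity check after setup, a short script along these lines can confirm that the key libraries resolve inside the activated environment. The distribution names below are assumptions, since the project installs customized builds that may register under different names.

```python
# Sanity-check the activated conda environment before running LTU / LTU-AS.
# Distribution names are assumptions: the project ships customized builds of
# these libraries, which may be registered under slightly different names.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = ["transformers", "peft", "openai-whisper", "torch"]

for dist in REQUIRED:
    try:
        print(f"{dist:15s} {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist:15s} MISSING -- re-run the setup for this environment")
```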
Inference Options
Users have multiple avenues for running inference with LTU and LTU-AS:
- HuggingFace Space: This option requires no coding and provides access to interactive demos online.
- API Access: Users can perform batch inference through an API without needing a GPU, which is convenient for processing many clips (a minimal sketch follows this list).
- Local Inference: For those wishing to run the models locally, scripts are available to set up and run inference directly on a user's machine (with or without GPU support).
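The sketch below illustrates what batch inference against a hosted demo might look like using the gradio_client library. The Space identifier, endpoint name, and argument order are placeholders, so consult the project's own API instructions for the exact values.

```python
# A minimal sketch of batch inference through a hosted demo's API.
# The Space name, endpoint name, and call signature are placeholders -- check the
# project's API instructions for the exact values.
from gradio_client import Client

client = Client("yuangongfdu/ltu")   # hypothetical Space identifier

questions = [
    "What can be inferred from the audio?",
    "Describe a scene in which this sound might occur.",
]

for q in questions:
    # Each call uploads the clip and returns the model's text answer (assumed signature).
    answer = client.predict("sample_audio.wav", q, api_name="/predict")
    print(q, "->", answer)
```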
Finetuning LTU and LTU-AS
The LTU project supports finetuning of models to adapt them to specific datasets or tasks. Both LTU and LTU-AS come with scripts for finetuning using either sample (toy) data or user-provided data. The process involves setting up the training environment and running scripts to adjust the model's parameters based on the new data.
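Below is a rough sketch of how user data might be assembled into a JSON file for finetuning, mirroring the (question, answer, audio_id) structure of OpenAQA/OpenASQA. The field names are illustrative only and should be matched against the toy data shipped with the finetuning scripts.

```python
import json

# Illustrative finetuning data, modeled on the (question, answer, audio_id) tuples
# described above. Field names are assumptions -- align them with the project's toy data.
my_samples = [
    {
        "audio_id": "/data/audio/dog_bark_01.wav",
        "instruction": "What animal is making the sound, and what might it be reacting to?",
        "output": "A dog is barking, likely reacting to someone approaching the door.",
    },
]

with open("my_finetune_data.json", "w") as f:
    json.dump(my_samples, f, indent=2)

print(f"Wrote {len(my_samples)} training samples to my_finetune_data.json")
```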
Reproducing LTU and LTU-AS Training
Reproducing the training regimen outlined in the LTU papers involves a multi-stage curriculum approach. This approach ensures a thorough adaptation of the models across different types of tasks, progressing from basic classification to more complex open-ended question answering.
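Purely as a schematic, the curriculum idea can be pictured as a small config that is stepped through stage by stage. The stage names, data mixes, and hyperparameters below are illustrative and are not the papers' exact recipe.

```python
# Schematic curriculum driver. Stage names, data mixes, and hyperparameters are
# illustrative only -- the real multi-stage recipe is defined by the project's
# training scripts and papers, not by this sketch.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str          # which subset of OpenAQA/OpenASQA to train on (assumed file names)
    epochs: int
    lr: float

CURRICULUM = [
    Stage("closed_classification", "classification_only.json", 2, 1e-3),
    Stage("closed_all_tasks", "all_closed_ended.json", 2, 1e-4),
    Stage("open_ended_qa", "full_openaqa.json", 1, 1e-4),
]

def run_stage(stage: Stage) -> None:
    # Placeholder for launching the corresponding training script with these settings.
    print(f"[{stage.name}] data={stage.data} epochs={stage.epochs} lr={stage.lr}")

for stage in CURRICULUM:
    run_stage(stage)
```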
Pretrained Models
While many of the provided scripts handle model download automatically, LTU also provides direct access to various checkpoints and pretrained models, enabling users to explore different configurations or starting points for their experiments.
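Once a checkpoint has been downloaded manually, a few lines of PyTorch are enough to inspect it before wiring it into the inference or finetuning scripts. The file name below is a placeholder, and the snippet assumes the file stores a plain state dict.

```python
import torch

# Inspect a downloaded checkpoint. The file name is a placeholder -- substitute the
# checkpoint actually downloaded from the project's links. Assumes a plain state dict.
CKPT_PATH = "ltu_checkpoint.bin"

state_dict = torch.load(CKPT_PATH, map_location="cpu")
print(f"Checkpoint contains {len(state_dict)} tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```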
Through these comprehensive resources and guidelines, LTU and LTU-AS offer cutting-edge technology to integrate sophisticated audio understanding into diverse real-world applications, democratizing access to advanced machine listening capabilities.