VITA: An Open-Source Interactive Multimodal Language Model
VITA is a pioneering project that introduces the first open-source Multimodal Large Language Model (MLLM) able to process and interact with four types of input (video, image, text, and audio) simultaneously. The project aims to bring the advanced interactive experiences and multimodal capabilities typically associated with models like GPT-4 into the open-source domain. Here's a comprehensive overview of what VITA offers:
VITA Overview
VITA stands out for its ability to seamlessly integrate and process data across multiple modalities, making it versatile for practical applications. The key features that define VITA include:
- Omni Multimodal Understanding: VITA excels in understanding and processing inputs in different forms (language, vision, and audio), proving its strong capability on both unimodal and multimodal benchmarks.
- Non-awakening Interaction: Unlike other models, VITA does not require a wake-up command to begin interaction. It can respond directly to user questions based on audio alone.
- Audio Interrupt Interaction: This feature allows real-time tracking and filtering of queries. Users can interrupt the model's response at any time with new questions, and VITA will adapt accordingly.
These capabilities enable VITA to handle pure text or audio inputs as well as video or images combined with text or audio, offering a comprehensive interactive user experience.
Innovative Techniques
VITA employs cutting-edge techniques to enhance its interactive experience:
- State Token: Distinct state tokens mark different input types, allowing VITA to distinguish between effective audio queries, irrelevant background noise, and text queries. This differentiation is taught to the model during training, enabling seamless interaction without activation commands (see the sketch after this list).
- Duplex Scheme: In this setup, two models run concurrently: one generates the answer to the current user query while the other monitors incoming audio. When a new effective query arrives, the two models swap roles, allowing continuous and adaptive query handling (a toy simulation is also shown below).
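To make the state-token idea concrete, here is a minimal Python sketch of how distinct tokens could be prepended to training samples so the model learns whether to answer or stay silent. The token strings and the helper function are illustrative assumptions, not VITA's actual vocabulary or training code.

```python
# Illustrative only: token strings and helper are assumptions, not VITA's
# actual vocabulary or training pipeline.
QUERY_STATE_TOKENS = {
    "effective_audio": "<1>",  # audio containing a real user question -> answer it
    "noisy_audio": "<2>",      # background speech or noise -> produce no answer
    "text": "<3>",             # plain text query -> answer it
}

def build_training_prompt(query: str, query_type: str) -> str:
    """Prepend a state token so the model learns to condition its behavior
    (respond vs. stay silent) on the kind of input it receives."""
    return QUERY_STATE_TOKENS[query_type] + query

# An effective spoken question and an irrelevant audio segment get different
# state tokens; the training target for the noisy case teaches silence.
print(build_training_prompt("What is shown in this image?", "effective_audio"))
print(build_training_prompt("(background chatter)", "noisy_audio"))
```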
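The duplex scheme can be illustrated with a self-contained toy simulation. The ToyModel class and its methods below are invented stand-ins; in VITA's actual deployment, two copies of the full MLLM play these roles.

```python
# Toy simulation of the duplex scheme; ToyModel and its methods are invented
# stand-ins for two concurrently running copies of the full model.
class ToyModel:
    def __init__(self, name: str):
        self.name = name

    def is_effective_query(self, audio_event: str) -> bool:
        # Stand-in for the state-token-based check that separates real
        # questions from background noise.
        return audio_event.startswith("QUERY:")

    def answer(self, query: str):
        # Stand-in for streaming generation of a response.
        yield f"[{self.name}] answering: {query}"

def duplex_chat(audio_events):
    generator, monitor = ToyModel("model-A"), ToyModel("model-B")
    current_answer = iter(())
    for event in audio_events:
        if monitor.is_effective_query(event):
            # A new effective query interrupts the ongoing answer: the
            # monitoring model takes over generation and the roles swap.
            generator, monitor = monitor, generator
            current_answer = generator.answer(event[len("QUERY:"):].strip())
        for chunk in current_answer:
            print(chunk)

duplex_chat([
    "QUERY: What's the weather like?",
    "background chatter",          # ignored: not an effective query
    "QUERY: Actually, tell me a joke instead.",
])
```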
Experimental Results
VITA has undergone extensive evaluation of its capabilities. The reported results include:
- Language evaluation comparing VITA's Mixtral 8x7B backbone against the official Mixtral 8x7B release.
- Analysis of the error rate in Automatic Speech Recognition (ASR) tasks.
- Assessment of capabilities in understanding images and videos.
Training and Installation
For those interested in deploying VITA, the training procedure involves setting up a Python environment, preparing data, and proceeding with continual training using specific model weights and configurations.
- Requirements and Installation: Clone the repository, set up a virtual environment, and install necessary dependencies.
- Data Preparation: Organize training data with the required image, audio, and text inputs, configuring paths accordingly (a hypothetical sample entry is sketched after this list).
- Continual Training: Download necessary model weights and adjust scripts for continued model training.
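As a rough illustration of the data-preparation step, the snippet below writes one hypothetical multimodal training sample to JSON. The field names and placeholder tags follow the LLaVA-style convention that VITA builds on, but they are assumptions here; consult the repository's data documentation for the authoritative schema and paths.

```python
# Hypothetical training sample; field names, placeholder tags, and paths are
# assumptions following LLaVA-style conventions, not VITA's confirmed schema.
import json

sample = {
    "id": "demo_000001",
    "conversations": [
        # <image>/<audio> placeholders mark where the visual and spoken
        # inputs are spliced into the conversation.
        {"from": "human", "value": "<image>\n<audio>\n"},
        {"from": "gpt", "value": "The image shows a busy street at dusk."},
    ],
    "image": "data/images/demo_000001.jpg",    # adjust to your local layout
    "audio": ["data/audio/demo_000001.wav"],   # spoken form of the question
}

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```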
Inference and Demonstrations
VITA offers various modes for interacting with different types of inputs:
- Quick Start: Test VITA with text and audio queries using simple command-line invocations (an illustrative call is sketched after this list).
- Basic and Real-Time Interactive Demos: Demonstrations showcase VITA’s interactive capabilities, including real-time reactions and seamless switching between tasks.
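For the quick start, an invocation along the following lines is typical; the script name and flags here mirror the pattern of the repository's demo scripts but are placeholders, so copy the exact command from the README.

```python
# Placeholder invocation: script name and flags are illustrative of the
# quick-start pattern, not guaranteed to match the current repository.
import subprocess

subprocess.run(
    [
        "python", "video_audio_demo.py",       # demo entry point (check the repo)
        "--model_path", "/path/to/VITA_ckpt",  # downloaded VITA weights
        "--image_path", "asset/example.jpg",   # visual input
        "--audio_path", "asset/question.wav",  # spoken query instead of typed text
    ],
    check=True,
)
```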
Citation and Collaboration
Researchers and developers are encouraged to cite VITA in their work and explore related research projects, enhancing the collaborative development of MLLMs.
Acknowledgements
VITA’s development credits outstanding projects like LLaVA-1.5, Bunny, and others, reflecting a collaborative spirit in advancing MLLMs.
VITA represents a significant step forward in making advanced interactive technologies accessible to developers and researchers worldwide, contributing to open-source innovation.