VTimeLLM: A Video Large Language Model Overview
VTimeLLM, short for "Empower LLM to Grasp Video Moments," is a video large language model designed to understand and reason about video moments, with a particular focus on their precise temporal boundaries. The official implementation is built with PyTorch, and the approach is documented in a research paper available on arXiv.
Latest Updates
The project has seen several exciting updates:
- January 2nd: The code has been refactored to support both the LLaMA and ChatGLM3 architectures. The training data has also been translated into Chinese, and a fine-tuned Chinese version is now available.
- December 14th: The training code and data are released, along with the model checkpoints and datasets.
- December 4th: Launch of the VTimeLLM demo.
Project Overview
VTimeLLM differentiates itself through its ability to perform fine-grained video moment understanding. It uses a boundary-aware, three-stage training strategy (sketched in code after this list):
- Feature Alignment: Uses image-text pairs to align visual features with the language model.
- Temporal Boundary Awareness: Trains on multi-event videos to heighten temporal sensitivity.
- Instruction Tuning: Applies video instruction tuning to improve alignment with human intent and temporal understanding.
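As a rough illustration of how the three stages fit together, the minimal sketch below lays out each stage, its data source, and its goal. This is only an outline for exposition: the stage names are descriptive labels assumed here, not the repository's actual scripts or API.

# Illustrative outline of the three-stage schedule; names are assumptions for
# exposition, not the repository's real training interface.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str   # descriptive stage label
    data: str   # data source the stage trains on
    goal: str   # what the stage is meant to teach the model

STAGES = [
    Stage("feature_alignment", "image-text pairs",
          "align visual features with the LLM embedding space"),
    Stage("boundary_perception", "multi-event videos with timestamps",
          "make the model sensitive to temporal boundaries"),
    Stage("instruction_tuning", "video instruction dialogues",
          "align answers with human intent and temporal questions"),
]

for i, stage in enumerate(STAGES, start=1):
    # a real run would launch the corresponding training step for each stage here
    print(f"Stage {i}: {stage.name} | data: {stage.data} | goal: {stage.goal}")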
Key Contributions
- Introduction of VTimeLLM: the first boundary-aware video LLM (as of the accompanying paper).
- Novel Training Strategy: A three-stage, boundary-aware training approach that leverages large-scale datasets and dialogue-based instruction data.
- Extensive Evaluation: In experiments, VTimeLLM surpasses existing models on video tasks requiring fine-grained temporal reasoning and understanding.
Installation
Set up VTimeLLM by creating a dedicated conda environment and installing the dependencies:
conda create --name=vtimellm python=3.10
conda activate vtimellm
git clone https://github.com/huangb23/VTimeLLM.git
cd VTimeLLM
pip install -r requirements.txt
For faster training with flash attention, additionally install:
pip install ninja
pip install flash-attn --no-build-isolation
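After installation, a quick sanity check such as the following (illustrative, not part of the repository) confirms that PyTorch sees a GPU and that the optional flash-attn package is importable:

# Environment sanity check: verifies PyTorch/CUDA and the optional flash-attn install.
import importlib.util
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

if importlib.util.find_spec("flash_attn") is not None:
    print("flash-attn is installed (used for the accelerated attention path during training).")
else:
    print("flash-attn not found; it is only needed for the faster attention kernels.")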
Usage and Demonstrations
For running the offline demo or training the model, see the provided documentation: offline_demo.md and train.md.
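As a concrete illustration of working with temporally grounded answers, the snippet below parses a response of the form "from 15 to 32" and converts it to seconds. The fixed 0-99 index scale and the exact answer phrasing are assumptions made here for illustration; consult offline_demo.md for the actual interface.

# Illustrative helper, not from the repository: extracts "from X to Y" boundaries
# from a model answer and converts them to seconds, assuming boundaries are given
# as indices on a fixed 0-99 scale spanning the whole video (an assumption).
import re

def parse_boundaries(answer: str, duration_s: float, num_slots: int = 100):
    """Return (start_s, end_s) parsed from an answer like '... from 15 to 32.'"""
    match = re.search(r"from\s+(\d+)\s+to\s+(\d+)", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    start_idx, end_idx = (int(g) for g in match.groups())
    scale = duration_s / (num_slots - 1)
    return start_idx * scale, end_idx * scale

# Example: a 60-second video and a hypothetical model answer.
print(parse_boundaries("The person pours the coffee from 15 to 32.", duration_s=60.0))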
Evaluation and Analysis
VTimeLLM's performance is showcased through qualitative analysis across the following task categories:
- Video Understanding and Conversational Tasks
- Creative Tasks
- Fine-grained Understanding Tasks
- Video Reasoning Tasks
Each task category is accompanied by visual examples to illustrate VTimeLLM's capabilities.
Acknowledgements
The development of VTimeLLM builds on several open-source projects:
- LLaVA: A large language and vision assistant.
- FastChat: A comprehensive platform for large language model-based chatbots.
- Video-ChatGPT: A project focused on detailed video understanding.
- LLaMA: Open and efficient foundation language models.
- Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning.
- InternVid: A significant video-text dataset.
License and Citation
VTimeLLM is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. Researchers and practitioners using this resource are encouraged to cite it using the provided BibTeX entry.
Looking forward to engaging with the community for feedback and contributions!