LLaVA-NeXT
LLaVA-NeXT advances open multimodal models, most recently in video understanding with the LLaVA-Video-178K dataset. This synthetic dataset is built for video instruction tuning and comprises 178,510 detailed captions alongside extensive question-answer pairs; training on it produces the LLaVA-Video 7B/72B models, which improve performance on video benchmarks. The project spans multi-image, video, and 3D tasks, and emphasizes thorough model evaluation and efficient task transfer.
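As a minimal sketch of how one might inspect the dataset, the snippet below streams a few records with the Hugging Face `datasets` library. The Hub repo id `lmms-lab/LLaVA-Video-178K`, the subset name, and the split name are assumptions based on the public release, not details stated above.

```python
# Minimal sketch: peek at LLaVA-Video-178K via the Hugging Face `datasets`
# library. The repo id, subset, and split below are assumptions about the
# public release, not confirmed by this README.
from datasets import load_dataset

ds = load_dataset(
    "lmms-lab/LLaVA-Video-178K",   # assumed Hub repo id
    "0_30_s_academic_v0_1",        # hypothetical subset (clip length x source)
    split="caption",               # hypothetical split holding the captions
    streaming=True,                # stream so the full dataset is not downloaded
)

# Print a few records to see how captions and QA pairs are laid out.
for example in ds.take(3):
    print(example)
```

Streaming keeps the inspection cheap: only the first few records are fetched, which matters for a dataset of this size.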