Mantis: Multi-Image Instruction Tuning
Mantis is a project designed to address a key limitation of existing multimodal models, which focus primarily on single-image tasks. As a Large Multimodal Model (LMM), Mantis sets itself apart by excelling at multi-image visual language tasks. Unlike models that rely heavily on pre-training with massive noisy datasets, Mantis takes a more efficient approach built around multi-image instruction tuning.
The Problem and Solution
In recent years, there has been significant progress on large multimodal models for single-image tasks, but these models struggle with multi-image visual language tasks. Existing multi-image solutions, such as OpenFlamingo and Emu, depend on extensive pre-training on noisy web data, which is neither efficient nor reliably effective.
Mantis, by contrast, is built on the LLaMA-3 backbone and accepts interleaved text and image inputs. It is instruction-tuned on the Mantis-Instruct dataset with modest, academic-scale compute, achieving strong performance on both multi-image and single-image tasks.
Achievements
Mantis achieves state-of-the-art results across five multi-image benchmarks: NLVR2, Q-Bench, BLINK, MVBench, and Mantis-Eval. At the same time, it maintains single-image performance on par with leading models such as CogVLM and Emu2, showing that its multi-image focus does not come at the cost of single-image competence.
Key Developments
A noteworthy aspect of Mantis is its efficient training process, which takes only 36 hours on 16 A100-40G GPUs. Recent updates include:
- Support for training with the Idefics-3 model.
- Integration with VLMEvalKit for model evaluation.
- Release of training curves for various Mantis models, aiding in experiment reproducibility.
- Release of the state-of-the-art Mantis-8B-Idefics2 model.
Installation and Usage
Setting up Mantis involves creating a new environment and installing necessary packages, including flash-attention. The inference process is straightforward, allowing users to experiment with Mantis through pre-configured scripts and examples.
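As a rough illustration of what multi-image inference can look like, the sketch below loads the Mantis-8B-Idefics2 checkpoint through the generic transformers Vision2Seq interface. The chat-template message format follows the Idefics2 convention and the image file names are placeholders; the repository's own inference scripts remain the authoritative reference.

```python
# Minimal multi-image inference sketch (assumes the Idefics2-based checkpoint
# is compatible with the generic transformers Vision2Seq interface).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "TIGER-Lab/Mantis-8B-Idefics2"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two images interleaved with text in a single prompt.
images = [Image.open("image_1.jpg"), Image.open("image_2.jpg")]  # placeholder paths
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```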
Training and Evaluation
Mantis offers flexible training scripts compatible with different datasets and models. Training proceeds in two stages: the multimodal projector is pre-trained first, followed by fine-tuning on the Mantis-Instruct dataset. The project provides comprehensive training examples and supports several architectures, making the pipeline adaptable to different model backbones.
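Conceptually, the two stages differ mainly in which parameters are trainable. The snippet below is a schematic of that freezing schedule, not the project's actual training code; the attribute names (multi_modal_projector, language_model, vision_tower) are assumptions patterned after common LLaVA-style implementations.

```python
# Schematic of the two-stage schedule, using hypothetical attribute names.
def configure_stage(model, stage: int) -> None:
    """Stage 1: train only the multimodal projector.
    Stage 2: fine-tune the projector together with the language model."""
    for p in model.parameters():
        p.requires_grad = False                  # freeze everything first

    for p in model.multi_modal_projector.parameters():
        p.requires_grad = True                   # projector trains in both stages

    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True               # unfreeze the LLM for instruction tuning
    # In this sketch the vision encoder (model.vision_tower) stays frozen throughout.
```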
Evaluation scripts are also provided, allowing for comprehensive benchmarking of Mantis's capabilities.
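For a sense of what such a benchmark run might look like, the loop below scores exact-match accuracy on Mantis-Eval, reusing the model and processor from the inference sketch above. The column names ("images", "question", "answer") and the scoring rule are assumptions for illustration only; the project's evaluation scripts are the reference implementation.

```python
# Hypothetical exact-match loop over Mantis-Eval (column names are assumed).
from datasets import load_dataset

def exact_match_accuracy(model, processor, split="test"):
    data = load_dataset("TIGER-Lab/Mantis-Eval", split=split)
    correct = 0
    for example in data:
        content = [{"type": "image"}] * len(example["images"])
        content.append({"type": "text", "text": example["question"]})
        prompt = processor.apply_chat_template(
            [{"role": "user", "content": content}], add_generation_prompt=True
        )
        inputs = processor(
            text=prompt, images=example["images"], return_tensors="pt"
        ).to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=32)
        prediction = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        correct += int(example["answer"].strip().lower() in prediction.lower())
    return correct / len(data)
```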
Data and Model Availability
The Mantis project has made available several datasets and models through the Hugging Face platform. This includes:
- The Mantis-Instruct dataset, consisting of 721K multimodal instruction instances (a loading sketch follows this list).
- The Mantis-Eval dataset for evaluating multi-image skills.
- A collection of well-documented models like Mantis-8B-Idefics2 and others, ready for use and experimentation.
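The released data can be pulled directly from the Hub with the datasets library. The sketch below loads one Mantis-Instruct subset; the configuration name "nlvr2" is used for illustration, and the dataset card lists the actual subset and split names.

```python
# Sketch: loading a Mantis-Instruct subset from the Hugging Face Hub.
# The subset name "nlvr2" is illustrative; check the dataset card for the
# full list of configurations and splits.
from datasets import load_dataset

mantis_instruct = load_dataset("TIGER-Lab/Mantis-Instruct", "nlvr2", split="train")
print(mantis_instruct)
print(mantis_instruct[0].keys())
```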
Conclusion
Mantis stands as a significant advancement in the field of multimodal models, particularly excelling in multi-image tasks. Its efficient training, robust performance, and extensive support for various models and architectures make it a valuable tool for researchers and developers working on visual language models. With ongoing updates and community support, Mantis is set to push the boundaries of what multimodal models can achieve in both academic and practical applications.