Project Overview: Mini-Gemini
Mini-Gemini is a research project that explores the potential of multimodal vision language models, spanning understanding, reasoning, and generation from combined visual and textual data. Built on the LLaVA framework, it supports a range of dense and mixture-of-experts (MoE) large language models from 2 billion to 34 billion parameters, and it handles image understanding, reasoning, and content generation within a single framework.
Key Releases
- 05/03: Introduction of models based on LLaMA3, available for exploration on Hugging Face.
- 04/15: Launch of a demo on Hugging Face featuring the MGM 13B-HD model.
- 03/28: Official release of the Mini-Gemini project, along with the research paper, demo, code, models, and datasets.
Demo and Installation
Mini-Gemini offers an online demo for users to test its capabilities. Users can set up the environment by cloning the GitHub repository and installing its dependencies; special instructions are provided for version-specific packages and additional training tools such as ninja and flash-attn.
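After installation, a short script like the one below can confirm that the packages named in the setup notes are importable; this check is illustrative and not part of the project's own setup scripts.

```python
# Illustrative environment check; it only verifies that the packages
# mentioned in the install notes can be imported.
import importlib.util

for pkg in ("torch", "ninja", "flash_attn"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'missing'}")
```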
Model Framework
The Mini-Gemini framework is structured around dual vision encoders that process visual input at two resolutions: one produces low-resolution visual embeddings that act as queries, while the other supplies high-resolution candidate features for reference. Patch info mining bridges the two resolutions, enriching each low-resolution token with fine-grained detail from its corresponding high-resolution region. This setup lets text and image data be integrated for both understanding and generation.
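To make the patch info mining idea concrete, the sketch below shows one plausible way for each low-resolution token to attend over the block of high-resolution features covering the same region. The function name patch_info_mining, the grid/ratio values, and the attention formulation are illustrative assumptions, not the project's actual implementation.

```python
# Minimal PyTorch sketch of patch-level info mining (illustrative only):
# each low-res token queries the high-res features covering its region.
import torch

def patch_info_mining(low_res, high_res, grid=24, ratio=2):
    """low_res: (B, grid*grid, C) embeddings from the low-res encoder (queries).
    high_res: (B, (grid*ratio)**2, C) features from the high-res encoder.
    Each low-res token attends only to its own ratio x ratio high-res block."""
    B, N, C = low_res.shape
    # Regroup the high-res map so every low-res position owns a ratio*ratio block.
    hr = high_res.view(B, grid, ratio, grid, ratio, C)
    hr = hr.permute(0, 1, 3, 2, 4, 5).reshape(B, N, ratio * ratio, C)
    q = low_res.unsqueeze(2)                       # (B, N, 1, C)
    attn = (q @ hr.transpose(-2, -1)) / C ** 0.5   # (B, N, 1, ratio*ratio)
    attn = attn.softmax(dim=-1)
    mined = (attn @ hr).squeeze(2)                 # (B, N, C)
    return low_res + mined                         # fuse mined detail residually

# Example with a 24x24 low-res token grid and a 48x48 high-res feature map.
tokens = patch_info_mining(torch.randn(1, 576, 1024), torch.randn(1, 2304, 1024))
```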
Several models are available, each fine-tuned with specific datasets and schedules. Models vary in size and configuration, offering a range of possibilities for different use cases and research explorations.
Training and Evaluation
Training involves two main stages (a code sketch follows the list):
- Feature Alignment - Bridging the vision and language tokens.
- Instruction Tuning - Preparing the model to follow multimodal instructions.
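One common way to realize such a two-stage recipe is to control which parameters are trainable in each stage. The sketch below assumes a PyTorch-style model with a vision-to-language projector (model.projector) and a language model (model.llm); these attribute names are placeholders rather than the project's actual API.

```python
# Hedged sketch of a two-stage training schedule; attribute names are hypothetical.
def set_stage(model, stage):
    if stage == "feature_alignment":
        # Stage 1: freeze everything except the projector that maps
        # vision tokens into the language model's embedding space.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.projector.parameters():
            p.requires_grad = True
    elif stage == "instruction_tuning":
        # Stage 2: also unfreeze the language model so it learns to
        # follow multimodal instructions.
        for p in model.llm.parameters():
            p.requires_grad = True
        for p in model.projector.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"Unknown stage: {stage}")
```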
The data is organized into MGM-Pretrain, MGM-Finetune, and MGM-Eval subsets. Evaluation is performed across multiple image-based benchmarks, giving a comprehensive picture of the model's capabilities.
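Before launching training or evaluation, it can help to verify that the data folders on disk match the subset names above. The sketch below assumes they sit under a local data/ root, which is an assumption rather than a documented path.

```python
# Illustrative layout check; the "data" root directory is an assumed location.
from pathlib import Path

def check_data_layout(root="data"):
    expected = ("MGM-Pretrain", "MGM-Finetune", "MGM-Eval")
    missing = [name for name in expected if not (Path(root) / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing subsets under {root}: {missing}")
    print("All expected data subsets are present.")
```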
With this groundwork and training pipeline in place, Mini-Gemini can take on complex multimodal tasks and shows promise in real-world applications that combine image and language comprehension.
For those interested in the technical details, datasets and pretrained weights can be downloaded and organized per project guidelines. These resources support the training and evaluation of models within the Mini-Gemini framework.
In summary, Mini-Gemini represents a meaningful step forward for multimodal large language models, providing an open framework for exploring multimodal understanding and generation in depth.