MMMU - Enhancing Multimodal Understanding and Reasoning Across Varied Disciplines

Introduction to the MMMU Project

The MMMU Benchmark

The MMMU Benchmark is an innovative project crafted to test the skills of multimodal models. These are models capable of processing and understanding information from multiple sources such as images, text, and more. The benchmark is designed to evaluate these models on extensive, multi-discipline tasks that require knowledge at the college level and intricate reasoning. It encompasses an impressive collection of 11.5K detailed questions sourced from college exams, quizzes, and textbooks, focusing on six primary disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering. These questions cover 30 subjects and 183 subfields, featuring 32 diverse image types, including diagrams, maps, tables, and more. Unlike previous benchmarks, MMMU emphasizes advanced perception and reasoning using domain-specific knowledge, posing challenges comparable to what human experts face.

Notably, the team tested 14 open-source Large Multimodal Models (LMMs) and the proprietary GPT-4 Vision model, discovering that even advanced models like GPT-4 Vision achieved only 56% accuracy. This indicates a significant opportunity for improvement. The MMMU aims to motivate the research community to develop next-generation multimodal foundational models, moving closer to expert-level Artificial General Intelligence (AGI).

MMMU Benchmark

MMMU-Pro: Enhanced Evaluation

Building upon the MMMU Benchmark, MMMU-Pro offers a more robust framework for assessing the understanding and reasoning capabilities of multimodal models. This enhanced version employs a rigorous three-step process:

Filtering Questions: It removes questions that can be answered with text alone, ensuring the emphasis is on multimodal comprehension.
Augmenting Answer Choices: It introduces more plausible options, making the challenge tougher.
Vision-only Input: Questions are embedded within images, encouraging AI to simultaneously "see" and "read," thereby imitating a fundamental human cognitive skill.

Results show that models perform significantly lower on MMMU-Pro than on MMMU, with accuracy ranging between 16.8% and 26.9%. This version provides an even more challenging evaluation setting, closely resembling real-world scenarios. It offers crucial insights for progressing multimodal AI research.

MMMU-Pro Benchmark

Dataset Creation

Both MMMU and MMMU-Pro were designed with meticulous attention to challenge and evaluate multimodal models. These tasks require a high level of subject knowledge and complex reasoning akin to college-level exams. More detailed dataset information is available on Hugging Face:

Evaluation Process

Detailed resources and instructions for evaluating models using both benchmarks can be found in the respective evaluation sections:

For MMMU, a full suite of questions is available, with a development set for experimentation and a validation set for better refining models. The actual test set's answers are withheld, but predictions can be submitted through EvalAI.

Important Notices

During the preparation of MMMU and MMMU-Pro, strict guidelines were followed to adhere to copyright and licensing rules. If there are any concerns regarding potential violations, users are encouraged to contact the team directly for prompt action.

Contact Information

For queries or more information, interested individuals can reach out to the following contacts:

Xiang Yue: [email protected]
Yu Su: [email protected]
Wenhu Chen: [email protected]

The MMMU project is an ambitious step towards developing sophisticated multimodal models, driving towards a future where AI can achieve expert-level reasoning across various disciplines.