MMMU
MMMU and MMMU-Pro are benchmarks for evaluating multimodal models on tasks that require college-level knowledge and reasoning across diverse disciplines. MMMU comprises 11.5K questions spanning six core disciplines, designed to test a model's ability to integrate visual and textual information. MMMU-Pro raises the bar further with a vision-only input setting and an expanded set of candidate options. Despite the challenging nature of these tasks, even advanced models such as GPT-4V achieve only moderate accuracy, indicating substantial room for improvement on the path toward expert AGI. Detailed evaluations are hosted on EvalAI, and the datasets are available on Hugging Face.
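As a quick illustration of getting started, the sketch below loads one MMMU subject split with the Hugging Face `datasets` library. The dataset ID `MMMU/MMMU`, the `Accounting` subject config, and the `validation` split name are assumptions based on the public hub listing; consult the dataset card for the exact identifiers.

```python
# Minimal sketch: load one MMMU subject split via the Hugging Face `datasets` library.
# Assumes the hub dataset ID "MMMU/MMMU" with per-subject configs (e.g., "Accounting")
# and a "validation" split; check the dataset card for the exact names.
from datasets import load_dataset

dataset = load_dataset("MMMU/MMMU", "Accounting", split="validation")

# Inspect one example to see the available fields
# (question text, answer options, associated images, etc.).
example = dataset[0]
print(example.keys())
```

From there, each record can be formatted into a multimodal prompt (question plus images and candidate options) for the model under evaluation.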