Multi-Modality Arena: Exploring the Future of Multimodal Models
The Multi-Modality Arena is an evaluation platform for large multimodal models. Inspired by FastChat, it lets two anonymous models be compared side by side on visual question-answering tasks, providing a dedicated space for measuring progress in vision-language AI. A public demo of the platform is available online and invites global participation.
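Arena-style, side-by-side comparisons are typically aggregated into a ranking from pairwise preferences. The sketch below shows one common way to do this, an Elo-style update over recorded battles; it is a generic illustration under that assumption, not necessarily the Arena's own aggregation scheme.

```python
from collections import defaultdict

def elo_ratings(battles, k=32, base=1000.0):
    """Compute Elo-style ratings from (model_a, model_b, winner) records.

    `winner` is "a", "b", or "tie". This is a generic illustration of how
    anonymous pairwise comparisons can be turned into a ranking, not the
    Arena's official implementation.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Example with made-up battles between two hypothetical models:
print(elo_ratings([("model_x", "model_y", "a"),
                   ("model_x", "model_y", "tie"),
                   ("model_y", "model_x", "a")]))
```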
Comprehensive Evaluation of Multimodal Models
The platform provides a broad evaluation through benchmarks and datasets that cover distinct capabilities and application areas:
OmniMedVQA: Benchmark for Medical Models
- OmniMedVQA Dataset: A large-scale collection of 118,010 images paired with 127,995 question-answer items, spanning 12 imaging modalities and more than 20 human anatomical regions, making it well suited for evaluating medical vision-language models (a loading sketch follows this list).
- Model Diversity: Evaluations cover 12 models: 8 general-purpose LVLMs and 4 medical-specialized LVLMs.
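As referenced above, here is a minimal sketch of reading OmniMedVQA-style question-answer items and turning one into a multiple-choice prompt. The field names (`image_path`, `question`, `options`) and the annotation file path are assumptions for illustration, not the dataset's documented schema.

```python
import json
from pathlib import Path

def load_items(annotation_file):
    """Yield QA items from a JSON annotation file (assumed to be a list of dicts)."""
    with open(annotation_file, "r", encoding="utf-8") as f:
        for item in json.load(f):
            yield item

def to_prompt(item):
    """Format one item as a multiple-choice VQA prompt.

    The field names used here are hypothetical; adapt them to the actual annotations.
    """
    options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item["options"]))
    return (f"Question: {item['question']}\n{options}\n"
            f"Answer with the letter of the correct option.")

# Usage sketch (paths are placeholders):
# for item in load_items("OmniMedVQA/annotations.json"):
#     image = Path(item["image_path"])
#     prompt = to_prompt(item)
```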
Tiny LVLM-eHub: Early Experimentation
- Tiny Datasets: Each of the 42 text-related visual benchmarks is subsampled to 50 items, for 2,100 samples in total, keeping experiments quick and inexpensive.
- Model Evaluation: Adds 4 models beyond the original LVLM-eHub lineup, including Google Bard, for a total of 12. Answers are scored with ChatGPT Ensemble Evaluation, which aligns more closely with human assessments than previous methods (a sketch follows this list).
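A rough sketch of the ensemble-judging idea: an LLM is asked several times whether a model's free-form answer matches the ground truth, and the majority verdict is taken as the score. The `judge_once` function is a placeholder for an actual ChatGPT API request; the prompt wording and vote count are assumptions, not the exact protocol used by tiny LVLM-eHub.

```python
from collections import Counter

def judge_once(question, ground_truth, model_answer):
    """Placeholder for one LLM judgment returning "yes" or "no".

    In practice this would send a grading prompt to the ChatGPT API and
    parse the reply; it is stubbed out here for illustration.
    """
    raise NotImplementedError("wire this to your LLM-judge API")

def ensemble_score(question, ground_truth, model_answer, n_votes=5):
    """Majority vote over several independent judgments (hedged sketch)."""
    votes = Counter(judge_once(question, ground_truth, model_answer)
                    for _ in range(n_votes))
    return 1.0 if votes["yes"] > votes["no"] else 0.0
```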
LVLM-eHub: Core Benchmark for Large Vision-Language Models
The LVLM-eHub is the core evaluation benchmark for large vision-language models. It assesses 8 LVLMs across 6 categories of multimodal capability using 47 datasets plus one online arena platform, allowing for robust, quantitative comparisons.
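To make the benchmark's shape concrete, below is a minimal sketch of aggregating per-capability accuracy across datasets. The `model.answer(image, prompt)` interface, the exact-match scoring, and the grouping of samples by capability are illustrative assumptions, not the repository's actual API.

```python
from collections import defaultdict

def evaluate(model, benchmarks):
    """Aggregate accuracy per capability category.

    `benchmarks` maps a capability name (e.g. "visual reasoning") to a list
    of (prompt, image, ground_truth) triples; `model.answer` is an assumed
    interface for illustration only.
    """
    scores = defaultdict(list)
    for capability, samples in benchmarks.items():
        for prompt, image, ground_truth in samples:
            prediction = model.answer(image, prompt)
            scores[capability].append(float(prediction.strip().lower()
                                            == ground_truth.strip().lower()))
    return {cap: sum(s) / len(s) for cap, s in scores.items() if s}
```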
LVLM Leaderboard: Tracking Performance
The LVLM Leaderboard ranks models by their performance across the benchmark's datasets and capability categories, such as visual reasoning and visual commonsense. Featured models range from InternVL to Otter, making it easier to compare strengths and select a model from a comprehensive list of candidates.
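A leaderboard of this kind can be produced by averaging each model's per-dataset scores and sorting; the sketch below is a generic illustration of that idea with made-up numbers, not the repository's ranking code.

```python
def leaderboard(results):
    """Rank models by mean score across datasets.

    `results` maps model name -> {dataset name: score}; dataset weighting
    and tie-breaking are ignored here for simplicity.
    """
    ranked = sorted(results.items(),
                    key=lambda kv: sum(kv[1].values()) / len(kv[1]),
                    reverse=True)
    return [(name, round(sum(scores.values()) / len(scores), 3))
            for name, scores in ranked]

# Example with invented scores for two of the featured models:
print(leaderboard({"InternVL": {"d1": 0.81, "d2": 0.77},
                   "Otter":    {"d1": 0.64, "d2": 0.70}}))
```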
Regular Updates and Community Engagement
Multi-Modality Arena is updated regularly, with recent additions such as the OmniMedVQA and tiny LVLM-eHub resources. The platform encourages community contributions, feedback, and engagement, supporting the continued advancement of multimodal evaluation.
Launching the Demo Platform
The system setup involves a straightforward process using widely available software tools:
- Set up the Environment: Install and configure a Conda environment with necessary packages.
- Start Services: Run the provided Python scripts to launch the controller, the model workers, and the Gradio web server (a launch sketch follows this list).
- Interact with Models: Once the services are running, developers and researchers can access and test models through the web interface.
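As mentioned above, here is a sketch of launching the three services from Python via `subprocess`. The script names (`controller.py`, `model_worker.py`, `gradio_web_server.py`) follow the FastChat-style layout and are assumptions; check the repository's README for the exact entry points and flags.

```python
import subprocess

# Hypothetical script names modeled on a FastChat-style layout; the
# repository's actual entry points and flags may differ.
services = [
    ["python", "controller.py"],                                  # routes requests to workers
    ["python", "model_worker.py", "--model-name", "your-lvlm"],   # hosts one model
    ["python", "gradio_web_server.py"],                           # serves the web UI
]

processes = [subprocess.Popen(cmd) for cmd in services]

try:
    for proc in processes:
        proc.wait()
except KeyboardInterrupt:
    # Shut all services down together on Ctrl-C.
    for proc in processes:
        proc.terminate()
```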
Contribution and Collaboration
Users are encouraged to improve LVLM evaluation quality by working with the available datasets and contributing to the platform's growth. Instructions are provided for integrating new models via a tester script (a hypothetical wrapper sketch follows), and users are invited to share findings or model inference APIs for further development.
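Integrating a new model typically means exposing a small, uniform inference interface that a tester script can call. The class below is a hypothetical example of such a wrapper; the method names and signature are assumptions for illustration, not the repository's required API.

```python
class MyLVLMWrapper:
    """Hypothetical wrapper exposing a single-call inference interface.

    Match whatever interface the repository's tester script actually
    expects; this sketch only illustrates the general shape.
    """

    def __init__(self, checkpoint_path, device="cuda"):
        self.device = device
        self.model = self._load(checkpoint_path)

    def _load(self, checkpoint_path):
        # Load weights, tokenizer, and image processor here.
        raise NotImplementedError

    def generate(self, image, question, max_new_tokens=128):
        """Return the model's answer for one image-question pair."""
        raise NotImplementedError
```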
Supported Models and Contact
The platform supports a variety of multimodal models from different research institutes and companies, enabling a broad range of experiments. Users can join discussions and collaborations via WeChat, making the project a shared hub for AI researchers.
Acknowledgements and Usage
The project recognizes the contributions of ChatBot Arena and LVLM providers and emphasizes its role as a non-commercial research tool. It sets guidelines for ethical use, ensuring content generation remains appropriate and legal.
The Multi-Modality Arena stands as a pivotal platform for evaluating and advancing multimodal model capabilities, continuing to evolve alongside community and technological contributions.