Introduction to EvalScope
EvalScope is an official model evaluation and performance benchmarking framework introduced by the ModelScope community. It is designed to streamline the process of assessing the capabilities of various models, including large language models (LLMs), multimodal models, embedding models, and rerankers. By providing built-in benchmarks, EvalScope ensures that users can easily evaluate models against popular datasets like MMLU, CMMLU, C-Eval, GSM8K, ARC, and more.
EvalScope handles a range of evaluation scenarios, including end-to-end Retrieval-Augmented Generation (RAG) evaluation, performance stress testing, and an arena mode for pairwise model comparisons. Through its integration with the ms-swift training framework, users can move from training to evaluation with a single click.
Key Features and Components
EvalScope is built upon several core modules that work cohesively to deliver a comprehensive evaluation experience:
- Model Adapter: Converts outputs from different models into the required format, supporting both models accessed via APIs and models running locally.
- Data Adapter: Prepares and converts input data to meet various evaluation requirements.
- Evaluation Backend (a configuration sketch follows this list):
  - Native: EvalScope's built-in evaluation framework, supporting modes such as single-model evaluation and arena mode.
  - OpenCompass: A framework that simplifies the submission of evaluation tasks.
  - VLMEvalKit: Enables multi-modal evaluation tasks to be launched with minimal setup.
  - RAGEval: Handles RAG evaluations for embedding models and rerankers.
  - Third-party: Supports additional tools, such as ToolBench.
- Performance Evaluator: Assesses model inference performance, including stress testing and performance visualization.
- Evaluation Report: Summarizes performance outcomes to support decision-making and model refinement.
- Visualization: Presents evaluation results clearly so users can analyze and compare model performance.
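In practice, the backend is selected through the evaluation task configuration. The following is a minimal sketch, assuming the run_task entry point and the eval_backend / eval_config parameters; the exact schema of eval_config is backend-specific and should be checked against the installed version.

# Minimal sketch of backend selection; the field names below are assumptions
from evalscope import run_task

task_cfg = {
    'eval_backend': 'OpenCompass',        # or 'Native', 'VLMEvalKit', 'RAGEval'
    'eval_config': {                      # backend-specific settings
        'datasets': ['gsm8k'],            # benchmarks resolved by the chosen backend
        'models': [{'path': 'my-model'}], # placeholder model entry
    },
}
run_task(task_cfg=task_cfg)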
Installation and Getting Started
Installation
EvalScope can be installed via pip, with options to include various backends:
pip install evalscope # Default installation
pip install evalscope[opencompass] # With OpenCompass support
pip install evalscope[vlmeval] # With VLMEvalKit support
Users can also directly clone the repository and install from source if preferred.
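A source installation typically follows the usual clone-and-install pattern; the repository is hosted under the modelscope organization on GitHub:

git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e .                     # editable install; add extras as needed, e.g. pip install -e '.[opencompass]'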
Quick Start
To get started, users can run a simple evaluation with just a few commands, either from a pip installation or from the source directory. Parameters such as the model and the datasets can be specified to tailor the evaluation; a minimal example follows.
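As a minimal sketch (the model identifier and datasets are placeholders, and argument names may differ slightly across EvalScope versions), an evaluation can be launched from the command line or from Python:

# Command-line sketch: evaluate a model on built-in benchmarks
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k arc --limit 10

# Equivalent Python sketch using the assumed TaskConfig / run_task interface
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # placeholder model ID
    datasets=['gsm8k', 'arc'],           # built-in benchmarks
    limit=10,                            # small sample count for a quick smoke test
)
run_task(task_cfg=task_cfg)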
Advanced Features
EvalScope also offers advanced functionality: fine-grained parameter customization for tailored evaluations, an offline mode for environments without internet access, and an arena mode for pairwise model comparisons. In addition, it provides dedicated tooling for model-serving performance evaluation, stress-testing large language models behind an inference endpoint; a sketch is shown below.
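For serving performance, EvalScope ships a perf command that sends concurrent requests to an OpenAI-compatible endpoint for stress testing. The sketch below is illustrative; the URL, model name, and request counts are placeholders, and flag names should be verified against the installed version.

# Sketch: stress-test an OpenAI-compatible serving endpoint (values are placeholders)
evalscope perf \
  --url http://127.0.0.1:8000/v1/chat/completions \
  --api openai \
  --model my-served-model \
  --parallel 8 \
  --number 100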
Community and Documentation
EvalScope’s documentation is kept up to date and regularly adds news and technical discussions. The project roadmap includes goals such as multi-modal evaluation, distributed evaluation, and additional benchmarks, so its capabilities continue to expand.
To support the community, a leaderboard is available to benchmark models, providing researchers and developers with insights into the performance of different models on various tasks.
EvalScope is an evolving framework poised to remain a valuable tool for model evaluation and benchmarking within the AI community.