Project Introduction: PromptBench
PromptBench is a powerful, user-friendly Python package for evaluating large language models (LLMs). Built on PyTorch, it offers a unified and efficient workflow for researchers who want to assess and understand the behavior of LLMs, providing nuanced insight through a range of evaluation techniques and robustness checks.
Key Features
1. Quick Model Assessment:
PromptBench enables users to quickly assess model performance through an intuitive interface covering model setup, data loading, and evaluation (see the sketch after this list).
2. Prompt Engineering Techniques:
Several prompt engineering methods are incorporated into PromptBench. These include innovative approaches like Few-shot Chain-of-Thought, Emotion Prompt, and Expert Prompting, ensuring a diverse range of evaluation techniques.
3. Adversarial Prompt Evaluation:
With the integration of adversarial prompt attacks, researchers can simulate challenging scenarios to test model robustness against real-world adversarial threats.
4. Dynamic Evaluation:
The dynamic evaluation framework, DyVal, allows for on-the-fly sample generation with controlled complexity, thereby reducing potential test data contamination.
5. Efficient Multi-Prompt Evaluation:
Through the efficient multi-prompt evaluation method, PromptEval, PromptBench fits a predictive model that estimates LLM performance on prompt-example combinations it has not yet evaluated, using only a limited number of actual evaluations—a significant saving in evaluation cost.
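To make the quick-assessment workflow concrete, the following minimal sketch walks through loading a dataset, loading a model, and scoring one prompt. It assumes the DatasetLoader, LLMModel, Prompt, InputProcess, OutputProcess, and Eval helpers described in the PromptBench documentation; exact names, arguments, and defaults may differ between releases, so treat it as an illustration rather than a verbatim recipe.

import promptbench as pb

# Load a sentiment classification dataset (SST-2) and an open-source model.
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# One or more prompt templates; {content} is replaced by each example's text.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])

# Map the model's free-text answer back to the dataset's integer labels.
def proj_func(pred):
    mapping = {"positive": 1, "negative": 0}
    return mapping.get(pred, -1)

for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        input_text = pb.InputProcess.basic_format(prompt, data)
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, proj_func))
        labels.append(data["label"])

    # Accuracy of this prompt over the whole dataset.
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}  {prompt}")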
Installation Guide
Install via pip: To quickly get started, PromptBench can be installed using pip with the following command:
pip install promptbench
Install via GitHub: For users interested in utilizing the latest features, PromptBench can be installed directly from GitHub:
git clone git@github.com:microsoft/promptbench.git
cd promptbench
conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt
Usage Overview
PromptBench is designed to be both easy to use and easy to extend. Researchers can efficiently build evaluation pipelines from existing models and datasets, or develop new components. The package ships with tutorials for common tasks, including evaluating models, testing different prompts, and examining prompt robustness; an example of the latter is sketched below.
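The sketch runs an adversarial prompt attack against a fixed prompt. It is based on the Attack class under promptbench.prompt_attack shown in the project's example notebooks; the constructor's argument order, the attack name, and the eval_func/proj_func conventions are assumptions that may not match every release.

import promptbench as pb
from promptbench.prompt_attack import Attack

model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10)
dataset = pb.DatasetLoader.load_dataset("sst2")
prompt = "As a sentiment classifier, answer 'positive' or 'negative': {content}"

def proj_func(pred):
    # Map free-text answers to the dataset's integer labels.
    return {"positive": 1, "negative": 0}.get(pred, -1)

def eval_func(prompt, dataset, model):
    # Called by the attack to score each perturbed prompt.
    preds, labels = [], []
    for d in dataset:
        raw = model(pb.InputProcess.basic_format(prompt, d))
        preds.append(pb.OutputProcess.cls(raw, proj_func))
        labels.append(d["label"])
    return pb.Eval.compute_cls_accuracy(preds, labels)

# Words the attack is not allowed to perturb (the labels and the {content} placeholder).
unmodifiable_words = ["positive'", "negative'", "content"]

attack = Attack(model, "stresstest", dataset, prompt, eval_func, unmodifiable_words, verbose=True)
print(attack.attack())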
Supported Components
Datasets:
PromptBench supports a variety of datasets including language datasets like GLUE and MMLU, as well as multi-modal datasets such as VQAv2 and ScienceQA.
Models:
A comprehensive selection of open-source and proprietary models is supported, ranging from google/flan-t5-large to GPT-4, along with multi-modal models such as BLIP2.
Prompt Engineering:
PromptBench includes methods like Chain-of-Thought, Emotion Prompt, and Expert Prompting, catering to diverse research needs.
Adversarial Attacks:
The package features several attack types, from character-level to semantic-level perturbations, enabling systematic robustness testing.
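The precise lists of supported datasets, models, and attacks depend on the installed release. Recent versions expose the dataset and model registries as module-level constants, which the short check below assumes (the attribute names are taken from the project README and may change):

import promptbench as pb

# Print everything the installed version can load out of the box.
print("Supported datasets:", pb.SUPPORTED_DATASETS)
print("Supported models:", pb.SUPPORTED_MODELS)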
Benchmark Results
A thorough evaluation of model performance is available on the PromptBench benchmark website, with results covering adversarial prompt robustness and other evaluation settings.
Community and Contributions
PromptBench is open to community contributions and improvements. It follows the Microsoft Open Source Code of Conduct and encourages developers to share enhancements and suggestions through pull requests and issue tracking.
In summary, PromptBench serves as a valuable resource for researchers and developers seeking to evaluate and understand large language models comprehensively. Its extensive features and flexible design make it an excellent choice for both existing and new projects.