CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
CharacterEval is a Chinese benchmark for evaluating Role-Playing Conversational Agents (RPCAs) in Chinese dialogue. It contains 1,785 multi-turn conversations derived from 77 characters in Chinese novels and screenplays, for a total of 23,020 examples. Each character is profiled with information sourced from Baidu Baike, so the dialogues are grounded in detailed, authentic personas.
The evaluation framework is organized into four dimensions, which are further divided into thirteen specific metrics, so each conversational agent's role-playing ability is assessed from multiple angles rather than with a single score.
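For orientation, here is a minimal sketch of loading and inspecting the released examples. The file path and field names used below are illustrative assumptions, not the repository's actual schema; check the data files in the repo for the real layout.

import json

# Load the examples from a JSON Lines file. The path and the "role" field name
# are hypothetical placeholders for illustration only.
examples = []
with open("data/test_data.jsonl", encoding="utf-8") as f:
    for line in f:
        examples.append(json.loads(line))

characters = {ex["role"] for ex in examples}
print(f"{len(examples)} examples covering {len(characters)} characters")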
Latest News
On May 31, 2024, the team released a manual-annotation guideline titled "Predefined Annotated Examples of CharacterEval.pdf." It describes the evaluation dimensions and provides two exemplar cases, one receiving the top score and one receiving the bottom score. These examples serve as guidelines for the human annotators and are also used as in-context examples when prompting GPT-4. In total, 12 annotators were recruited and split into two groups; each group scored the conversations independently to check consistency, and any scoring discrepancies were reconciled through discussion.
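The consistency check between the two annotator groups can be illustrated with a short sketch. The score layout and metric names below are hypothetical placeholders; the real annotations follow the guideline PDF above.

# Flag (conversation, metric) pairs where the two annotator groups disagree,
# so they can be reconciled through discussion. Data layout is hypothetical.
group_a = {("conv_001", "fluency"): 4, ("conv_001", "consistency"): 5}
group_b = {("conv_001", "fluency"): 4, ("conv_001", "consistency"): 3}

disagreements = [key for key in group_a if group_b.get(key) != group_a[key]]
for conv_id, metric in disagreements:
    print(f"Discuss {conv_id} / {metric}: "
          f"{group_a[(conv_id, metric)]} vs {group_b[(conv_id, metric)]}")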
Installation and Evaluation Process
Setting up CharacterEval is straightforward. Start by installing the necessary dependencies:
pip install -r requirements.txt
Generating Responses
To generate responses, the get_response.py script uses ChatGLM3 to produce replies from the provided dialogue context and character profiles. Adapt the script to the input format of the model you want to evaluate:
CUDA_VISIBLE_DEVICES=0 python get_response.py
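If you are plugging in a different model, the sketch below shows the general shape of such a generation loop with Hugging Face transformers. The model path, prompt template, and function name are illustrative assumptions; mirror the exact input format used in get_response.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model path; replace with the chat model you want to evaluate.
model_path = "path/to/your-chat-model"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

@torch.no_grad()
def generate_reply(profile: str, context: str) -> str:
    # Assumed prompt template: character profile followed by the dialogue so far.
    prompt = f"角色设定：{profile}\n对话历史：{context}\n回复："
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)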
Transforming the Format
CharacterEval uses sparse evaluation: each example is annotated on only a subset of the subjective metrics, which yields more fine-grained judgments. The generated responses therefore need to be converted into this format before they can be scored by the reward model:
python transform_format.py
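Conceptually, the transform expands each generated response into one record per metric annotated for that example. The sketch below illustrates this under assumed file and field names; transform_format.py defines the actual schema.

import json

# Expand every generated response into one (context, response, metric) record,
# since each example is only annotated on a subset of the thirteen metrics.
# File and field names are hypothetical placeholders.
rows = []
with open("generated_responses.jsonl", encoding="utf-8") as f:
    for line in f:
        ex = json.loads(line)
        for metric in ex["metrics"]:
            rows.append({"context": ex["context"],
                         "response": ex["response"],
                         "metric": metric})

with open("char_rm_input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")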
Running CharacterRM
Download BaichuanCharRM, the Character Reward Model, and use it to score the generated responses:
CUDA_VISIBLE_DEVICES=0 python run_char_rm.py
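As a rough illustration of what the reward-model pass does, the sketch below scores a single (context, response, metric) triple. The regression-head interface and input format here are assumptions, not the model's documented API; run_char_rm.py contains the actual loading and scoring code for BaichuanCharRM.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Local path to the downloaded BaichuanCharRM checkpoint (hypothetical layout).
rm_path = "path/to/BaichuanCharRM"
tokenizer = AutoTokenizer.from_pretrained(rm_path, trust_remote_code=True)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_path, num_labels=1, trust_remote_code=True
).cuda().eval()

@torch.no_grad()
def score(context: str, response: str, metric: str) -> float:
    # Assumed input format: metric name, dialogue context, and candidate reply.
    text = f"指标：{metric}\n上下文：{context}\n回复：{response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(reward_model.device)
    return reward_model(**inputs).logits.squeeze().item()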
Computing Evaluation Scores
Finally, compute the average evaluation scores for each metric:
python compute_score.py
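The aggregation itself is simple: average the reward-model score for each metric over all evaluated examples. The sketch below shows the idea with assumed file and field names; compute_score.py is the authoritative implementation.

import json
from collections import defaultdict

# Average scores per metric. File and field names are hypothetical placeholders.
totals, counts = defaultdict(float), defaultdict(int)
with open("char_rm_scores.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        totals[row["metric"]] += row["score"]
        counts[row["metric"]] += 1

for metric in sorted(totals):
    print(f"{metric}: {totals[metric] / counts[metric]:.3f}")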
Intermediate Results and Models
For those interested in reproducing the results, intermediate outputs for several open-source models are available under the results/ directory, including ChatGLM-6B, Baichuan-7B-Chat, XVERSE-7B-Chat, InternLM-7B-Chat, and Qwen-7B-Chat. All model checkpoints are accessible via Hugging Face.
CharacterEval offers a multifaceted evaluation framework rooted in rich cultural content; its detailed character profiles and fine-grained metrics make it a solid foundation for developing and improving RPCAs.