Chinese benchmark for role-playing conversational agent evaluation
CharacterEval is a benchmark dataset and evaluation framework for Chinese role-playing conversational agents (RPCAs). It addresses the need for specialized evaluation in Chinese RP scenarios, offering a comprehensive dataset of dialogues, character profiles, and a novel reward model for assessing agent performance. The target audience includes researchers and developers working on large language models and conversational AI, aiming to improve the quality and character consistency of RPCAs.
How It Works
CharacterEval utilizes a dataset of 1,785 multi-turn dialogues and 23,020 examples across 77 characters from Chinese literature and scripts. It incorporates detailed character profiles from Baidu Baike and employs a multifaceted evaluation approach with thirteen targeted metrics across four dimensions. A key innovation is the CharacterRM, a character-based reward model trained on manual annotations, which reportedly achieves higher correlation with human judgment than GPT-4 for evaluating RP agent responses.
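To make the per-turn evaluation setup concrete, here is a minimal Python sketch that splits a multi-turn dialogue record into context/response examples and scores a candidate reply with a stub reward function. The record schema (character, profile, turns) and the score_with_reward_model stub are illustrative assumptions, not the repository's actual data format or the CharacterRM interface.

# Minimal sketch: turning a multi-turn role-playing dialogue into per-turn
# evaluation examples, then scoring a candidate response. The dialogue schema
# and the reward-model call are assumed for illustration; consult the
# repository for the real data format.
from typing import Dict, List


def build_turn_examples(dialogue: Dict) -> List[Dict]:
    """Split one dialogue into (context, reference response) examples."""
    examples = []
    turns = dialogue["turns"]
    for i, turn in enumerate(turns):
        if turn["speaker"] != dialogue["character"]:
            continue  # only the role-played character's turns are scored
        examples.append({
            "character": dialogue["character"],
            "profile": dialogue["profile"],
            "context": turns[:i],          # preceding turns as context
            "reference": turn["utterance"],
        })
    return examples


def score_with_reward_model(example: Dict, candidate: str) -> float:
    """Placeholder for a character-based reward model such as CharacterRM."""
    # A real implementation would encode profile + context + candidate and
    # return a learned score; a constant stands in here.
    return 0.0


if __name__ == "__main__":
    dialogue = {
        "character": "Sun Wukong",
        "profile": "The Monkey King, proud and unruly.",
        "turns": [
            {"speaker": "Tang Sanzang", "utterance": "Wukong, are there demons ahead?"},
            {"speaker": "Sun Wukong", "utterance": "Rest easy, Master, I will be back in a flash!"},
        ],
    }
    for ex in build_turn_examples(dialogue):
        print(score_with_reward_model(ex, "Let me scout the road ahead."))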
Quick Start & Requirements
pip install -r requirements.txt
export CUDA_VISIBLE_DEVICES=0  # select the GPU before running the provided scripts
A Python environment is required; the dependencies are listed in requirements.txt.
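As a rough sketch of a typical workflow after installation, the Python below reads benchmark examples from a hypothetical test_data.jsonl, generates a reply for each context with a placeholder model, and writes the outputs to responses.jsonl for later scoring with the repository's scripts. The file names and field names here are assumptions, not part of the repository's documented interface.

# Hypothetical post-install workflow: read benchmark contexts, generate a
# response for each with your own model, and save them for scoring.
import json
import os


def generate(context: str) -> str:
    # Stand-in for your role-playing agent; replace with a real model call.
    return "(placeholder model reply)"


def main(in_path: str = "test_data.jsonl", out_path: str = "responses.jsonl") -> None:
    if not os.path.exists(in_path):
        print(f"{in_path} not found; point in_path at the benchmark's test file.")
        return
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            example = json.loads(line)
            example["model_output"] = generate(example["context"])
            fout.write(json.dumps(example, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    main()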
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify a license for the dataset or code, which may limit commercial use. The evaluation workflow requires specific data transformations and the provided scripts, which adds some integration effort.