CharacterEval by morecry

Chinese benchmark for role-playing conversational agent evaluation

created 1 year ago
258 stars

Top 98.6% on sourcepulse

Project Summary

CharacterEval is a benchmark dataset and evaluation framework for Chinese role-playing conversational agents (RPCAs). It provides multi-turn dialogues, character profiles, and a dedicated reward model for scoring agent responses in Chinese role-play (RP) scenarios. It is aimed at researchers and developers working on large language models and conversational AI who want to measure and improve the response quality and character consistency of RPCAs.

How It Works

CharacterEval uses a dataset of 1,785 multi-turn dialogues and 23,020 examples covering 77 characters drawn from Chinese literature and scripts. Character profiles are taken from Baidu Baike. Evaluation uses thirteen targeted metrics grouped into four dimensions. A key component is CharacterRM, a character-based reward model trained on manual annotations, which reportedly correlates more closely with human judgment than GPT-4 does when scoring RP agent responses.
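The claim about correlation with human judgment can be made concrete with a small sketch. The example below computes a Pearson correlation between human ratings and the scores of two automatic judges. Pearson is used here as one common choice and is an assumption, as are the function name `pearson`, the judge labels, and all the numbers, which are made up for illustration and are not benchmark results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five responses (illustrative only, not from the benchmark).
human_scores = [4.0, 2.0, 5.0, 3.0, 1.0]
judge_a_scores = [3.8, 2.5, 4.6, 3.1, 1.4]  # tracks the human ratings closely
judge_b_scores = [3.0, 3.5, 4.0, 2.0, 2.5]  # tracks them only loosely

print("judge A vs human:", pearson(human_scores, judge_a_scores))
print("judge B vs human:", pearson(human_scores, judge_b_scores))
```

A judge whose scores rise and fall with the human ratings gets a coefficient near 1, which is the sense in which one evaluator can be said to agree with annotators better than another.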

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python and a CUDA-capable GPU (implied by the use of CUDA_VISIBLE_DEVICES=0).
  • Resources: Requires downloading model checkpoints from Hugging Face. Intermediate results for five open-source models are provided.
  • Links: Paper: https://arxiv.org/abs/2401.01275

Highlighted Details

  • Dataset includes 1,785 dialogues and 23,020 examples across 77 characters.
  • CharacterRM reward model reportedly correlates more closely with human evaluation than GPT-4 does.
  • Evaluation uses thirteen metrics across four dimensions.
  • Provides scripts for response generation, format transformation, reward model evaluation, and score computation.
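
As a rough illustration of the score computation step, per-metric scores can be averaged within each dimension. This is a minimal sketch, not the repository's script: the metric and dimension names are placeholders (the actual thirteen metrics and four dimensions are defined by the benchmark), and the function `dimension_averages` and the unweighted mean are assumptions.

```python
from collections import defaultdict

# Placeholder mapping from metric name to dimension name.
# These names are hypothetical; the real benchmark defines its own.
METRIC_TO_DIMENSION = {
    "metric_a1": "dimension_a",
    "metric_a2": "dimension_a",
    "metric_b1": "dimension_b",
}


def dimension_averages(metric_scores, metric_to_dimension):
    """Average the per-metric scores that belong to the same dimension."""
    grouped = defaultdict(list)
    for metric, score in metric_scores.items():
        grouped[metric_to_dimension[metric]].append(score)
    # Unweighted mean per dimension (an assumption for this sketch).
    return {dim: sum(vals) / len(vals) for dim, vals in grouped.items()}


# Made-up scores for one model, for illustration only.
scores = {"metric_a1": 3.0, "metric_a2": 4.0, "metric_b1": 2.5}
print(dimension_averages(scores, METRIC_TO_DIMENSION))
```

Keeping the metric-to-dimension mapping in a single dictionary means the grouping can be changed without touching the aggregation logic.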

Maintenance & Community

  • Recent updates include uploading reward model training data and publishing manual annotation guidelines.
  • No explicit community links (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The provided code and data should be checked for licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

Because the README states no license for the dataset or code, the terms for commercial use are unclear. Evaluation depends on the provided scripts and on transforming data into the format they expect, so integrating the benchmark into an existing pipeline takes some extra work.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 24 stars in the last 90 days
