CharacterEval by morecry

Chinese benchmark for role-playing conversational agent evaluation

Created 2 years ago

286 stars

Top 91.8% on SourcePulse

Project Summary

CharacterEval is a benchmark dataset and evaluation framework for Chinese role-playing conversational agents (RPCAs). It addresses the need for specialized evaluation in Chinese role-playing scenarios by providing a dataset of dialogues, character profiles, and a reward model for scoring agent performance. It is aimed at researchers and developers working on large language models and conversational AI who want to improve the quality and character consistency of RPCAs.

How It Works

CharacterEval is built on a dataset of 1,785 multi-turn dialogues comprising 23,020 examples, covering 77 characters drawn from Chinese literature and scripts. It pairs the dialogues with detailed character profiles from Baidu Baike and evaluates agents with thirteen targeted metrics grouped into four dimensions. A key contribution is CharacterRM, a character-based reward model trained on manual annotations, which reportedly correlates more closely with human judgment than GPT-4 when scoring RPCA responses.
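Agreement between an automatic judge and human annotators is commonly summarized with a correlation coefficient. The following is a minimal sketch of that comparison, assuming Pearson correlation; the function name `pearson` and all score values are made up for illustration and are not taken from CharacterEval or its paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five responses (illustrative only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
judge_a_scores = [1.2, 1.9, 3.1, 4.2, 4.8]  # tracks the human scores closely
judge_b_scores = [3.0, 2.0, 4.0, 3.0, 4.0]  # tracks them loosely

r_a = pearson(human_scores, judge_a_scores)
r_b = pearson(human_scores, judge_b_scores)
print(f"judge A vs human: {r_a:.3f}")
print(f"judge B vs human: {r_b:.3f}")
```

Under this measure, the judge with the higher coefficient is the one whose scores move more consistently with the human annotations.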

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: Python and a CUDA-capable GPU (implied by the use of CUDA_VISIBLE_DEVICES=0).
Resources: Requires downloading model checkpoints from Hugging Face. Intermediate results for five open-source models are provided.
Links: Paper: https://arxiv.org/abs/2401.01275

Highlighted Details

Dataset includes 1,785 dialogues and 23,020 examples across 77 characters.
The CharacterRM reward model reportedly correlates more closely with human evaluation than GPT-4.
Evaluation uses thirteen metrics across four dimensions.
Provides scripts for response generation, format transformation, reward model evaluation, and score computation.
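The final score-computation step can be pictured as averaging per-metric scores and then rolling them up by dimension. The sketch below assumes that simple two-level mean; the metric names, dimension names, record layout, and the `aggregate` function are hypothetical placeholders, not the repository's actual scripts or its full list of thirteen metrics.

```python
from collections import defaultdict

# Hypothetical metric-to-dimension mapping (illustrative subset only).
METRIC_TO_DIMENSION = {
    "fluency": "conversational_ability",
    "coherency": "conversational_ability",
    "persona_behavior": "character_consistency",
    "empathy": "role_playing_attractiveness",
}


def aggregate(records):
    """Average scores per metric, then average metric means per dimension.

    Each record is assumed to look like {"metric": str, "score": float}.
    """
    per_metric = defaultdict(list)
    for record in records:
        per_metric[record["metric"]].append(record["score"])
    metric_means = {m: sum(v) / len(v) for m, v in per_metric.items()}

    per_dimension = defaultdict(list)
    for metric, mean in metric_means.items():
        per_dimension[METRIC_TO_DIMENSION[metric]].append(mean)
    dimension_means = {d: sum(v) / len(v) for d, v in per_dimension.items()}
    return metric_means, dimension_means


# Made-up reward-model outputs for one model.
records = [
    {"metric": "fluency", "score": 4.0},
    {"metric": "fluency", "score": 2.0},
    {"metric": "coherency", "score": 5.0},
    {"metric": "persona_behavior", "score": 3.0},
    {"metric": "empathy", "score": 1.0},
    {"metric": "empathy", "score": 2.0},
]

metric_means, dimension_means = aggregate(records)
for dimension, score in sorted(dimension_means.items()):
    print(f"{dimension}: {score:.2f}")
```

Averaging metric means rather than raw scores keeps a metric with many examples from dominating its dimension; whether the benchmark weights metrics this way is an assumption here.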

Maintenance & Community

Recent updates include uploading reward model training data and publishing manual annotation guidelines.
No explicit community links (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The provided code and data should be checked for licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

No license is specified for either the dataset or the code, so the terms for commercial use are unclear. Running an evaluation requires converting data into specific formats and using the provided scripts, which adds some integration effort.

