LLM evaluation framework using multi-agent debate
ChatEval provides a framework for automated evaluation of LLM-generated text using a multi-agent debate mechanism. It is aimed at researchers and developers who want to benchmark and improve LLM outputs: rather than asking a single model for a direct score, it simulates human-like evaluation through LLM-driven discussion and judgment, offering a more nuanced and transparent alternative to direct scoring.
How It Works
ChatEval leverages a multi-agent system in which LLMs, each assigned a specific role and persona, debate the merits of different text responses. This debate process, inspired by how human evaluators deliberate, aims to produce more informed and less biased judgments. The system orchestrates the debates, letting the agents analyze, compare, and score responses according to predefined criteria and their assigned personas.
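To make the mechanism concrete, here is a minimal, self-contained sketch of a debate-style evaluator built directly on the OpenAI chat API. It is not ChatEval's implementation: the personas, prompts, model name, turn count, and scoring format are all illustrative assumptions. In ChatEval itself the equivalent roles, prompts, and debate structure come from the task configuration referenced in the Quick Start rather than hard-coded strings.

# Illustrative sketch only: a minimal multi-agent debate evaluation loop built
# directly on the OpenAI chat API. Persona names, prompts, turn count, and the
# scoring format are assumptions, not ChatEval's actual code.
from openai import OpenAI

client = OpenAI()       # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"   # illustrative choice; the real task config selects the model

PERSONAS = {
    "Critic": "You are a strict critic who focuses on factual accuracy.",
    "Advocate": "You are a generous reader who focuses on helpfulness and fluency.",
}

def chat(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def debate_and_judge(question: str, response_a: str, response_b: str, turns: int = 2) -> dict:
    """Let each persona comment in turn, then collect a final 1-10 score per response."""
    task = (f"Question:\n{question}\n\nResponse A:\n{response_a}\n\n"
            f"Response B:\n{response_b}\n")
    transcript = ""
    # Debate phase: every agent sees the task plus everything said so far.
    for turn in range(turns):
        for name, persona in PERSONAS.items():
            remark = chat(
                persona,
                f"{task}\nDebate so far:\n{transcript or '(none)'}\n\n"
                f"Give a short argument about which response is better and why.",
            )
            transcript += f"[Turn {turn + 1}] {name}: {remark}\n"
    # Judgment phase: each agent scores both responses after reading the full debate.
    scores = {}
    for name, persona in PERSONAS.items():
        scores[name] = chat(
            persona,
            f"{task}\nFull debate:\n{transcript}\n\n"
            f"Output two integers from 1 to 10 in the form 'A=<score> B=<score>'.",
        )
    return {"transcript": transcript, "scores": scores}

if __name__ == "__main__":
    result = debate_and_judge(
        "Explain why the sky is blue.",
        "Rayleigh scattering preferentially scatters shorter wavelengths.",
        "Because the ocean reflects onto it.",
    )
    print(result["scores"])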
Quick Start & Requirements
# Install the Python dependencies
pip install -r requirements.txt

# The debating agents call the OpenAI API, so an API key is required
export OPENAI_API_KEY="your_api_key_here"

# Install the bundled FastChat package with the model-worker and web-UI extras
cd ChatEval/FastChat && pip3 install -e ".[model_worker,webui]"

# Run a multi-agent evaluation with an example task configuration
# (run this from the ChatEval root; the previous step changed into FastChat/)
python llm_eval.py --config agentverse/tasks/llm_eval/config.yaml
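The --config argument points at a YAML task definition, which is where the agent roles and debate setup described above are expected to live. A hypothetical way to peek at it before running an evaluation is sketched below; the agents key is an assumption about the schema, so inspect the actual file in your checkout to confirm.

# Hypothetical inspection helper; the "agents" key is an assumed schema detail.
import yaml

with open("agentverse/tasks/llm_eval/config.yaml") as f:
    cfg = yaml.safe_load(f)

print("top-level keys:", list(cfg.keys()))
for agent in cfg.get("agents", []):  # falls back to an empty list if the key differs
    print(agent)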
Limitations & Caveats
The system relies heavily on OpenAI's API, which incurs costs and requires an API key. Evaluation quality depends on the underlying LLM agents and on the prompt engineering that defines their roles and debate logic. The README does not specify a license for the ChatEval code itself.