ChatEval by thunlp

LLM evaluation framework using multi-agent debate

created 2 years ago
294 stars

Top 90.9% on sourcepulse

View on GitHub
Project Summary

ChatEval provides a framework for automated evaluation of LLM-generated text using a multi-agent debate mechanism. It is designed for researchers and developers seeking to benchmark and improve LLM outputs by simulating human-like evaluation through LLM-driven discussions and judgments, offering a more nuanced and transparent alternative to direct scoring.

How It Works

ChatEval leverages a multi-agent system where LLMs, assigned specific roles and personas, debate the merits of different text responses. This debate process, inspired by human evaluation, aims to produce more informed and less biased judgments. The system orchestrates these debates, allowing LLMs to analyze, compare, and score responses based on predefined criteria and their assigned characteristics.
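
For illustration, the sketch below shows how one such debate round could be orchestrated with the OpenAI Python client (openai >= 1.0). The personas, prompts, and the agent_turn / run_debate helpers are hypothetical simplifications, not ChatEval's actual API; the real orchestration lives in the project's agentverse-based code and YAML configuration.

    # Minimal, hypothetical sketch of a multi-agent debate round (not ChatEval's actual code).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PERSONAS = {
        "Critic": "You are a strict critic. Focus on factual accuracy and flaws.",
        "Advocate": "You are a generous reader. Focus on strengths and usefulness.",
    }

    def agent_turn(persona, question, response_a, response_b, transcript):
        """Ask one persona to compare two candidate responses, given the debate so far."""
        messages = [
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": (
                f"Question: {question}\n\n"
                f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
                f"Debate so far:\n{transcript or '(none)'}\n\n"
                "State which response is better and why, then give each a score from 1 to 10."
            )},
        ]
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        return reply.choices[0].message.content

    def run_debate(question, response_a, response_b, rounds=2):
        """Alternate personas for a fixed number of rounds and return the full transcript."""
        transcript = ""
        for _ in range(rounds):
            for persona in PERSONAS:
                turn = agent_turn(persona, question, response_a, response_b, transcript)
                transcript += f"\n[{persona}] {turn}\n"
        return transcript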

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Set OpenAI API key: export OPENAI_API_KEY="your_api_key_here"
  • For arena-style demo, install FastChat: cd ChatEval/FastChat && pip3 install -e ".[model_worker,webui]"
  • Run evaluation: python llm_eval.py --config agentverse/tasks/llm_eval/config.yaml (an illustrative config sketch follows this list)
  • Requires Python 3.x and an OpenAI API key.
  • Demo setup involves running the FastChat controller, model workers, and Gradio web server.
  • An official video demo is available.
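
As noted above, llm_eval.py takes a YAML configuration file. The snippet below is a hedged sketch of what such a configuration might contain, parsed here with PyYAML; the field names (task, agents, debate, criteria) are assumptions for illustration, not ChatEval's actual schema — see agentverse/tasks/llm_eval/ in the repository for the real files.

    # Hypothetical config sketch; field names are illustrative, not ChatEval's real schema.
    import yaml  # pip install pyyaml

    EXAMPLE_CONFIG = """
    task: llm_eval
    agents:
      - name: critic
        persona: "A strict reviewer focused on factual accuracy."
        model: gpt-4
      - name: advocate
        persona: "A generous reviewer focused on helpfulness."
        model: gpt-4
    debate:
      rounds: 2
      criteria: [helpfulness, relevance, accuracy]
    """

    config = yaml.safe_load(EXAMPLE_CONFIG)
    for agent in config["agents"]:
        print(agent["name"], "->", agent["model"])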

Highlighted Details

  • Simulates LLM-based evaluators through multi-agent debate.
  • Built upon the FastChat framework for demo functionality.
  • Allows customization of agent roles, prompts, and evaluation logic via YAML configuration.
  • Outputs detailed evaluation evidence and scores for each response (an illustrative aggregation sketch follows this list).
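
To make "evidence and scores" concrete, here is a minimal, hypothetical sketch of aggregating per-agent judgments into a final verdict; the data shapes and values are illustrative assumptions, not ChatEval's actual output format.

    # Hypothetical aggregation of per-agent judgments; not ChatEval's actual output format.
    from statistics import mean

    # Each agent contributes a score per response plus a short rationale ("evidence").
    judgments = [
        {"agent": "critic",   "scores": {"A": 6, "B": 8}, "evidence": "B cites sources; A has an error."},
        {"agent": "advocate", "scores": {"A": 7, "B": 8}, "evidence": "Both helpful; B is more complete."},
    ]

    final_scores = {
        resp: mean(j["scores"][resp] for j in judgments)
        for resp in ("A", "B")
    }
    winner = max(final_scores, key=final_scores.get)

    print("Per-response averages:", final_scores)
    print("Winner:", winner)
    for j in judgments:
        print(f"[{j['agent']}] {j['evidence']}")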

Maintenance & Community

  • Project associated with the paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate".
  • Primary contributors appear to be affiliated with Tsinghua University.
  • Citation details provided in BibTeX format.

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README.
  • Dependencies like FastChat may have their own licenses.
  • Usage of OpenAI API is subject to OpenAI's terms of service.

Limitations & Caveats

The system relies heavily on OpenAI's API, incurring costs and requiring an API key. The effectiveness of the evaluation is dependent on the quality of the LLM agents and the prompt engineering used to define their roles and debate logic. The README does not specify licensing for the ChatEval code itself.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
