ChatEval by thunlp

LLM evaluation framework using multi-agent debate

created 2 years ago
294 stars

Top 90.9% on sourcepulse

View on GitHub
Project Summary

ChatEval provides a framework for automated evaluation of LLM-generated text using a multi-agent debate mechanism. It is designed for researchers and developers seeking to benchmark and improve LLM outputs by simulating human-like evaluation through LLM-driven discussions and judgments, offering a more nuanced and transparent alternative to direct scoring.

How It Works

ChatEval leverages a multi-agent system where LLMs, assigned specific roles and personas, debate the merits of different text responses. This debate process, inspired by human evaluation, aims to produce more informed and less biased judgments. The system orchestrates these debates, allowing LLMs to analyze, compare, and score responses based on predefined criteria and their assigned characteristics.
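
For illustration, the sketch below shows how one such debate round could be orchestrated with the OpenAI Python client (openai >= 1.0). The personas, prompts, and the agent_turn / run_debate helpers are hypothetical simplifications, not ChatEval's actual API; the real orchestration lives in the project's agentverse-based code and YAML configuration.

    # Minimal, hypothetical sketch of a multi-agent debate round (not ChatEval's actual code).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PERSONAS = {
        "Critic": "You are a strict critic. Focus on factual accuracy and flaws.",
        "Advocate": "You are a generous reader. Focus on strengths and usefulness.",
    }

    def agent_turn(persona, question, response_a, response_b, transcript):
        """Ask one persona to compare two candidate responses, given the debate so far."""
        messages = [
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": (
                f"Question: {question}\n\n"
                f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
                f"Debate so far:\n{transcript or '(none)'}\n\n"
                "State which response is better and why, then give each a score from 1 to 10."
            )},
        ]
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        return reply.choices[0].message.content

    def run_debate(question, response_a, response_b, rounds=2):
        """Alternate personas for a fixed number of rounds and return the full transcript."""
        transcript = ""
        for _ in range(rounds):
            for persona in PERSONAS:
                turn = agent_turn(persona, question, response_a, response_b, transcript)
                transcript += f"\n[{persona}] {turn}\n"
        return transcript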

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Set OpenAI API key: export OPENAI_API_KEY="your_api_key_here"
  • For arena-style demo, install FastChat: cd ChatEval/FastChat && pip3 install -e ".[model_worker,webui]"
  • Run evaluation: python llm_eval.py --config agentverse/tasks/llm_eval/config.yaml (an illustrative config sketch follows this list)
  • Requires Python 3.x and an OpenAI API key.
  • Demo setup involves running the FastChat controller, model workers, and Gradio web server.
  • An official video demo is available.
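
As noted above, llm_eval.py takes a YAML configuration file. The snippet below is a hedged sketch of what such a configuration might contain, parsed here with PyYAML; the field names (task, agents, debate, criteria) are assumptions for illustration, not ChatEval's actual schema — see agentverse/tasks/llm_eval/ in the repository for the real files.

    # Hypothetical config sketch; field names are illustrative, not ChatEval's real schema.
    import yaml  # pip install pyyaml

    EXAMPLE_CONFIG = """
    task: llm_eval
    agents:
      - name: critic
        persona: "A strict reviewer focused on factual accuracy."
        model: gpt-4
      - name: advocate
        persona: "A generous reviewer focused on helpfulness."
        model: gpt-4
    debate:
      rounds: 2
      criteria: [helpfulness, relevance, accuracy]
    """

    config = yaml.safe_load(EXAMPLE_CONFIG)
    for agent in config["agents"]:
        print(agent["name"], "->", agent["model"])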

Highlighted Details

  • Simulates LLM-based evaluators through multi-agent debate.
  • Built upon the FastChat framework for demo functionality.
  • Allows customization of agent roles, prompts, and evaluation logic via YAML configuration.
  • Outputs detailed evaluation evidence and scores for each response (an illustrative aggregation sketch follows this list).
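
To make "evidence and scores" concrete, here is a minimal, hypothetical sketch of aggregating per-agent judgments into a final verdict; the data shapes and values are illustrative assumptions, not ChatEval's actual output format.

    # Hypothetical aggregation of per-agent judgments; not ChatEval's actual output format.
    from statistics import mean

    # Each agent contributes a score per response plus a short rationale ("evidence").
    judgments = [
        {"agent": "critic",   "scores": {"A": 6, "B": 8}, "evidence": "B cites sources; A has an error."},
        {"agent": "advocate", "scores": {"A": 7, "B": 8}, "evidence": "Both helpful; B is more complete."},
    ]

    final_scores = {
        resp: mean(j["scores"][resp] for j in judgments)
        for resp in ("A", "B")
    }
    winner = max(final_scores, key=final_scores.get)

    print("Per-response averages:", final_scores)
    print("Winner:", winner)
    for j in judgments:
        print(f"[{j['agent']}] {j['evidence']}")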

Maintenance & Community

  • Project associated with the paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate".
  • Primary contributors appear to be affiliated with Tsinghua University.
  • Citation details provided in BibTeX format.

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README.
  • Dependencies like FastChat may have their own licenses.
  • Usage of OpenAI API is subject to OpenAI's terms of service.

Limitations & Caveats

The system relies heavily on OpenAI's API, incurring costs and requiring an API key. The effectiveness of the evaluation is dependent on the quality of the LLM agents and the prompt engineering used to define their roles and debate logic. The README does not specify licensing for the ChatEval code itself.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
