Awesome-LLMs-as-Judges by CSHaitao

Survey paper for LLM-based evaluation methods

Created 10 months ago
448 stars

Top 67.2% on SourcePulse

Project Summary

This repository serves as a comprehensive survey and resource hub for "LLMs-as-Judges," a rapidly evolving field where Large Language Models are employed for evaluation tasks across various domains like text generation, question answering, and dialogue systems. It targets researchers, developers, and practitioners seeking to understand and leverage LLM-based evaluation methods for model assessment and enhancement.

How It Works

The project categorizes LLM-as-a-Judge methodologies into single-LLM systems (prompt-based, tuning-based, post-processing) and multi-LLM systems (communication, aggregation), alongside human-AI collaboration. It details applications across general text, multimodal, medical, legal, financial, and educational domains, and critically examines meta-evaluation benchmarks, metrics, limitations, biases, and adversarial attacks.
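
To make the taxonomy concrete, below is a minimal sketch of the simplest pattern the survey covers: a prompt-based, single-LLM judge that scores one response pointwise. The repository itself ships no code, so the OpenAI-compatible client, the model name, and the prompt wording here are illustrative assumptions, not part of the surveyed work.

```python
# Minimal sketch of a prompt-based, single-LLM judge (pointwise scoring).
# The client, model name, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the response to the question on a 1-5 scale for helpfulness and accuracy.
Question: {question}
Response: {response}
Reply with a single integer from 1 to 5."""


def judge_pointwise(question: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Ask the judge model for a 1-5 score and parse it as an integer."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding keeps scores reproducible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
    )
    # Raises ValueError if the judge does not return a bare integer.
    return int(completion.choices[0].message.content.strip())


if __name__ == "__main__":
    print(judge_pointwise("What is 2 + 2?", "2 + 2 equals 4."))
```

Tuning-based and post-processing variants replace this fixed prompt with a fine-tuned judge model or with calibration of the raw output, while multi-LLM systems combine several such judges through communication or score aggregation.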

Quick Start & Requirements

This repository is primarily a curated list of papers and research; there is nothing to install or run. Reading the linked papers requires only standard access to arXiv and the relevant publisher sites.

Highlighted Details

  • Comprehensive categorization of LLM-as-a-Judge methodologies, applications, and evaluation benchmarks.
  • Detailed analysis of limitations, including various biases (presentation, social, content, cognitive) and adversarial attack vectors; a simple position-bias probe is sketched after this list.
  • Regularly updated with daily arXiv papers and conference proceedings related to LLMs-as-Judges.
  • Includes a citation for the survey paper "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods."
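
As an example of how presentation bias can be probed, the sketch below judges the same response pair twice with the order swapped and flags verdicts that merely track position. The client, model name, and prompt are again hypothetical; the survey catalogs such bias analyses rather than prescribing an implementation.

```python
# Hedged sketch of a presentation (position) bias probe for a pairwise judge.
# Client, model name, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """You are an impartial evaluator.
Question: {question}
Response A: {a}
Response B: {b}
Which response is better overall? Answer with exactly "A" or "B"."""


def pairwise_verdict(question: str, a: str, b: str, model: str = "gpt-4o-mini") -> str:
    """Return the judge's raw verdict, expected to be "A" or "B"."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": PAIRWISE_PROMPT.format(question=question, a=a, b=b),
        }],
    )
    return completion.choices[0].message.content.strip()


def position_bias_detected(question: str, resp1: str, resp2: str) -> bool:
    """Judge the pair in both orders; a consistent judge should flip its letter."""
    forward = pairwise_verdict(question, resp1, resp2)   # resp1 shown as "A"
    backward = pairwise_verdict(question, resp2, resp1)  # resp1 shown as "B"
    # If the verdict letter is identical in both runs, the judge is following
    # position rather than content, which signals presentation bias.
    return forward == backward
```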

Maintenance & Community

The repository is maintained by CSHaitao and welcomes contributions via pull requests or direct contact. Updates are regularly posted, with recent activity including compilation of papers from NeurIPS 2024 and updates to the daily paper tracking.

Licensing & Compatibility

The repository itself does not specify a license. The linked papers are subject to their respective publisher or preprint server licenses.

Limitations & Caveats

This repository is a survey and does not provide executable code or tools. The effectiveness and robustness of LLMs-as-Judges remain open research questions; noted limitations include susceptibility to biases and adversarial attacks, as well as inherent model weaknesses such as outdated knowledge and hallucination.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 27 stars in the last 30 days

Explore Similar Projects

Starred by Nir Gazit (Cofounder of Traceloop), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

haven by redotvideo

0% · 346 stars
LLM fine-tuning and evaluation platform
Created 2 years ago · Updated 1 year ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

LMaaS-Papers by txsun1997

0% · 549 stars
Curated list of LMaaS research papers
Created 3 years ago · Updated 1 year ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

0.1% · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 1 month ago