A survey and paper collection for LLM-based evaluation methods (LLMs-as-Judges)
This repository serves as a comprehensive survey and resource hub for "LLMs-as-Judges," a rapidly evolving field where Large Language Models are employed for evaluation tasks across various domains like text generation, question answering, and dialogue systems. It targets researchers, developers, and practitioners seeking to understand and leverage LLM-based evaluation methods for model assessment and enhancement.
How It Works
The project categorizes LLM-as-a-Judge methodologies into single-LLM systems (prompt-based, tuning-based, post-processing) and multi-LLM systems (communication, aggregation), alongside human-AI collaboration. It details applications across general text, multimodal, medical, legal, financial, and educational domains, and critically examines meta-evaluation benchmarks, metrics, limitations, biases, and adversarial attacks.
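For illustration, the sketch below shows the two simplest patterns described above: a prompt-based single-LLM judge that scores a response, and a basic multi-LLM aggregation step that averages scores from several judges. The prompt wording, the 1-5 scale, and the `call_model` callables are assumptions made for this example, not a method prescribed by the repository.

```python
from statistics import mean
from typing import Callable, List

# Hypothetical rubric and 1-5 scale, chosen only for illustration.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the response to the question below "
    "on a scale of 1 (poor) to 5 (excellent) for helpfulness and accuracy.\n\n"
    "Question: {question}\n\nResponse: {response}\n\n"
    "Reply with a single integer."
)

def judge_once(call_model: Callable[[str], str], question: str, response: str) -> int:
    """Prompt-based single-LLM judging: build the prompt, query one model, parse the score."""
    raw = call_model(JUDGE_PROMPT.format(question=question, response=response))
    digits = [ch for ch in raw if ch.isdigit()]
    if not digits:
        raise ValueError(f"Could not parse a score from: {raw!r}")
    return max(1, min(5, int(digits[0])))  # clamp to the assumed 1-5 scale

def judge_panel(call_models: List[Callable[[str], str]], question: str, response: str) -> float:
    """Multi-LLM aggregation: average the scores returned by several judge models."""
    return mean(judge_once(call, question, response) for call in call_models)

if __name__ == "__main__":
    # Stand-in for real API calls (e.g., a chat-completions client), kept offline here.
    dummy_judge = lambda prompt: "4"
    print(judge_panel([dummy_judge, dummy_judge], "What is 2 + 2?", "4."))
```

Tuning-based and post-processing single-LLM systems, as well as communication-based multi-LLM systems, build on the same idea with fine-tuned judge models, calibrated outputs, or multi-turn debate rather than a single scoring prompt.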
Quick Start & Requirements
This repository is primarily a curated list of papers and research; it provides no installation or execution commands. Access to individual papers depends on their publishers or preprint servers, some of which may require institutional access.
Maintenance & Community
The repository is maintained by CSHaitao and welcomes contributions via pull requests or direct contact. It is updated regularly; recent activity includes a compilation of papers from NeurIPS 2024 and updates to the daily paper tracking.
Licensing & Compatibility
The repository itself does not specify a license. The linked papers are subject to their respective publisher or preprint server licenses.
Limitations & Caveats
This repository is a survey and does not provide executable code or tools. The effectiveness and robustness of LLMs-as-Judges remain active research questions; noted limitations include susceptibility to biases and adversarial attacks, as well as inherent weaknesses such as limited knowledge recency and hallucination.
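One frequently discussed bias in this literature is position bias in pairwise comparison, where a judge tends to favor whichever response is presented first. The sketch below is a hypothetical consistency check, not code from the repository: it swaps the order of two responses and reports whether the judge picks the same winner both times. The prompt text and the `call_model` callable are assumptions for this example.

```python
from typing import Callable

# Hypothetical pairwise-comparison prompt, for illustration only.
PAIRWISE_PROMPT = (
    "Which response answers the question better? "
    "Reply with exactly 'A' or 'B'.\n\n"
    "Question: {question}\n\nResponse A: {a}\n\nResponse B: {b}"
)

def pick(call_model: Callable[[str], str], question: str, a: str, b: str) -> str:
    """Ask the judge for a pairwise verdict and normalize it to 'A' or 'B'."""
    verdict = call_model(PAIRWISE_PROMPT.format(question=question, a=a, b=b)).strip().upper()
    return "A" if verdict.startswith("A") else "B"

def position_consistent(call_model: Callable[[str], str], question: str, r1: str, r2: str) -> bool:
    """The judge is position-consistent if it picks the same winner after the order is swapped."""
    first = pick(call_model, question, r1, r2)   # r1 shown in position A
    second = pick(call_model, question, r2, r1)  # r1 shown in position B
    winner_first = r1 if first == "A" else r2
    winner_second = r2 if second == "A" else r1
    return winner_first == winner_second

if __name__ == "__main__":
    # Toy judge that always answers 'A' regardless of content: maximally position-biased.
    always_a = lambda prompt: "A"
    print(position_consistent(always_a, "Capital of France?", "Paris.", "Lyon."))  # False
```

The meta-evaluation benchmarks surveyed in the repository formalize checks of this kind, alongside agreement with human judgments, to quantify how reliable a given judge model actually is.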