smartyfh: Benchmarking LLMs via uncertainty quantification
This project introduces a novel benchmarking framework for Large Language Models (LLMs) that integrates uncertainty quantification, addressing a gap in current evaluation methods that primarily focus on accuracy. It is designed for researchers and practitioners seeking a more comprehensive understanding of LLM performance, offering a new metric, UAcc, to reveal nuances beyond simple accuracy scores.
How It Works
The core of the project is the application of conformal prediction for uncertainty quantification, chosen for its ease of implementation, efficiency, and rigorous theoretical grounding. The framework evaluates LLMs across five representative NLP tasks (Question Answering, Reading Comprehension, Commonsense Inference, Dialogue Response Selection, Document Summarization), each formulated as a multiple-choice question. A new metric, Uncertainty-aware Accuracy (UAcc), is proposed, which considers both the correctness of a prediction and its associated uncertainty, providing a more holistic performance assessment.
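The conformal prediction step described above can be sketched as follows. This is a minimal illustration of split conformal prediction on multiple-choice probabilities, not the repository's own code. The function names, the choice of nonconformity score (one minus the probability of the true option), the error rate alpha, and the synthetic six-option data are all assumptions made for this example. The summary above does not give the UAcc formula, so the sketch stops at the two quantities the metric is said to combine: accuracy and the size of the prediction set, which serves as the uncertainty measure.

```python
import math

import numpy as np


def softmax(logits):
    """Convert per-option logits of shape (n, k) into probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the conformal threshold from a held-out calibration split.

    The nonconformity score is 1 - p(true option); this choice is an
    assumption for illustration. The threshold is the k-th smallest score
    with k = ceil((n + 1) * (1 - alpha)).
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return float(np.sort(scores)[k - 1])


def prediction_sets(test_probs, qhat):
    """Boolean mask (n, k): True where an option is in the prediction set."""
    return (1.0 - test_probs) <= qhat


def summarize(test_probs, test_labels, sets):
    """Accuracy, empirical coverage, and mean prediction-set size."""
    idx = np.arange(len(test_labels))
    return {
        "accuracy": float((test_probs.argmax(axis=1) == test_labels).mean()),
        "coverage": float(sets[idx, test_labels].mean()),
        "mean_set_size": float(sets.sum(axis=1).mean()),
    }


if __name__ == "__main__":
    # Synthetic stand-in for model logits over six hypothetical options.
    rng = np.random.default_rng(0)
    n_cal, n_test, n_options = 500, 500, 6
    total = n_cal + n_test
    labels = rng.integers(0, n_options, size=total)
    logits = rng.normal(size=(total, n_options))
    logits[np.arange(total), labels] += 2.0  # make the true option likelier
    probs = softmax(logits)

    qhat = calibrate_threshold(probs[:n_cal], labels[:n_cal], alpha=0.1)
    sets = prediction_sets(probs[n_cal:], qhat)
    print(summarize(probs[n_cal:], labels[n_cal:], sets))
```

A larger mean set size means the model needs more candidate answers to reach the target coverage, which is read as higher uncertainty. Two models with the same accuracy can therefore differ on an uncertainty-aware metric.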
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
The workflow consists of generating logits (generate_logits.py, generate_logits_chat.py) and performing uncertainty quantification via conformal prediction (uncertainty_quantification_via_cp.py).
Highlighted Details
Maintenance & Community
Direct contact is available via email at fanghua.ye.21@gmail.com. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The provided README does not specify a software license. Users should verify licensing terms before adoption, especially for commercial or closed-source applications.
Limitations & Caveats
The evaluation is focused on specific NLP tasks and relies on datasets derived from existing benchmarks. The performance and findings are specific to the tested LLMs and the chosen uncertainty quantification methodology. The absence of a stated license is a significant point for potential adopters to clarify.
Repository status: Inactive (last activity 1 year ago).