LLM-Uncertainty-Bench by smartyfh

Benchmarking LLMs via uncertainty quantification

Created 2 years ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

This project introduces a novel benchmarking framework for Large Language Models (LLMs) that integrates uncertainty quantification, addressing a gap in current evaluation methods that primarily focus on accuracy. It is designed for researchers and practitioners seeking a more comprehensive understanding of LLM performance, offering a new metric, UAcc, to reveal nuances beyond simple accuracy scores.

How It Works

The core of the project is the application of conformal prediction for uncertainty quantification, chosen for its ease of implementation, efficiency, and rigorous theoretical grounding. The framework evaluates LLMs across five representative NLP tasks (Question Answering, Reading Comprehension, Commonsense Inference, Dialogue Response Selection, and Document Summarization), each formulated in a multiple-choice format. On top of this, a new metric, Uncertainty-aware Accuracy (UAcc), is proposed; it accounts for both the correctness of a prediction and its associated uncertainty, giving a more holistic assessment of performance.
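To make the approach concrete, below is a minimal Python sketch of split conformal prediction applied to multiple-choice answer probabilities, together with an illustrative uncertainty-aware score. The function names, the choice of nonconformity score (1 minus the probability of the true option), and the exact way accuracy is combined with set size are assumptions for illustration only; they do not necessarily match the repository's code or the paper's precise definition of UAcc.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for multiple-choice answers.

    cal_probs:  (n_cal, n_options) softmax probabilities on a calibration set
    cal_labels: (n_cal,) indices of the correct option
    test_probs: (n_test, n_options) softmax probabilities on the test set
    alpha:      target miscoverage rate (sets contain the true answer with
                probability >= 1 - alpha, marginally)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true option.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the standard finite-sample correction.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(cal_scores, q_level, method="higher")
    # An option enters the prediction set if its score is within the quantile.
    return test_probs >= (1.0 - qhat)  # boolean mask, (n_test, n_options)

def uncertainty_aware_accuracy(pred_sets, test_probs, test_labels):
    """Illustrative UAcc-style metric: accuracy discounted by average set size
    (larger sets mean more uncertainty). Hypothetical formulation, not the
    paper's exact formula."""
    acc = (test_probs.argmax(axis=1) == test_labels).mean()
    avg_set_size = pred_sets.sum(axis=1).mean()
    return acc / max(avg_set_size, 1.0)  # clamp guards against empty sets

# Toy usage with random probabilities over 6 answer options:
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(6), size=500)
cal_labels = rng.integers(0, 6, size=500)
test_probs = rng.dirichlet(np.ones(6), size=200)
test_labels = rng.integers(0, 6, size=200)
sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs)
print("avg set size:", sets.sum(axis=1).mean())
print("UAcc-style score:", uncertainty_aware_accuracy(sets, test_probs, test_labels))
```

The size of the prediction sets is the uncertainty signal: two models with identical accuracy can produce very different set sizes, which is what allows a UAcc-style metric to reorder them relative to an accuracy-only ranking.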

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Python 3.10.13.
  • Usage: Scripts are provided for generating model logits (generate_logits.py, generate_logits_chat.py) and performing uncertainty quantification via conformal prediction (uncertainty_quantification_via_cp.py).
  • Resources: Links to the research paper (arXiv:2401.12794) and datasets (HuggingFace ErikYip/LLM-Uncertainty-Bench) are available.
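As a minimal getting-started sketch, the published datasets can be pulled from the Hugging Face Hub with the huggingface_hub library; the target directory and the exact way the repository's scripts consume the downloaded files are assumptions, so check the README before running.

```python
# Sketch only: downloads the benchmark data locally. The local_dir choice and
# how the repo's scripts expect to find the files are assumptions -- see README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ErikYip/LLM-Uncertainty-Bench",
    repo_type="dataset",
    local_dir="data",
)
# Afterwards, generate_logits.py / generate_logits_chat.py produce per-option
# logits, and uncertainty_quantification_via_cp.py applies conformal prediction.
```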

Highlighted Details

  • Introduces the Uncertainty-aware Accuracy (UAcc) metric, which can alter LLM rankings compared to accuracy-only evaluations.
  • Evaluates eight LLMs, including Llama-2, Mistral, Falcon, Yi, Qwen, DeepSeek, and InternLM, across five distinct NLP tasks.
  • Key findings indicate that higher-accuracy LLMs may exhibit lower certainty, that larger models can display greater uncertainty, and that instruction finetuning tends to increase LLM uncertainty.
  • Utilizes conformal prediction for statistically rigorous uncertainty estimation.

Maintenance & Community

Direct contact is available via email at fanghua.ye.21@gmail.com. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The provided README does not specify a software license. Users should verify licensing terms before adoption, especially for commercial or closed-source applications.

Limitations & Caveats

The evaluation is focused on specific NLP tasks and relies on datasets derived from existing benchmarks. The performance and findings are specific to the tested LLMs and the chosen uncertainty quantification methodology. The absence of a stated license is a significant point for potential adopters to clarify.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
