Benchmark for LLM evaluation in Retrieval-Augmented Generation
This repository provides an implementation for benchmarking Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG) tasks, specifically focusing on news datasets. It offers refined datasets for evaluating information integration and counterfactual robustness, enabling researchers and developers to assess LLM performance under noisy and counterfactual retrieval conditions.
How It Works
The project utilizes a benchmark dataset containing refined versions of news articles, including corrected answers and adjusted document relevance. It evaluates LLMs by simulating RAG scenarios with controlled noise rates in retrieved documents and assesses their ability to integrate information and maintain robustness against counterfactual changes. The evaluation scripts calculate accuracy, rejection rates, and error detection/correction rates.
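To illustrate how such an evaluation loop can be structured, the sketch below mixes relevant and noisy passages at a fixed noise rate and scores responses by exact-match accuracy and rejection rate. It is a minimal, hypothetical reconstruction rather than the repository's actual code in evalue.py: the item fields (`query`, `answer`, `positive`, `negative`), the rejection phrase, and the `query_llm` callable are all assumptions.

```python
# Minimal sketch of a noise-controlled RAG evaluation loop.
# Hypothetical: field names and the rejection marker are assumptions,
# not taken from the repository's evalue.py.
import random

REJECTION_PHRASE = "I can not answer"  # assumed marker for refusals

def build_context(item, noise_rate, passage_num, rng):
    """Mix relevant ("positive") and noisy ("negative") passages at the given noise rate."""
    num_noise = int(passage_num * noise_rate)
    num_pos = passage_num - num_noise
    docs = rng.sample(item["positive"], num_pos) + rng.sample(item["negative"], num_noise)
    rng.shuffle(docs)
    return docs

def evaluate(dataset, query_llm, noise_rate=0.6, passage_num=5, seed=0):
    """Return accuracy and rejection rate over a list of QA items."""
    rng = random.Random(seed)
    correct = rejected = 0
    for item in dataset:
        docs = build_context(item, noise_rate, passage_num, rng)
        prompt = "\n".join(docs) + "\n\nQuestion: " + item["query"]
        response = query_llm(prompt)  # user-supplied callable wrapping the LLM API
        if REJECTION_PHRASE.lower() in response.lower():
            rejected += 1
        elif any(ans.lower() in response.lower() for ans in item["answer"]):
            correct += 1
    n = len(dataset)
    return {"accuracy": correct / n, "rejection_rate": rejected / n}
```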
Quick Start & Requirements
- Set up the environment: `conda create -n rgb python=3.10.0`, then `conda activate rgb`, followed by `bash env.sh` to install dependencies.
- Datasets are provided under `data/` (e.g., `en.json`, `zh_refine.json`).
- Run `python evalue.py`, specifying the dataset, model name, temperature, noise rate, API key, and number of passages; an example invocation is sketched below.
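A typical invocation might look like the following. The flag names and values are assumptions inferred from the parameters listed above, not verified against the script's argument parser, so check evalue.py for the exact interface.

```bash
# Hypothetical example: flag names are assumed, not confirmed against evalue.py.
python evalue.py \
  --dataset en \
  --modelname chatgpt \
  --temp 0.2 \
  --noise_rate 0.6 \
  --api_key "$OPENAI_API_KEY" \
  --passage_num 5
```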
- Refined datasets (`*_refine.json`) with corrected documents and answers.
Maintenance & Community
No specific community channels or notable contributors are mentioned in the README.
Licensing & Compatibility
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. Commercial use requires explicit permission.
Limitations & Caveats
The license strictly prohibits commercial use without formal permission. The project appears to be focused on specific benchmark evaluations rather than a general-purpose RAG framework.