Benchmark for LLM evaluation in Retrieval-Augmented Generation
This repository provides an implementation for benchmarking Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG) tasks, specifically focusing on news datasets. It offers refined datasets for evaluating information integration and counterfactual robustness, enabling researchers and developers to assess LLM performance under noisy and counterfactual retrieval conditions.
How It Works
The project utilizes a benchmark dataset containing refined versions of news articles, including corrected answers and adjusted document relevance. It evaluates LLMs by simulating RAG scenarios with controlled noise rates in retrieved documents and assesses their ability to integrate information and maintain robustness against counterfactual changes. The evaluation scripts calculate accuracy, rejection rates, and error detection/correction rates.
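To illustrate how such an evaluation loop can be structured, the sketch below mixes relevant and noisy passages at a fixed noise rate and scores responses by exact-match accuracy and rejection rate. It is a minimal, hypothetical reconstruction rather than the repository's actual code in evalue.py: the item fields (`query`, `answer`, `positive`, `negative`), the rejection phrase, and the `query_llm` callable are all assumptions.

```python
# Minimal sketch of a noise-controlled RAG evaluation loop.
# Hypothetical: field names and the rejection marker are assumptions,
# not taken from the repository's evalue.py.
import random

REJECTION_PHRASE = "I can not answer"  # assumed marker for refusals

def build_context(item, noise_rate, passage_num, rng):
    """Mix relevant ("positive") and noisy ("negative") passages at the given noise rate."""
    num_noise = int(passage_num * noise_rate)
    num_pos = passage_num - num_noise
    docs = rng.sample(item["positive"], num_pos) + rng.sample(item["negative"], num_noise)
    rng.shuffle(docs)
    return docs

def evaluate(dataset, query_llm, noise_rate=0.6, passage_num=5, seed=0):
    """Return accuracy and rejection rate over a list of QA items."""
    rng = random.Random(seed)
    correct = rejected = 0
    for item in dataset:
        docs = build_context(item, noise_rate, passage_num, rng)
        prompt = "\n".join(docs) + "\n\nQuestion: " + item["query"]
        response = query_llm(prompt)  # user-supplied callable wrapping the LLM API
        if REJECTION_PHRASE.lower() in response.lower():
            rejected += 1
        elif any(ans.lower() in response.lower() for ans in item["answer"]):
            correct += 1
    n = len(dataset)
    return {"accuracy": correct / n, "rejection_rate": rejected / n}
```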
Quick Start & Requirements
- Set up the environment: `conda create -n rgb python=3.10.0`, then `conda activate rgb`, followed by `bash env.sh` to install dependencies.
- Datasets are provided under `data/` (e.g., `en.json`, `zh_refine.json`).
- Run `python evalue.py`, specifying the dataset, model name, temperature, noise rate, API key, and number of passages; an example invocation is sketched below.
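A typical invocation might look like the following. The flag names and values are assumptions inferred from the parameters listed above, not verified against the script's argument parser, so check evalue.py for the exact interface.

```bash
# Hypothetical example: flag names are assumed, not confirmed against evalue.py.
python evalue.py \
  --dataset en \
  --modelname chatgpt \
  --temp 0.2 \
  --noise_rate 0.6 \
  --api_key "$OPENAI_API_KEY" \
  --passage_num 5
```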
- Refined datasets (`*_refine.json`) with corrected documents and answers.
Maintenance & Community
No specific community channels or notable contributors are mentioned in the README.
Licensing & Compatibility
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. Commercial use requires explicit permission.
Limitations & Caveats
The license strictly prohibits commercial use without formal permission. The project appears to be focused on specific benchmark evaluations rather than a general-purpose RAG framework.