RGB by chen700564

Benchmark for LLM evaluation in Retrieval-Augmented Generation

created 1 year ago
328 stars

Top 84.4% on sourcepulse

View on GitHub
Project Summary

This repository provides RGB, a benchmark for evaluating Large Language Models (LLMs) on Retrieval-Augmented Generation (RAG) tasks, built from news articles. It offers English and Chinese datasets, including refined versions for testing information integration and counterfactual robustness, so researchers and developers can assess LLM performance under controlled retrieval conditions.

How It Works

The project utilizes a benchmark dataset containing refined versions of news articles, including corrected answers and adjusted document relevance. It evaluates LLMs by simulating RAG scenarios with controlled noise rates in retrieved documents and assesses their ability to integrate information and maintain robustness against counterfactual changes. The evaluation scripts calculate accuracy, rejection rates, and error detection/correction rates.
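The "controlled noise rate" described above can be illustrated with a short sketch: given a passage budget, a fixed fraction of the retrieved documents is replaced with irrelevant ones. The function name and document lists below are hypothetical, assumed only for illustration; they are not from the repository.

```python
import random

def build_context(positive_docs, negative_docs, passage_num, noise_rate, seed=0):
    """Mix relevant and noisy passages at a fixed noise rate (illustrative sketch)."""
    rng = random.Random(seed)
    n_noise = int(passage_num * noise_rate)   # how many passages are noise
    n_pos = passage_num - n_noise             # the rest are relevant
    docs = positive_docs[:n_pos] + negative_docs[:n_noise]
    rng.shuffle(docs)                         # order should not reveal relevance
    return docs

# Example: 5 passages at a 0.6 noise rate -> 2 relevant, 3 noisy
ctx = build_context(["p1", "p2", "p3"], ["n1", "n2", "n3"], 5, 0.6)
```

Shuffling matters here: if relevant passages always came first, a model could exploit position rather than content.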

Quick Start & Requirements

  • Install: conda create -n rgb python=3.10.0 followed by conda activate rgb and bash env.sh.
  • Prerequisites: Python 3.10.0, Conda environment.
  • Data: Datasets are provided in the repository's data/ directory (e.g., en.json, zh_refine.json).
  • Evaluation: Run python evalue.py with specified dataset, model name, temperature, noise rate, API key, and passage number.
  • Links: Environment Setup
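The steps above can be sketched as a shell session. The flag names passed to evalue.py are assumptions inferred from this summary's parameter list (dataset, model name, temperature, noise rate, API key, passage number), not confirmed against the repository:

```shell
# Create and activate the environment, then install dependencies
conda create -n rgb python=3.10.0
conda activate rgb
bash env.sh

# Run the evaluation (flag names are assumed for illustration)
python evalue.py \
  --dataset en \
  --modelname chatgpt \
  --temp 0.2 \
  --noise_rate 0.6 \
  --api_key YOUR_API_KEY \
  --passage_num 5
```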

Highlighted Details

  • Supports evaluation of information integration and counterfactual robustness.
  • Includes refined datasets (_refine.json) with corrected documents and answers.
  • Evaluates LLM performance metrics like accuracy, rejection rate, and error detection/correction.
  • Allows customization of noise rates and the number of provided passages.
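The accuracy and rejection metrics listed above can be sketched as simple substring checks over model outputs. This is an illustrative approximation, not the repository's scoring code; the rejection phrase and function name are assumptions.

```python
def score(predictions, answers, reject_token="I can not answer"):
    """Compute accuracy and rejection rate over model outputs (illustrative)."""
    correct = rejected = 0
    for pred, ans in zip(predictions, answers):
        if reject_token.lower() in pred.lower():
            rejected += 1                    # model declined to answer
        elif ans.lower() in pred.lower():
            correct += 1                     # gold answer appears in the output
    n = len(predictions)
    return correct / n, rejected / n

acc, rej = score(["The answer is Paris", "I can not answer that"],
                 ["Paris", "London"])
# -> accuracy 0.5, rejection rate 0.5
```

In a negative-rejection setting (all passages noisy), a high rejection rate is the desired behavior, so the two metrics are read together rather than maximizing accuracy alone.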

Maintenance & Community

No specific community channels or notable contributors are mentioned in the README.

Licensing & Compatibility

Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. Commercial use requires explicit permission.

Limitations & Caveats

The license strictly prohibits commercial use without formal permission. The project is a focused benchmark suite for RAG evaluation rather than a general-purpose RAG framework.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days
