Chinese RAG benchmark for evaluating large language models
CRUD-RAG provides a comprehensive benchmark and evaluation framework for Retrieval-Augmented Generation (RAG) systems, specifically targeting Chinese language models. It offers curated datasets and experimental code to assess RAG performance, benefiting researchers and developers working with LLMs in Chinese.
How It Works
The benchmark uses a multi-stage evaluation process that combines standard metrics such as BLEU, ROUGE, and BERTScore with a novel RAGQuestEval metric. RAGQuestEval relies on a large language model (such as GPT) acting as both a question generator and a question answerer, enabling a more nuanced assessment of RAG system capabilities. The framework supports both API-based and locally deployed LLMs, with flexible configuration of retrievers and embedding models.
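The RAGQuestEval idea can be illustrated with a short, self-contained sketch. This is not the repository's implementation: the ask_llm callable, the prompts, and the character-overlap scoring are placeholders standing in for whichever LLM and scoring scheme is actually used.

```python
# Sketch of a RAGQuestEval-style check (illustrative only, not the benchmark's code).
# `ask_llm` is a hypothetical callable that sends a prompt to an LLM (API-based or
# locally deployed) and returns its text reply.
from typing import Callable, List

def ragquesteval_sketch(reference: str, generated: str, ask_llm: Callable[[str], str]) -> float:
    # 1. LLM as question *generator*: derive factual questions from the reference text.
    questions: List[str] = [
        q.strip()
        for q in ask_llm(f"Write one question per line about the key facts in:\n{reference}").splitlines()
        if q.strip()
    ]
    if not questions:
        return 0.0

    scores = []
    for question in questions:
        # 2. LLM as question *answerer* on both the reference and the generated text.
        gold = ask_llm(f"Answer briefly using only this text:\n{reference}\n\nQ: {question}")
        pred = ask_llm(f"Answer briefly using only this text:\n{generated}\n\nQ: {question}")
        # 3. Score answer agreement with crude character overlap (workable for Chinese):
        #    recall of the gold answer's characters in the predicted answer.
        gold_chars, pred_chars = set(gold), set(pred)
        scores.append(len(gold_chars & pred_chars) / max(len(gold_chars), 1))

    # Higher means the generated text preserved more of the reference's key facts.
    return sum(scores) / len(scores)
```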
Quick Start & Requirements
- Install dependencies: pip install -r requirements.txt
- Start a local Milvus vector database: milvus-server
- Download the embedding model sentence-transformers/bge-base-zh-v1.5
- Set up the model configuration in model.config.py
- Run python quick_start.py with various arguments (e.g., --model_name, --data_path, --docs_path).
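To make these pieces concrete, below is a minimal, hypothetical retrieval sketch assuming a running milvus-server plus the pymilvus and sentence-transformers packages; the collection name and sample data are illustrative, and the embedding model is shown under its BAAI Hugging Face id, so adjust to match your setup. It is not the benchmark's own pipeline.

```python
# Embed documents with a bge-base-zh-v1.5 model and index/search them in local Milvus.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # bge-base-zh-v1.5 embedding model
client = MilvusClient(uri="http://localhost:19530")      # Milvus started via `milvus-server`

docs = ["北京是中国的首都。", "长江是中国最长的河流。"]
vectors = embedder.encode(docs, normalize_embeddings=True).tolist()

# Quick-setup collection: an id field plus a dense vector field of the right dimension.
client.create_collection(collection_name="crud_docs", dimension=len(vectors[0]))
client.insert(
    collection_name="crud_docs",
    data=[{"id": i, "vector": v, "text": t} for i, (v, t) in enumerate(zip(vectors, docs))],
)

# Retrieve the closest document for a query; its text would be fed to the generator LLM.
query_vec = embedder.encode(["中国的首都是哪里？"], normalize_embeddings=True).tolist()
hits = client.search(collection_name="crud_docs", data=query_vec, limit=1, output_fields=["text"])
print(hits[0][0]["entity"]["text"])
```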
Highlighted Details
Maintenance & Community
The project is associated with IAAR-Shanghai and the ACM Transactions on Information Systems. No specific community links (Discord, Slack) are provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The RAGQuestEval metric's reliance on GPT requires either API access or self-deployment of a QA-capable model. The prompts are optimized for ChatGPT and may need adjustment for other models, especially smaller 7B-parameter models that are sensitive to prompt complexity.
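If API access to GPT is not an option, one common workaround is to serve a local model behind an OpenAI-compatible endpoint (for example, via vLLM) and reuse the same prompt flow. The sketch below assumes such a server; the URL and model name are placeholders, and this is not part of the benchmark's code.

```python
# Point the OpenAI client at a locally hosted, OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def ask_llm(prompt: str) -> str:
    # Same call shape a GPT-based evaluator needs; smaller models may require
    # simpler prompts than the ChatGPT-tuned defaults.
    resp = client.chat.completions.create(
        model="Qwen2-7B-Instruct",  # placeholder: whatever model the local server hosts
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```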