Chinese RAG benchmark for evaluating large language models
CRUD-RAG provides a comprehensive benchmark and evaluation framework for Retrieval-Augmented Generation (RAG) systems, specifically targeting Chinese language models. It offers curated datasets and experimental code to assess RAG performance, benefiting researchers and developers working with LLMs in Chinese.
How It Works
The benchmark uses a multi-stage evaluation process that combines standard metrics such as BLEU, ROUGE, and BERTScore with a novel RAGQuestEval metric. RAGQuestEval relies on a large language model (such as GPT) acting as both a question generator and a question answerer, enabling a more nuanced assessment of RAG system capabilities. The framework supports both API-based and locally deployed LLMs, with flexible configuration of retrievers and embedding models.
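The RAGQuestEval idea can be illustrated with a short, self-contained sketch. This is not the repository's implementation: the ask_llm callable, the prompts, and the character-overlap scoring are placeholders standing in for whichever LLM and scoring scheme is actually used.

```python
# Sketch of a RAGQuestEval-style check (illustrative only, not the benchmark's code).
# `ask_llm` is a hypothetical callable that sends a prompt to an LLM (API-based or
# locally deployed) and returns its text reply.
from typing import Callable, List

def ragquesteval_sketch(reference: str, generated: str, ask_llm: Callable[[str], str]) -> float:
    # 1. LLM as question *generator*: derive factual questions from the reference text.
    questions: List[str] = [
        q.strip()
        for q in ask_llm(f"Write one question per line about the key facts in:\n{reference}").splitlines()
        if q.strip()
    ]
    if not questions:
        return 0.0

    scores = []
    for question in questions:
        # 2. LLM as question *answerer* on both the reference and the generated text.
        gold = ask_llm(f"Answer briefly using only this text:\n{reference}\n\nQ: {question}")
        pred = ask_llm(f"Answer briefly using only this text:\n{generated}\n\nQ: {question}")
        # 3. Score answer agreement with crude character overlap (workable for Chinese):
        #    recall of the gold answer's characters in the predicted answer.
        gold_chars, pred_chars = set(gold), set(pred)
        scores.append(len(gold_chars & pred_chars) / max(len(gold_chars), 1))

    # Higher means the generated text preserved more of the reference's key facts.
    return sum(scores) / len(scores)
```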
Quick Start & Requirements
- Install dependencies: pip install -r requirements.txt
- Start a local Milvus vector database: milvus-server
- Download the embedding model sentence-transformers/bge-base-zh-v1.5
- Set up the model configuration in model.config.py
- Run python quick_start.py with various arguments (e.g., --model_name, --data_path, --docs_path).
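To make these pieces concrete, below is a minimal, hypothetical retrieval sketch assuming a running milvus-server plus the pymilvus and sentence-transformers packages; the collection name and sample data are illustrative, and the embedding model is shown under its BAAI Hugging Face id, so adjust to match your setup. It is not the benchmark's own pipeline.

```python
# Embed documents with a bge-base-zh-v1.5 model and index/search them in local Milvus.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # bge-base-zh-v1.5 embedding model
client = MilvusClient(uri="http://localhost:19530")      # Milvus started via `milvus-server`

docs = ["北京是中国的首都。", "长江是中国最长的河流。"]
vectors = embedder.encode(docs, normalize_embeddings=True).tolist()

# Quick-setup collection: an id field plus a dense vector field of the right dimension.
client.create_collection(collection_name="crud_docs", dimension=len(vectors[0]))
client.insert(
    collection_name="crud_docs",
    data=[{"id": i, "vector": v, "text": t} for i, (v, t) in enumerate(zip(vectors, docs))],
)

# Retrieve the closest document for a query; its text would be fed to the generator LLM.
query_vec = embedder.encode(["中国的首都是哪里？"], normalize_embeddings=True).tolist()
hits = client.search(collection_name="crud_docs", data=query_vec, limit=1, output_fields=["text"])
print(hits[0][0]["entity"]["text"])
```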
Highlighted Details
Maintenance & Community
The project is associated with IAAR-Shanghai and the ACM Transactions on Information Systems. No specific community links (Discord, Slack) are provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The RAGQuestEval metric's reliance on GPT requires either API access or self-deployment of a QA-capable model. The prompts are optimized for ChatGPT and may need adjustment for other models, especially smaller 7B-parameter models that are sensitive to prompt complexity.
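If API access to GPT is not an option, one common workaround is to serve a local model behind an OpenAI-compatible endpoint (for example, via vLLM) and reuse the same prompt flow. The sketch below assumes such a server; the URL and model name are placeholders, and this is not part of the benchmark's code.

```python
# Point the OpenAI client at a locally hosted, OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def ask_llm(prompt: str) -> str:
    # Same call shape a GPT-based evaluator needs; smaller models may require
    # simpler prompts than the ChatGPT-tuned defaults.
    resp = client.chat.completions.create(
        model="Qwen2-7B-Instruct",  # placeholder: whatever model the local server hosts
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```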