CRUD_RAG by IAAR-Shanghai

Chinese RAG benchmark for evaluating large language models

created 1 year ago
323 stars

Top 85.3% on sourcepulse

View on GitHub
Project Summary

CRUD-RAG provides a comprehensive benchmark and evaluation framework for Retrieval-Augmented Generation (RAG) systems, specifically targeting Chinese language models. It offers curated datasets and experimental code to assess RAG performance, benefiting researchers and developers working with LLMs in Chinese.

How It Works

The benchmark uses a multi-stage evaluation process that combines standard metrics such as BLEU, ROUGE, and BERTScore with a novel RAGQuestEval metric. RAGQuestEval employs a large language model (such as GPT) as both a question generator and a question answerer, enabling a more fine-grained assessment of how well a RAG system's output preserves the information in the reference text. The framework supports both API-based and locally deployed LLMs, and both the retriever and the embedding model are configurable.
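
As a rough illustration of the RAGQuestEval idea, the sketch below generates questions from the reference text with an LLM, answers them against both the reference and the candidate output, and lets the LLM judge agreement. This is a hypothetical sketch, not the repository's implementation; ask_llm is a placeholder for any chat-completion call, and all prompts are illustrative.

    # Hypothetical QuestEval-style scorer; ask_llm stands in for any LLM API call.
    from typing import Callable

    def rag_questeval(reference: str, candidate: str,
                      ask_llm: Callable[[str], str], n_questions: int = 5) -> float:
        # 1. Generate questions whose answers are stated in the reference text.
        prompt = (f"Write {n_questions} short questions answerable from this text, "
                  f"one per line:\n{reference}")
        questions = [q for q in ask_llm(prompt).splitlines() if q.strip()][:n_questions]

        correct = 0
        for q in questions:
            # 2. Answer each question from the reference and from the candidate.
            gold = ask_llm(f"Answer briefly, using only this text:\n{reference}\n\nQ: {q}")
            pred = ask_llm(f"Answer briefly, using only this text:\n{candidate}\n\nQ: {q}")
            # 3. Let the LLM judge whether the two answers agree.
            verdict = ask_llm(f"Do these answers agree? Reply yes or no.\nA1: {gold}\nA2: {pred}")
            correct += verdict.strip().lower().startswith("yes")

        # Recall-style score: fraction of reference facts the candidate preserves.
        return correct / len(questions) if questions else 0.0

A recall-oriented score of this kind complements surface metrics such as BLEU and ROUGE, which only measure n-gram overlap with the reference.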

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Start Milvus Lite: milvus-server
  • Download sentence-transformers/bge-base-zh-v1.5 model.
  • Modify config.py.
  • Run: python quick_start.py with various arguments (e.g., --model_name, --data_path, --docs_path); see the sketch after this list.
  • Initial vector index construction takes approximately 3 hours.
  • Requires Python, a vector database (Milvus Lite), and specific embedding models.
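
For concreteness, a hypothetical end-to-end run is sketched below in Python. The flag names (--model_name, --data_path, --docs_path) come from the README; the model name and paths are placeholders, not values taken from the repository.

    # Hypothetical invocation of the benchmark; flag values are placeholders.
    import subprocess

    subprocess.run(
        [
            "python", "quick_start.py",
            "--model_name", "gpt-3.5-turbo",    # API-based or locally deployed LLM
            "--data_path", "data/benchmark/",   # placeholder path to the evaluation dataset
            "--docs_path", "data/news_docs/",   # placeholder path to the news corpus
        ],
        check=True,
    )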

Highlighted Details

  • Includes over 80,000 Chinese news documents for retrieval database construction.
  • Supports multiple RAG tasks: continue writing, hallucinated modification, and question answering.
  • Offers flexibility in choosing LLMs (API or local) and embedding models.
  • Features a custom RAGQuestEval metric for in-depth RAG evaluation.

Maintenance & Community

The project is maintained by IAAR-Shanghai, and its accompanying paper appears in ACM Transactions on Information Systems (TOIS). The README provides no community links (Discord, Slack).

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Because the RAGQuestEval metric relies on GPT, it requires either API access or a self-deployed question-answering model. Prompts are optimized for ChatGPT and may need adjustment for other models; smaller models (around 7B parameters) are especially sensitive to prompt complexity.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 26 stars in the last 90 days
