RAG-QA-Generator by wangxb96

Automated RAG knowledge base generation and management tool

Created 1 year ago
257 stars

Top 98.3% on SourcePulse

Project Summary

RAG-QA-Generator automates the construction and management of knowledge bases for Retrieval-Augmented Generation (RAG) systems. It replaces manual knowledge base creation by using large language models to generate question-answer pairs from documents in several formats. The stated goal is to make RAG system development faster and more accessible by reducing manual intervention and improving knowledge base quality.

How It Works

The project loads documents (txt, pdf, docx) with langchain_community loaders and splits them into manageable text chunks. Each chunk is then sent to a large language model through the OpenAI API client, with a prompt that asks for QA pairs grounded in that chunk. The example model is qwen2.5-72b, which is not an OpenAI model, so the client is presumably pointed at an OpenAI-compatible endpoint. The generated pairs are stored in a backend database through a RESTful API. A Streamlit web interface handles file uploads, previews of the generated QA pairs, and knowledge base operations.
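The chunk-then-generate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the function names (split_text, build_prompt, parse_qa_pairs), the character-based chunking, the default sizes, and the JSON reply format are all hypothetical. The model call itself is left out so the sketch runs without network access.

```python
import json


def split_text(text, chunk_size=500, overlap=50):
    """Split text into character chunks of at most chunk_size that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def build_prompt(chunk, n_pairs=3):
    """Build a prompt asking for QA pairs as a JSON list (assumed format)."""
    return (
        f"Read the text below and write {n_pairs} question-answer pairs "
        "that can be answered from it alone. Reply with a JSON list of "
        'objects that each have "question" and "answer" keys.\n\n'
        f"Text:\n{chunk}"
    )


def parse_qa_pairs(raw_reply):
    """Parse a model reply, keeping only well-formed, non-empty QA pairs."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs


# In a real run, build_prompt(chunk) would be sent to the model for each chunk
# and the text of the reply passed to parse_qa_pairs.
```

Validating the reply before storing it matters because a model may return malformed JSON or empty fields; dropping those entries keeps them out of the knowledge base.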

Quick Start & Requirements

  • Primary Install/Run:
    1. Clone the repository: git clone https://github.com/wangxb96/RAG-QA-Generator.git
    2. Navigate to the directory: cd RAG-QA-Generator
    3. Install dependencies: pip install -r requirements.txt
    4. Configure API keys and base URLs for OpenAI and the backend service.
    5. Run the Streamlit application: streamlit run AutoQAG.py
  • Prerequisites: Python 3.11.5. Key dependencies include streamlit==1.22.0, requests==2.31.0, openai==0.28.0, langchain==0.10.0, PyMuPDF==1.22.5, pandas==2.1.1, langchain_community==0.1.0. Requires API keys for OpenAI and a backend service (e.g., TaskingAI).
  • Links: GitHub repository (https://github.com/wangxb96/RAG-QA-Generator)
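Step 4 above does not say where the API keys and base URLs are configured. One common approach, shown here purely as an assumption, is to read them from environment variables and fail early if any are missing. The variable names below (OPENAI_API_KEY, OPENAI_API_BASE, KB_API_KEY, KB_BASE_URL) and the load_settings helper are hypothetical and may not match what AutoQAG.py expects.

```python
import os

# Hypothetical setting names; check AutoQAG.py for the ones the project uses.
REQUIRED_SETTINGS = (
    "OPENAI_API_KEY",   # key for the language model endpoint
    "OPENAI_API_BASE",  # base URL of that endpoint
    "KB_API_KEY",       # key for the backend knowledge base service
    "KB_BASE_URL",      # base URL of the backend service
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Checking all four values before launching Streamlit surfaces a configuration mistake immediately, rather than at the first model or backend request.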

Highlighted Details

  • AI-driven generation of diverse question-answer pairs from unstructured documents.
  • Support for multiple document formats (txt, pdf, docx) via langchain_community loaders.
  • Intuitive web interface built with Streamlit for user interaction.
  • Flexible knowledge base management, including collection creation, data insertion, and JSON export/import.
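The JSON export/import mentioned in the last item can be illustrated with a simple round trip. This is a sketch only: the file layout (a JSON list of objects with "question" and "answer" keys) and the helper names export_qa_pairs and import_qa_pairs are assumptions, not the project's actual format.

```python
import json
from pathlib import Path


def export_qa_pairs(pairs, path):
    """Write QA pairs to a UTF-8 JSON file, keeping non-ASCII text readable."""
    text = json.dumps(pairs, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def import_qa_pairs(path):
    """Read QA pairs back, keeping only entries that have both keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of QA pairs")
    return [
        item for item in data
        if isinstance(item, dict) and "question" in item and "answer" in item
    ]
```

Passing ensure_ascii=False keeps non-English questions and answers legible in the exported file instead of escaping them.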

Maintenance & Community

The project acknowledges support from academic institutions in China, including Beijing Normal University's AI and Future Network Center, Smart Interdisciplinary Supercomputing Center, and the Ministry of Education Engineering Research Center for Big Data Cloud-Edge Intelligent Collaboration. No specific community channels (e.g., Discord, Slack) or active contributor information are detailed in the README.

Licensing & Compatibility

The README does not specify a software license. Users should verify licensing terms before adoption, especially concerning commercial use or integration with closed-source systems.

Limitations & Caveats

This project requires the configuration of external API keys for both the language model (e.g., OpenAI) and the backend knowledge base service. Processing large documents or generating extensive QA pairs can be time-consuming. The system's reliance on specific API endpoints and models may necessitate adaptation if those services evolve or change.
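To get a feel for how processing time grows with document size, a back-of-the-envelope estimate helps. The sketch below assumes one model call per chunk and fixed-size character chunks with overlap. The chunk size, overlap, and seconds-per-call figures are illustrative assumptions, not measurements from the project.

```python
import math


def estimate_model_calls(doc_chars, chunk_size=500, overlap=50):
    """Estimate the number of chunks, assuming one model call per chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if doc_chars <= 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One full first chunk, then one more chunk per step until the end.
    return 1 + math.ceil((doc_chars - chunk_size) / step)


def estimate_seconds(doc_chars, seconds_per_call, **chunking):
    """Estimate the total time if calls are made one after another."""
    return estimate_model_calls(doc_chars, **chunking) * seconds_per_call
```

With these defaults, a 200,000-character document yields 445 calls; at an assumed 5 seconds per sequential call that is about 37 minutes.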

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days
