Chat_with_Datawhale_langchain by logan-zou

RAG for personal knowledge base Q&A

created 1 year ago
324 stars

Top 85.2% on sourcepulse

Project Summary

This project provides a personal knowledge base assistant that leverages large language models (LLMs) to answer questions based on provided documentation. It's designed for users who need to efficiently query and retrieve information from extensive, complex datasets, offering a streamlined approach to knowledge management and access.

How It Works

The core of the project is a Retrieval-Augmented Generation (RAG) pipeline built with Langchain. It ingests various document formats (PDF, Markdown, TXT), splits them into manageable chunks, and generates vector embeddings using models like m3e or OpenAI. These embeddings are stored in a Chroma vector database for efficient similarity search. When a user asks a question, the system vectorizes the query, retrieves the most relevant document chunks from the database, and feeds them as context to an LLM (supporting OpenAI, Ernie Bot, Spark, and ChatGLM) to generate a concise answer.
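The retrieval half of that pipeline can be sketched in plain Python. This is a toy stand-in, not the project's code: the bag-of-words "embedding" replaces m3e/OpenAI vectors, the naive sentence splitter replaces Langchain's text splitters, and the sorted similarity search replaces Chroma.

```python
import math
import re
from collections import Counter

def split_into_chunks(text):
    """Naive sentence splitter (stand-in for Langchain's chunking,
    which splits by characters/tokens with overlap)."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for m3e/OpenAI vectors)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query
    (stand-in for a Chroma similarity search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("Datawhale is an open source learning community. "
       "Langchain is a framework for building LLM applications. "
       "Chroma stores vector embeddings for fast similarity search.")
chunks = split_into_chunks(doc)
context = retrieve("What stores the vector embeddings?", chunks)[0]
# The retrieved chunk is prepended to the question as LLM context.
prompt = f"Context:\n{context}\n\nQuestion: What stores the vector embeddings?"
```

In the real project the prompt is then sent to one of the supported LLMs (OpenAI, Ernie Bot, Spark, or ChatGLM) to produce the answer.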

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (conda create -n llm-universe python==3.9.0), activate it (conda activate llm-universe), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python >= 3.9, PyTorch >= 2.0.0. CPU: Intel 5th gen or equivalent (2+ core cloud CPU recommended). RAM: 4GB minimum.
  • Running:
    • Local API: cd project/serve then uvicorn api:app --reload (Linux) or python api.py (Windows).
    • Gradio Demo: cd llm-universe/project/serve then python run_gradio.py -model_name='chatglm_std' -embedding_model='m3e' -db_path='../../data_base/knowledge_db' -persist_path='../../data_base/vector_db'
  • Documentation: Project Repository

Highlighted Details

  • Supports multiple LLM APIs (OpenAI, Ernie Bot, Spark, ChatGLM) and embedding models (m3e, OpenAI, ZhipuAI).
  • Includes scripts to automatically fetch and summarize READMEs from the Datawhale GitHub organization.
  • Handles various document types (PDF, Markdown) using loaders like PyMuPDFLoader and UnstructuredMarkdownLoader.
  • Implements both stateless and stateful (conversational memory) QA chains.
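The stateless vs. stateful distinction in the last bullet can be illustrated without Langchain. In this sketch (hypothetical helper names, not the project's API), the stateless prompt uses only the retrieved context and current question, while the conversational class keeps a running history, playing the role that Langchain's `ConversationBufferMemory` plays inside a conversational QA chain.

```python
def stateless_prompt(context, question):
    """Stateless QA: each call is independent, built only from the
    retrieved context and the current question."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

class ConversationalQA:
    """Stateful QA: keeps a running chat history, mirroring what
    conversational memory does inside a Langchain QA chain."""
    def __init__(self):
        self.history = []  # list of (question, answer) turns

    def build_prompt(self, context, question):
        past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
        return (f"Chat history:\n{past}\n\nContext:\n{context}\n\n"
                f"Question: {question}\nAnswer:")

    def record(self, question, answer):
        """Store a completed turn so follow-up questions can refer to it."""
        self.history.append((question, answer))

qa = ConversationalQA()
qa.record("What is Datawhale?", "An open source learning community.")
followup = qa.build_prompt("Datawhale hosts study groups.", "Where is it hosted?")
```

The history makes follow-up questions like "Where is *it* hosted?" resolvable, which a stateless chain cannot do.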

Maintenance & Community

  • Current version: 0.2.0 (updated March 17, 2024).
  • Future plans include user-uploaded knowledge bases, Multi-Agent frameworks, and improved retrieval accuracy.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

  • The README mentions potential rate limiting or risk-control ("wind control") blocks when using the OpenAI API for summarization, worked around with 60-second delays between calls.
  • Handling complex document structures (charts, images) within PDFs requires custom fine-tuning.
  • The project relies on external API keys for many LLM and embedding services.
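The 60-second delay mentioned above amounts to a retry-with-wait loop around each summarization call. A generic sketch, assuming a rate-limit error surfaces as an exception (`call_llm`, the `RuntimeError` stand-in, and the defaults are placeholders, not the project's actual code):

```python
import time

def call_with_backoff(call_llm, prompt, retries=3, delay=60.0):
    """Retry an LLM call, sleeping `delay` seconds after a rate-limit
    failure, as the README suggests for OpenAI summarization."""
    for attempt in range(retries):
        try:
            return call_llm(prompt)
        except RuntimeError as exc:  # stand-in for the client's rate-limit error
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            print(f"Rate limited ({exc}); sleeping {delay}s before retry")
            time.sleep(delay)
```

In the real pipeline the caught exception would be the OpenAI client's rate-limit error type rather than a bare `RuntimeError`.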

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 62 stars in the last 90 days
