document.ai by GanymedeNil

Local knowledge base solution using vector DB and GPT-3.5

Created 3 years ago

3,679 stars

Top 13.0% on SourcePulse

3 Experts Love This Project

Yaowei Zheng

Author of LLaMA-Factory

Elvis Saravia

Founder of DAIR.AI

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Project Summary

This project provides a universal local knowledge base solution leveraging vector databases and GPT-3.5. It's designed for users who need to build intelligent Q&A systems on their own data, offering a more refined conversational experience than simple keyword search.

How It Works

The core approach involves converting local question-answer datasets into vector embeddings and storing them in a vector database. When a user queries, their question is also vectorized and used to retrieve the top-K most similar answers from the database. GPT-3.5 is then employed to refine the structure and presentation of these retrieved answers, making responses more natural, especially in conversational contexts like customer service.
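The retrieval flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database such as FAISS or Milvus, and `build_prompt` only assembles the text that would be sent to GPT-3.5 (no API call is made). All function names and the sample data are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: (question, answer) pairs from a local knowledge base.
QA_PAIRS = [
    ("how do I reset my password", "Use the reset link on the login page."),
    ("what are your opening hours", "We are open 9 to 5 on weekdays."),
    ("how do I cancel my order", "Open the orders page and choose cancel."),
]

def embed(text):
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(qa_pairs):
    """Embed every stored question once; a vector DB would hold these vectors."""
    return [(embed(q), q, a) for q, a in qa_pairs]

def retrieve(index, query, k=2):
    """Embed the user query and return the top-k most similar (question, answer) pairs."""
    qv = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(qv, entry[0]), reverse=True)
    return [(q, a) for _, q, a in ranked[:k]]

def build_prompt(query, hits):
    """Assemble the text a chat model would be asked to rewrite into a natural reply."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in hits)
    return (
        "Answer the user's question using only the reference answers below.\n\n"
        f"{context}\n\nUser question: {query}"
    )
```

In a real deployment the prompt returned by `build_prompt` would be passed to the chat model, and `embed` would be replaced by calls to an actual embedding model so that paraphrases without shared words still match.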

Quick Start & Requirements

Install: The README gives no explicit installation steps; a Python environment is implied.
Prerequisites: OpenAI API access for GPT-3.5, a vector database (e.g., FAISS, Milvus), and Python.
Resources: Requires a substantial local dataset for embedding and, optionally, for fine-tuning.
Links: OpenAI Usage Policies

Highlighted Details

Addresses data inaccuracy through techniques like query splitting and topic extraction.
Explores self-training embedding models (e.g., text2vec-large-chinese, text2vec-cmedqq-lert-large) for domain-specific accuracy.
Discusses fine-tuning GPT models on domain-specific data for improved performance, acknowledging high costs.
Proposes a hybrid approach: periodic fine-tuning combined with external vector database similarity search.
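As a rough illustration of the query-splitting idea mentioned above, the sketch below breaks a compound question into sub-queries and retrieves for each one separately. The summary does not say how the project splits queries, so the rule used here (splitting on `?`, `;`, and the word "and") is an assumption chosen for simplicity, and `split_query` and `retrieve_for_all` are hypothetical names. The `retrieve` argument is any callable that takes a query and `k` and returns a list of hashable hits.

```python
import re

def split_query(query):
    """Naively split a compound question on '?', ';' and the standalone word 'and'."""
    parts = re.split(r"[?;]|\band\b", query, flags=re.IGNORECASE)
    cleaned = (p.strip(" ,.") for p in parts)
    return [p for p in cleaned if p]

def retrieve_for_all(sub_queries, retrieve, k=2):
    """Run retrieval per sub-query and merge the hits, dropping duplicates in order."""
    seen = set()
    merged = []
    for sub_query in sub_queries:
        for hit in retrieve(sub_query, k):
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged
```

Retrieving per sub-query keeps one topic from dominating the similarity score of a multi-part question. A rule-based splitter like this one will mis-split questions where "and" joins two nouns, which is one reason a production system might delegate the splitting to a language model instead.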

Maintenance & Community

The project is a personal exploration (MSD case) with potential for broader application.
Mentions shibing624 as a contributor to a recommended embedding model.
No explicit community links (Discord, Slack) or roadmap are provided.

Licensing & Compatibility

The README does not specify a license.
Relies on OpenAI's API, subject to their usage policies.

Limitations & Caveats

The project is presented as an exploration and may require significant engineering effort to adapt. Fine-tuning costs are noted as high, and the effectiveness of self-trained embedding models for highly specialized domains requires validation. The lack of explicit licensing could be a concern for commercial use.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

4 stars in the last 30 days