document.ai by GanymedeNil

Local knowledge base solution using vector DB and GPT-3.5

created 2 years ago
3,678 stars

Top 13.5% on sourcepulse

Project Summary

This project provides a universal local knowledge base solution leveraging vector databases and GPT-3.5. It's designed for users who need to build intelligent Q&A systems on their own data, offering a more refined conversational experience than simple keyword search.

How It Works

The core approach converts a local question-answer dataset into vector embeddings and stores them in a vector database. When a user submits a query, the question is embedded the same way and used to retrieve the top-K most similar entries from the database. GPT-3.5 then rewrites the retrieved answers into a single, better-structured response, which makes replies read more naturally in conversational settings such as customer service.
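The retrieval flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `ToyVectorStore`, `build_prompt`, and the sample Q&A pairs are hypothetical, the bag-of-words vectors stand in for a real embedding model, and the in-memory list stands in for a vector database such as FAISS or Milvus. The final prompt is what would be sent to GPT-3.5; no API call is made here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ToyVectorStore:
    """In-memory stand-in for a vector database (assumption, not the project's API)."""

    def __init__(self, qa_pairs):
        self.qa_pairs = list(qa_pairs)
        # Vocabulary is built from the stored questions only.
        vocab = sorted({tok for q, _ in self.qa_pairs for tok in tokenize(q)})
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.vectors = [self.embed(q) for q, _ in self.qa_pairs]

    def embed(self, text):
        """Toy embedding: L2-normalized term counts (a real system would call a model)."""
        vec = [0.0] * len(self.index)
        for tok, n in Counter(tokenize(text)).items():
            if tok in self.index:
                vec[self.index[tok]] += n
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    def top_k(self, query, k=2):
        """Return the k (score, (question, answer)) entries with highest cosine similarity."""
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), pair)
            for v, pair in zip(self.vectors, self.qa_pairs)
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]


def build_prompt(query, hits):
    """Assemble the text that would be sent to the chat model for rewriting."""
    context = "\n".join(f"Q: {q}\nA: {a}" for _, (q, a) in hits)
    return (
        "Answer the customer's question using only the reference answers below.\n\n"
        f"{context}\n\nCustomer question: {query}"
    )


QA_PAIRS = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("What is the refund policy?", "Refunds are available within 30 days of purchase."),
    ("How do I contact support?", "Email the support team or use the in-app chat."),
]

store = ToyVectorStore(QA_PAIRS)
user_query = "I forgot my password, how can I reset it?"
hits = store.top_k(user_query, k=2)
prompt = build_prompt(user_query, hits)
```

In a real deployment the `embed` method would call an embedding model and `top_k` would be a similarity search against the vector database; only the prompt-assembly step would stay roughly as shown.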

Quick Start & Requirements

  • Install: The README gives no explicit installation steps; a Python environment is implied.
  • Prerequisites: GPT-3.5 API access (OpenAI), vector database (e.g., FAISS, Milvus), Python.
  • Resources: Requires significant local data for embedding and potential fine-tuning.
  • Links: OpenAI Usage Policies

Highlighted Details

  • Addresses data inaccuracy through techniques like query splitting and topic extraction.
  • Explores self-training embedding models (e.g., text2vec-large-chinese, text2vec-cmedqq-lert-large) for domain-specific accuracy.
  • Discusses fine-tuning GPT models on domain-specific data for improved performance, acknowledging high costs.
  • Proposes a hybrid approach: periodic fine-tuning combined with external vector database similarity search.
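As a rough illustration of the query-splitting idea in the first bullet, the sketch below breaks a compound question into sub-questions, retrieves for each one separately, and merges the results. It is one plausible reading of the technique, not the project's implementation: `split_query`, `overlap_score`, `retrieve`, and `retrieve_split` are hypothetical names, the split rule (on `?`, `;`, and the word "and") is a naive heuristic, and token overlap stands in for embedding similarity.

```python
import re


def split_query(query):
    """Naively split a compound question on '?', ';', and the word 'and'."""
    parts = re.split(r"[?;]|\band\b", query, flags=re.IGNORECASE)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]


def overlap_score(a, b):
    """Jaccard overlap of word tokens (stand-in for vector similarity)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(sub_query, documents, k=1):
    """Return the k documents that best match a single sub-question."""
    ranked = sorted(documents, key=lambda d: overlap_score(sub_query, d), reverse=True)
    return ranked[:k]


def retrieve_split(query, documents, k=1):
    """Retrieve per sub-question and merge results, dropping duplicates."""
    merged = []
    for sub in split_query(query):
        for doc in retrieve(sub, documents, k):
            if doc not in merged:
                merged.append(doc)
    return merged


DOCUMENTS = [
    "Refunds are available within 30 days.",
    "Shipping takes 3 to 5 business days.",
    "Support is open on weekdays.",
]

compound = "How long does shipping take and are refunds available?"
sub_queries = split_query(compound)
results = retrieve_split(compound, DOCUMENTS, k=1)
```

Searching per sub-question keeps one topic from dominating the similarity score, which is the kind of inaccuracy a single combined query can introduce.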

Maintenance & Community

  • The project is a personal exploration (MSD case) with potential for broader application.
  • Mentions shibing624 as a contributor to a recommended embedding model.
  • No explicit community links (Discord, Slack) or roadmap are provided.

Licensing & Compatibility

  • The README does not specify a license for the project.
  • Relies on OpenAI's API, subject to their usage policies.

Limitations & Caveats

The project is presented as an exploration and may require significant engineering effort to adapt. Fine-tuning costs are noted as high, and the effectiveness of self-trained embedding models for highly specialized domains requires validation. The lack of explicit licensing could be a concern for commercial use.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
