wdoc by thiswillbeyourgithub

RAG system for summarizing and querying heterogeneous documents

created 2 years ago
474 stars

Top 65.3% on sourcepulse

Project Summary

wdoc is a Retrieval-Augmented Generation (RAG) system for summarizing, searching, and querying diverse document types at scale. It targets researchers, students, and professionals who need to extract insights from large, heterogeneous information sources, and it offers a multi-LLM query pipeline and customizable summaries.

How It Works

wdoc runs a multi-stage RAG pipeline that assigns a separate LLM role to each step. It first retrieves candidate documents using embeddings, then uses a "query_eval LLM" (Eve the Evaluator) to filter out irrelevant results. A "strong LLM" (Anna the Answerer) generates an answer from each remaining document. These intermediate answers are then semantically clustered and merged by a "combiner LLM" (Carl the Combiner) into a single, sourced answer. Together with the project's embedding search strategies and semantic batching, this design aims for high recall and specificity.
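The stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not wdoc's actual code: the `Doc` class, the `answer_query` function, and the `evaluate`, `answer`, and `combine` callables are hypothetical stand-ins for the embedding search and the three LLM roles, and the semantic clustering step is left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Doc:
    source: str                 # where the text came from, kept for citation
    text: str
    embedding: Sequence[float]  # precomputed vector (toy values in this sketch)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_query(
    query: str,
    query_embedding: Sequence[float],
    docs: List[Doc],
    evaluate: Callable[[str, str], bool],                # stand-in for the evaluator role
    answer: Callable[[str, str], str],                   # stand-in for the answerer role
    combine: Callable[[str, List[Tuple[str, str]]], str],  # stand-in for the combiner role
    top_k: int = 3,
) -> str:
    # 1. Retrieve the top_k documents by embedding similarity.
    ranked = sorted(
        docs, key=lambda d: cosine(query_embedding, d.embedding), reverse=True
    )[:top_k]
    # 2. Let the evaluator discard documents it judges irrelevant.
    relevant = [d for d in ranked if evaluate(query, d.text)]
    # 3. Produce one intermediate answer per remaining document, keeping its source.
    partials = [(d.source, answer(query, d.text)) for d in relevant]
    # 4. Merge the intermediate answers into a single sourced answer.
    return combine(query, partials)


# Toy stubs in place of real LLM calls (assumptions for illustration only).
docs = [
    Doc("a.pdf", "Cats are mammals.", [1.0, 0.0]),
    Doc("b.epub", "Stocks fell today.", [0.0, 1.0]),
    Doc("c.txt", "Cats purr when content.", [0.9, 0.1]),
]
evaluate = lambda query, text: "cat" in text.lower()
answer = lambda query, text: text
combine = lambda query, parts: " ".join(f"{text} [{source}]" for source, text in parts)

result = answer_query("What are cats?", [1.0, 0.0], docs, evaluate, answer, combine)
print(result)
```

Passing the three roles in as callables keeps the control flow visible without a network call; in a real setup each one would wrap a request to a different model.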

Quick Start & Requirements

  • Install via pip: pip install -U wdoc
  • Recommended extras: pip install -U "wdoc[pdftotext]" and pip install -U "wdoc[fasttext]" (the quotes keep shells such as zsh from interpreting the brackets)
  • Requires LLM API keys (e.g., OPENAI_API_KEY) set as environment variables.
  • Supports Python 3.11.7+.
  • See examples.md for detailed usage.
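Before a first run, the two requirements listed above (interpreter version and API key) can be checked with a short script. The sketch below is a hypothetical helper, not part of wdoc: the `preflight` name and its parameters are assumptions, and `OPENAI_API_KEY` is only the example key named above; other providers use different variable names.

```python
import os
import sys
from typing import List, Mapping, Sequence, Tuple


def preflight(
    env: Mapping[str, str] = os.environ,
    version: Sequence[int] = tuple(sys.version_info[:3]),
    required_keys: Sequence[str] = ("OPENAI_API_KEY",),  # example key; adjust per provider
    min_version: Tuple[int, int, int] = (3, 11, 7),      # minimum stated in the list above
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version) < min_version:
        found = ".".join(str(part) for part in version)
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {found} found, {wanted} or later expected")
    for key in required_keys:
        # Treat an empty value the same as a missing variable.
        if not env.get(key):
            problems.append(f"environment variable {key} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

The environment and version are parameters so the check can be exercised without touching the real process state.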

Highlighted Details

  • Supports 15+ filetypes, combinable within a single index (e.g., PDFs, EPUBs, Anki, YouTube).
  • Integrates with virtually any LLM provider and embedding models via litellm.
  • Offers advanced RAG with "Eve the Evaluator," "Anna the Answerer," and "Carl the Combiner" LLM roles.
  • Provides AI-powered summaries that focus on the author's reasoning and thought process.
  • Answers are sourced, allowing verification of information.

Maintenance & Community

  • Actively developed by a medical student who uses it daily.
  • Open to feature requests and pull requests; issues are encouraged before PRs.
  • Documentation is extensive, including docstrings, comments, and a comprehensive --help output.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Further investigation into the repository's files is recommended for licensing details and commercial use compatibility.

Limitations & Caveats

The project is in alpha status and may have instabilities, though issues are reportedly fixed quickly. The main branch is more stable than the development branch. Some advanced features like recursive summarization or handling extremely large documents might require careful configuration or may still be under active development.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 90 days
