wdoc by thiswillbeyourgithub

RAG system for summarizing and querying heterogeneous documents

Created 2 years ago

501 stars

Top 62.1% on SourcePulse

1 Expert Loves This Project

Boris Cherny

Creator of Claude Code; MTS at Anthropic

Project Summary

wdoc is a Retrieval-Augmented Generation (RAG) system for summarizing, searching, and querying diverse document types at scale. It targets researchers, students, and professionals who need to extract insights from large, heterogeneous information sources, and it offers advanced RAG capabilities and customizable summaries.

How It Works

wdoc runs a multi-stage RAG pipeline that uses several LLMs to improve accuracy and detail. It first retrieves candidate documents using embeddings, then a "query_eval LLM" (Eve the Evaluator) filters out irrelevant results. A "strong LLM" (Anna the Answerer) generates answers from the remaining documents. Those answers are semantically clustered and merged by a "combiner LLM" (Carl the Combiner) into a single, sourced answer. Together with its embedding search strategies and semantic batching, this design aims for high recall and specificity.
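The stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not wdoc's actual code: every name here (`Doc`, `Answer`, `overlap_score`, `run_pipeline`, and the `toy_*` functions) is hypothetical, word overlap stands in for embedding similarity, plain functions stand in for the three LLM roles, and the semantic clustering step is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    source: str
    text: str


@dataclass
class Answer:
    source: str
    text: str


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: share of query words found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def run_pipeline(
    query: str,
    docs: List[Doc],
    evaluate: Callable[[str, Doc], bool],
    answer: Callable[[str, Doc], str],
    combine: Callable[[str, List[Answer]], str],
    top_k: int = 3,
) -> str:
    # 1. Retrieve: rank documents by similarity and keep the top_k candidates.
    ranked = sorted(docs, key=lambda d: overlap_score(query, d.text), reverse=True)
    candidates = ranked[:top_k]
    # 2. Evaluate: drop candidates the evaluator judges irrelevant.
    relevant = [d for d in candidates if evaluate(query, d)]
    # 3. Answer: produce one answer per remaining document, keeping its source.
    answers = [Answer(d.source, answer(query, d)) for d in relevant]
    # 4. Combine: merge the per-document answers into one sourced answer.
    return combine(query, answers)


# Hypothetical stand-ins for the evaluator, answerer, and combiner LLM calls.
def toy_evaluate(query: str, doc: Doc) -> bool:
    return overlap_score(query, doc.text) > 0


def toy_answer(query: str, doc: Doc) -> str:
    return doc.text


def toy_combine(query: str, answers: List[Answer]) -> str:
    return " ".join(f"{a.text} [{a.source}]" for a in answers)


docs = [
    Doc("a.pdf", "RAG retrieves documents before generating"),
    Doc("b.epub", "Bread needs flour and water"),
    Doc("c.txt", "Embeddings rank documents by similarity"),
]
result = run_pipeline(
    "how are documents retrieved", docs, toy_evaluate, toy_answer, toy_combine
)
print(result)
```

Passing the three roles in as callables keeps the control flow visible and lets each stage be swapped for a real model call without changing the pipeline itself.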

Quick Start & Requirements

Install via pip: pip install -U wdoc
Recommended: pip install -U wdoc[pdftotext] and pip install -U wdoc[fasttext]
Requires LLM API keys (e.g., OPENAI_API_KEY) set as environment variables.
Supports Python 3.11.7+.
See examples.md for detailed usage.
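
Before a first run, the two requirements above can be checked with a short script. This is a sketch only: `check_prerequisites` and `MIN_PYTHON` are hypothetical names and not part of wdoc, and `OPENAI_API_KEY` is just the example key named above. Other providers use other variable names.

```python
import os
import sys

# Minimum interpreter version listed in the requirements above.
MIN_PYTHON = (3, 11, 7)


def check_prerequisites(env=None, version=None, key_name="OPENAI_API_KEY"):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:3]) if version is None else tuple(version)
    problems = []
    if version < MIN_PYTHON:
        required = ".".join(map(str, MIN_PYTHON))
        found = ".".join(map(str, version))
        problems.append(f"Python {required}+ required, found {found}")
    if not env.get(key_name):
        problems.append(f"{key_name} is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

The `env` and `version` parameters exist only so the check can be exercised without touching the real environment.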

Highlighted Details

Supports 15+ filetypes, combinable within a single index (e.g., PDFs, EPUBs, Anki, YouTube).
Integrates with virtually any LLM provider and embedding models via litellm.
Offers advanced RAG with "Eve the Evaluator," "Anna the Answerer," and "Carl the Combiner" LLM roles.
Provides AI-powered summaries that focus on the author's reasoning and thought process.
Answers are sourced, allowing verification of information.

Maintenance & Community

Actively developed by a medical student who uses it daily.
Open to feature requests and pull requests; issues are encouraged before PRs.
Documentation is extensive, including docstrings, comments, and a comprehensive --help output.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Further investigation into the repository's files is recommended for licensing details and commercial use compatibility.

Limitations & Caveats

The project is in alpha status and may have instabilities, though issues are reportedly fixed quickly. The main branch is more stable than the development branch. Some advanced features like recursive summarization or handling extremely large documents might require careful configuration or may still be under active development.

Health Check

Last Commit: 5 days ago
Responsiveness: 1 day

Star History

9 stars in the last 30 days