RAG system for summarizing and querying heterogeneous documents
wdoc is a Retrieval-Augmented Generation (RAG) system for summarizing, searching, and querying diverse document types at scale. It targets researchers, students, and professionals who need to extract insights from large, heterogeneous information sources, and it offers customizable summaries alongside its RAG querying.
How It Works
wdoc runs a multi-stage RAG pipeline that uses several LLMs to improve accuracy and detail. It first retrieves candidate documents using embeddings, then uses a "query_eval LLM" (Eve the Evaluator) to filter out irrelevant results. A "strong LLM" (Anna the Answerer) generates an answer from each remaining document. These intermediate answers are then semantically clustered and merged by a "combiner LLM" (Carl the Combiner) into a single, sourced answer. Together with its embedding search strategies and semantic batching, this staged approach aims for high recall and specificity.
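The stages described above can be sketched in a few lines of Python. This is only an illustrative toy, not wdoc's actual code or API: the function names (retrieve, cluster, answer_query) and the document shape are hypothetical, word overlap stands in for real embedding similarity, and the three LLM roles are passed in as plain callables.

```python
from typing import Callable

# Hypothetical document shape for this sketch: {"source": str, "text": str}
Doc = dict


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query: str, docs: list[Doc], top_k: int) -> list[Doc]:
    """Stage 1: rank documents by similarity to the query and keep the best top_k."""
    ranked = sorted(docs, key=lambda d: overlap(query, d["text"]), reverse=True)
    return ranked[:top_k]


def cluster(answers: list[tuple[str, str]], threshold: float) -> list[list[tuple[str, str]]]:
    """Stage 4: greedily group (source, answer) pairs whose answers look alike."""
    clusters: list[list[tuple[str, str]]] = []
    for item in answers:
        for group in clusters:
            # Compare against the first answer of each existing group.
            if overlap(item[1], group[0][1]) >= threshold:
                group.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def answer_query(
    query: str,
    docs: list[Doc],
    evaluate: Callable[[str, str], bool],  # role of the "query_eval LLM"
    answer: Callable[[str, str], str],     # role of the "strong LLM"
    combine: Callable[[str, list], str],   # role of the "combiner LLM"
    top_k: int = 5,
    threshold: float = 0.3,
) -> str:
    candidates = retrieve(query, docs, top_k)
    kept = [d for d in candidates if evaluate(query, d["text"])]         # stage 2: filter
    answers = [(d["source"], answer(query, d["text"])) for d in kept]    # stage 3: answer
    return combine(query, cluster(answers, threshold))                   # stages 4 and 5


# Stub "LLMs" so the sketch runs without any model or API key.
def eve(query: str, text: str) -> bool:
    return overlap(query, text) > 0


def anna(query: str, text: str) -> str:
    return text


def carl(query: str, clusters: list) -> str:
    return " ".join(
        f"{group[0][1]} [{', '.join(src for src, _ in group)}]" for group in clusters
    )


docs = [
    {"source": "a.txt", "text": "wdoc filters retrieved documents"},
    {"source": "b.txt", "text": "wdoc filters retrieved documents first"},
    {"source": "c.txt", "text": "bananas are yellow"},
]
result = answer_query("how does wdoc filter documents", docs, eve, anna, carl)
print(result)
```

In this toy run the unrelated document is dropped by the evaluator stub, and the two near-identical answers fall into one cluster, so the combined answer cites both sources. Passing the three roles as callables keeps the stage order visible and lets each stub be swapped for a real model call.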
Quick Start & Requirements
Install: pip install -U wdoc
Optional extras: pip install -U wdoc[pdftotext] and pip install -U wdoc[fasttext]
API keys (e.g. OPENAI_API_KEY) set as environment variables.
Highlighted Details
Maintenance & Community
--help output.
Licensing & Compatibility
Limitations & Caveats
The project is in alpha status and may have instabilities, though issues are reportedly fixed quickly. The main branch is more stable than the development branch. Some advanced features like recursive summarization or handling extremely large documents might require careful configuration or may still be under active development.