dr-doc-search by namuan

CLI tool for conversing with PDF documents

created 2 years ago
600 stars

Top 55.2% on sourcepulse

Project Summary

This project enables users to converse with PDF documents by leveraging large language models for question answering. It's designed for researchers, students, and anyone needing to extract information from lengthy texts, offering a conversational interface to complex documents.

How It Works

The tool first extracts text from the PDF, using OCR (via Tesseract and ImageMagick) where needed for scanned documents. It then splits the text into chunks and generates an embedding for each chunk, using either OpenAI's models or HuggingFace alternatives. The embeddings are stored in an index so that the sections most relevant to a user's query can be retrieved efficiently. Finally, a language model (such as GPT-3 or a HuggingFace model) uses the retrieved sections as context to formulate an answer.
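
The retrieval step described above can be illustrated with a small, dependency-free sketch. This is not the project's code: the function names (chunk_text, embed, build_index, retrieve, build_prompt) are hypothetical, and a bag-of-words count vector stands in for a real OpenAI or HuggingFace embedding model so the example runs offline.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Pair every chunk with its vector (an in-memory list, not a real vector store)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    chunks = [
        "Tesseract performs OCR on scanned pages.",
        "Embeddings are stored in an index.",
        "The web UI is optional.",
    ]
    index = build_index(chunks)
    question = "Where are embeddings stored?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

In a real setup the embed function would call an embedding model and the list would be replaced by a persistent vector index; the ranking-then-prompting flow stays the same.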

Quick Start & Requirements

  • Install via pip: pip install dr-doc-search
  • Prerequisites: Tesseract OCR, ImageMagick. For Windows, set the IMCONV environment variable to the ImageMagick executable path.
  • Usage requires an OpenAI API key or HuggingFace models.
  • See documentation: https://namuan.github.io/dr-doc-search
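
Because setup depends on two external tools, a short preflight check can save a failed first run. The sketch below is an assumption-laden helper, not part of dr-doc-search: find_missing_tools is a hypothetical name, and it assumes the executables are called tesseract and either magick or convert. On Windows it accepts the IMCONV variable mentioned above in place of a PATH lookup.

```python
import os
import shutil


def find_missing_tools(which=shutil.which, env=os.environ):
    """Return the names of prerequisites that could not be located."""
    missing = []

    # Assumed executable name for Tesseract OCR.
    if which("tesseract") is None:
        missing.append("tesseract")

    # IMCONV is the variable named in the prerequisites; otherwise fall back
    # to the assumed ImageMagick executable names on PATH.
    imagemagick = env.get("IMCONV") or which("magick") or which("convert")
    if not imagemagick:
        missing.append("imagemagick")

    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

The lookup function and environment are passed in as parameters only so the check can be exercised without the tools installed.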

Highlighted Details

  • Supports both OpenAI and HuggingFace for embeddings and LLM inference.
  • Offers a command-line interface and an optional web UI.
  • Generates an index and stores intermediate files in a user-defined output directory.
  • Allows specifying page ranges for processing.

Maintenance & Community

  • Builds on LangChain, HoloViz Panel, and OpenAI.
  • Releases are automated via GitHub Actions upon version bumps.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

The project depends on external tools (Tesseract OCR and ImageMagick), which may complicate setup on some systems. No license is specified, which could hinder commercial adoption.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
