PdfGptIndexer by raghavan

RAG tool for indexing and searching PDF text data

Created 2 years ago

680 stars

Top 49.9% on SourcePulse

1 Expert Loves This Project

Jeffrey Morgan

Cofounder of Ollama

Project Summary

This tool indexes PDF documents for Retrieval-Augmented Generation (RAG), so users can query large collections of PDFs and get answers grounded in the retrieved text. It uses OpenAI's embedding models to vectorize the text and FAISS for efficient similarity search.

How It Works

The system extracts text from PDFs with textract and splits it into chunks using Hugging Face transformers. Each chunk is converted into a vector embedding with OpenAI's text-embedding-ada-002 via langchain. The embeddings are stored locally in a FAISS index, so similarity search runs on the local machine and the documents do not need to be re-embedded for each query. A query interface lets users ask questions, retrieves the most relevant chunks, and returns an answer.
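
The chunk, embed, index, and query steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: chunk_words, toy_embed, and TinyIndex are hypothetical names, whitespace splitting stands in for tokenizer-based chunking, a hashed bag-of-words vector stands in for text-embedding-ada-002, and a brute-force cosine scan stands in for FAISS.

```python
import hashlib
import math


def chunk_words(text, size=8, overlap=2):
    """Split text into overlapping word windows (stand-in for token-based chunking)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=256):
    """Hashed bag-of-words vector, L2-normalised (stand-in for a real embedding model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if not word:
            continue
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class TinyIndex:
    """Brute-force cosine-similarity index (stand-in for a FAISS index)."""

    def __init__(self):
        self.vectors = []
        self.chunks = []

    def add(self, chunk):
        self.vectors.append(toy_embed(chunk))
        self.chunks.append(chunk)

    def search(self, query, k=1):
        """Return the k (score, chunk) pairs most similar to the query."""
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk)
            for v, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    index = TinyIndex()
    for piece in chunk_words("FAISS stores dense vectors so that similar text can be found quickly"):
        index.add(piece)
    print(index.search("find similar text", k=1))
```

In a full RAG setup the retrieved chunks would then be passed to a language model as context for the answer; that step is omitted here.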

Quick Start & Requirements

Install dependencies: pip install langchain openai textract transformers faiss-cpu pypdf tiktoken
Requires an OpenAI API key.
Run: python3 pdf_gpt_indexer.py
Ensure PDF document and output directories exist.
Further guidance is available in the project's Hacker News feature and in a guide on using ChatGPT with custom data.
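
The prerequisites above (an API key and existing directories) can be checked before running the script. The helper below is a hedged sketch: preflight is a hypothetical name, the directory arguments are placeholders rather than paths taken from the project, and reading the key from the OPENAI_API_KEY environment variable is an assumption about how the script is configured.

```python
import os
from pathlib import Path


def preflight(pdf_dir, index_dir, env=None):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: the key is supplied through the OPENAI_API_KEY environment variable.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for label, folder in (("PDF directory", pdf_dir), ("output directory", index_dir)):
        if not Path(folder).is_dir():
            problems.append(f"{label} does not exist: {folder}")
    return problems


if __name__ == "__main__":
    # Placeholder directory names, for illustration only.
    for problem in preflight("pdfs", "index"):
        print("problem:", problem)
```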

Highlighted Details

Leverages FAISS for efficient, local storage of dense vector embeddings.
Enables offline access and compute savings by pre-computing embeddings.
Designed for scalability with large datasets.
Utilizes textract for broad PDF text extraction capabilities.
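
The compute savings from pre-computing embeddings come from embedding each chunk once and reusing the stored result. The sketch below illustrates that idea with a JSON file keyed by content hash. It is an assumption-laden illustration, not the project's mechanism: EmbeddingCache is a hypothetical class, embed_fn is any caller-supplied embedding function, and a real setup would persist the FAISS index itself rather than a JSON file.

```python
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """Embed each distinct chunk once and reuse the stored vector afterwards."""

    def __init__(self, path, embed_fn):
        self.path = Path(path)
        self.embed_fn = embed_fn  # hypothetical: any function mapping str -> list of floats
        self.calls = 0            # number of times embed_fn was actually invoked
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, chunk):
        # Key by content hash so an unchanged chunk is never embedded twice.
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn(chunk)
            self.calls += 1
        return self.store[key]

    def save(self):
        """Write the cache to disk so later runs can skip embedding entirely."""
        self.path.write_text(json.dumps(self.store))
```

With a paid embedding API, the calls counter corresponds to the number of billable requests, so a second run over unchanged PDFs would make none.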

Maintenance & Community

The project was featured on Hacker News. The README mentions no community channels or roadmap.

Licensing & Compatibility

The README does not explicitly state a license. The use of OpenAI API and libraries like Langchain implies adherence to their respective terms of service and licenses. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project requires an OpenAI API key, incurring costs for embedding generation. The README does not detail performance benchmarks or specific limitations regarding PDF complexity or size. It relies on external libraries, and their compatibility or potential breaking changes could impact the tool.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days