doc-solver  by ai-hermes

ChatGPT chatbot for PDF documents

created 1 year ago
339 stars

Top 82.4% on sourcepulse

Project Summary

This project provides a Next.js-based chatbot that allows users to query their PDF documents using GPT-4 via LangChain and Pinecone. It's designed for developers and researchers who need to build custom Q&A systems over private document sets. The primary benefit is enabling conversational access to information contained within PDF files.

How It Works

The application leverages LangChain for orchestrating LLM interactions and document processing. PDFs are converted into text, chunked, and then embedded using OpenAI's models. These embeddings, along with the original text chunks, are stored in Pinecone, a vector database, for efficient similarity search. When a user asks a question, it's embedded and used to retrieve the most relevant text chunks from Pinecone, which are then passed to GPT-4 along with the question to generate an answer.
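The retrieval flow described above can be sketched in a few dozen lines. This is a minimal, self-contained illustration and not the project's code: `toyEmbed` is a hypothetical word-hashing stand-in for OpenAI embeddings, `InMemoryVectorStore` is a hypothetical stand-in for Pinecone, and `chunkText` and `buildPrompt` are illustrative helpers. No LangChain, OpenAI, or Pinecone API is called.

```typescript
// Toy stand-in for an embedding model: hashes each word into a small fixed-size vector.
// The real project uses OpenAI embeddings (1536 dimensions); this is for illustration only.
const DIMS = 64;

function toyEmbed(text: string): number[] {
  const vec = new Array<number>(DIMS).fill(0);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const word of words) {
    let hash = 0;
    for (let i = 0; i < word.length; i++) {
      hash = (hash * 31 + word.charCodeAt(i)) % DIMS;
    }
    vec[hash] += 1;
  }
  return vec;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Split text into chunks of at most `size` characters, with `overlap`
// characters shared between consecutive chunks.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}

interface StoredChunk {
  text: string;
  embedding: number[];
}

// Hypothetical in-memory replacement for the vector database.
class InMemoryVectorStore {
  private items: StoredChunk[] = [];

  add(chunks: string[]): void {
    for (const text of chunks) {
      this.items.push({ text, embedding: toyEmbed(text) });
    }
  }

  // Return the `topK` stored chunks most similar to the question.
  query(question: string, topK: number): string[] {
    const q = toyEmbed(question);
    return this.items
      .map((item) => ({ text: item.text, score: cosineSimilarity(q, item.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((r) => r.text);
  }
}

// Combine the retrieved chunks and the question into one prompt for the LLM.
function buildPrompt(question: string, context: string[]): string {
  return `Answer using only the context below.\n\nContext:\n${context.join("\n---\n")}\n\nQuestion: ${question}`;
}
```

In a real deployment the prompt returned by `buildPrompt` would be sent to the chat model; that call is omitted here because it needs network access and credentials.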

Quick Start & Requirements

  • Install: yarn install
  • Prerequisites: Node.js 18 or later, Yarn, an OpenAI API key, and a Pinecone API key, environment, and index name. The Pinecone index must be created with 1536 vector dimensions.
  • Setup: Requires configuring API keys and Pinecone details in a .env file. Ingesting documents involves placing PDFs in the docs folder and running yarn run ingest.
  • Run: npm run dev
  • Demo: https://docsolver.spotty.com.cn/
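Because setup depends on several secrets in the .env file, a startup check that fails fast on missing values can save debugging time. The sketch below is one way to do that. The variable names are assumptions chosen for illustration and are not confirmed by this summary, so check the project's README for the actual names.

```typescript
// Hypothetical variable names; the project's real .env keys may differ.
const REQUIRED_ENV_VARS = [
  "OPENAI_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_ENVIRONMENT",
  "PINECONE_INDEX_NAME",
] as const;

type EnvName = (typeof REQUIRED_ENV_VARS)[number];
type AppConfig = Record<EnvName, string>;

// Validate that every required variable is present and non-blank,
// then return them as a typed config object.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_ENV_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const name of REQUIRED_ENV_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, before the ingest script or the dev server runs.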

Highlighted Details

  • Utilizes GPT-4 API for enhanced response quality.
  • Supports multiple PDF files for ingestion.
  • Customizable QA prompt within utils/makechain.ts.
  • Frontend inspired by langchain-chat-nextjs.

Maintenance & Community

  • No specific contributors, sponsorships, or community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • Licensed under the Apache License, Copyright © 2021-present doc-solver. This license is permissive and generally compatible with commercial use.

Limitations & Caveats

The application requires access to the GPT-4 API and will not work without it. Scanned PDFs, or others that need OCR, may not be processed correctly unless they are first converted to text. Pinecone starter plan indexes are deleted after 7 days of inactivity, after which documents may need to be re-ingested.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 0 stars in the last 90 days
