chatWeb by SkywalkerDarren

CLI tool for question answering and summarization over web pages and documents

Created 2 years ago
909 stars

Top 40.1% on SourcePulse

Project Summary

ChatWeb is a tool for extracting and summarizing content from web pages, PDFs, DOCX, and TXT files, enabling users to ask questions based on the provided text. It targets users who need to process and query large documents or web content, offering a way to overcome token limits by leveraging embeddings and a vector database.

How It Works

The system crawls web pages or extracts text from documents, then calls OpenAI's embedding API (described in the README as GPT-3.5's embedding API) to create a vector representation of each text segment. These vectors are stored in a vector database. At query time, the tool derives keywords from the user's input and embeds those keywords rather than the entire query, which is intended to improve search accuracy. A nearest neighbor search over the stored vectors retrieves the most relevant text segments, and GPT-3.5's chat API then formulates an answer from the retrieved segments.
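The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes a toy bag-of-words vector in place of the OpenAI embedding API, a hypothetical stop-word filter in place of the project's keyword extraction, and an in-memory list in place of the vector database. The names `extract_keywords`, `embed`, and `top_k` are invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "does", "do",
              "of", "to", "in", "on", "for", "and", "or", "it", "this", "that"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(query):
    """Stand-in for keyword extraction: drop stop words from the query."""
    return [t for t in tokenize(query) if t not in STOP_WORDS]


def embed(tokens):
    """Toy embedding: a sparse bag-of-words count vector.
    A real system would call an embedding API here instead."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, segments, k=2):
    """Return the k segments most similar to the query's keyword vector."""
    query_vec = embed(extract_keywords(query))
    scored = [(cosine(query_vec, embed(tokenize(s))), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


segments = [
    "pgvector adds vector similarity search to PostgreSQL.",
    "The web UI listens on port 7860 by default.",
    "Embeddings map text segments to numeric vectors.",
]
best = top_k("What is the default port of the web UI?", segments, k=1)
```

Here `best` holds the segment about port 7860, since it is the only one that shares keywords with the question. In the tool itself, the retrieved segments would then be passed to the chat API as context for the answer.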

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies (pip3 install -r requirements.txt), and run python3 main.py. Docker is also supported via docker-compose up.
  • Prerequisites: Python 3, OpenAI API key.
  • Optional: PostgreSQL with the pgvector extension for persistent storage.
  • Configuration: Edit config.json to set API keys, language, mode (console, api, webui), streaming, temperature, and proxy settings.
  • Web UI: in webui mode, the interface is served locally at http://localhost:7860 (default web UI port).
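As a rough illustration of the configuration step, the sketch below loads a JSON configuration and checks the mode and temperature before use. The key names (`api_key`, `language`, `mode`, `temperature`), the `console` default, and the `load_config` helper are assumptions made for this example; the actual keys are defined by the repository's config.json and should be checked there.

```python
import json

# Modes named in the configuration bullet above.
VALID_MODES = {"console", "api", "webui"}


def load_config(text):
    """Parse config JSON and check the fields this sketch assumes."""
    config = json.loads(text)
    # Assumed default: fall back to console mode when none is given.
    mode = config.get("mode", "console")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # The OpenAI chat API accepts temperatures between 0 and 2.
    temperature = float(config.get("temperature", 0.0))
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    config["mode"] = mode
    config["temperature"] = temperature
    return config


# Hypothetical key names; check the repository's config.json for the real ones.
example = '{"api_key": "sk-...", "language": "English", "mode": "webui", "temperature": 0.1}'
config = load_config(example)
```

Validating up front means a typo in the mode name fails with a clear message at startup rather than partway through a session.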

Highlighted Details

  • Supports web page crawling and extraction from PDF, DOCX, TXT files.
  • Offers multiple operational modes: console, API, and web UI.
  • Includes optional PostgreSQL integration with pgvector for enhanced data management.
  • Features configurable OpenAI API proxy settings and response temperature.

Maintenance & Community

The project is maintained by SkywalkerDarren, though the health check on this page reports the last commit as 1 year ago and responsiveness as inactive. The README does not provide further community or roadmap details.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The project relies primarily on OpenAI's GPT-3.5 API, so usage incurs API costs. The README marks many features as implemented, but its to-do list still contains open items, such as "Other features that have not been thought of yet".

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
