chatWeb by SkywalkerDarren

CLI tool for question answering and summarization over web pages and documents

Created 2 years ago

912 stars

Top 39.9% on SourcePulse

Project Summary

ChatWeb extracts text from web pages and from PDF, DOCX, and TXT files, summarizes it, and answers user questions about it. It targets users who need to query documents or web content too large to fit in a single prompt: rather than sending the whole text to the model, it uses embeddings and a vector database to retrieve only the relevant segments, which works around the model's token limit.

How It Works

The system crawls a web page or extracts text from a document, then uses OpenAI's embedding API to create a vector for each text segment. These vectors are stored in a vector database. When the user asks a question, the tool derives keywords from the input and embeds those keywords rather than the entire query, which the project presents as a way to improve search accuracy. A nearest-neighbor search over the stored vectors then retrieves the most relevant segments, and GPT-3.5's chat API formulates an answer from them.
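The retrieval flow described above can be sketched in a few lines. This is a minimal illustration, not chatWeb's code: `extract_keywords` uses a hypothetical stop-word filter where the project may use a different keyword step, and `embed` is a toy term-count vector standing in for a call to an embedding API. All function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for the example; not taken from chatWeb.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "does",
              "do", "of", "to", "in", "on", "for", "and", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(query):
    """Reduce a query to keywords (assumed approach: drop stop words)."""
    return [t for t in tokenize(query) if t not in STOP_WORDS]


def embed(tokens):
    """Toy embedding: a sparse term-count vector.
    A real pipeline would call an embedding API here instead."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(segments):
    """Pair every text segment with its vector (the 'vector database')."""
    return [(seg, embed(tokenize(seg))) for seg in segments]


def search(index, query, k=2):
    """Embed the query's keywords and return the k nearest segments."""
    q = embed(extract_keywords(query))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [seg for seg, _ in ranked[:k]]


segments = [
    "The pgvector extension adds vector similarity search to PostgreSQL.",
    "Docker Compose starts the application and its database together.",
    "Embeddings map each text segment to a vector.",
]
index = build_index(segments)
top = search(index, "What is the pgvector extension?", k=1)
print(top[0])
```

In a full pipeline, the segments returned by `search` would be placed into the chat prompt together with the user's question, so only a small, relevant part of the source text counts against the token limit.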

Quick Start & Requirements

Installation: Clone the repository, install dependencies (pip3 install -r requirements.txt), and run python3 main.py. Docker is also supported via docker-compose up.
Prerequisites: Python 3, OpenAI API key.
Optional: PostgreSQL with the pgvector extension for persistent storage.
Configuration: Edit config.json to set API keys, language, mode (console, api, webui), streaming, temperature, and proxy settings.
Web UI: served locally at http://localhost:7860 (default web UI port).
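As a rough illustration of the configuration step, the sketch below parses a JSON config and checks the settings listed above. The key names (`mode`, `temperature`, `use_stream`) and the default values are hypothetical placeholders; the real keys are whatever the project's config.json defines, so check that file before relying on any of them. Only the three mode values come from the list above.

```python
import json

# Mode values named in the project summary.
ALLOWED_MODES = {"console", "api", "webui"}


def load_config(text):
    """Parse config JSON and validate a few settings.
    Key names and defaults are assumptions for this example."""
    cfg = json.loads(text)

    mode = cfg.get("mode", "console")  # assumed default
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")

    temperature = float(cfg.get("temperature", 0.0))  # assumed default
    # The OpenAI chat API accepts temperatures from 0 to 2.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature out of range: {temperature}")

    stream = bool(cfg.get("use_stream", False))  # hypothetical key
    return {"mode": mode, "temperature": temperature, "use_stream": stream}


example = '{"mode": "webui", "temperature": 0.1, "use_stream": true}'
settings = load_config(example)
print(settings["mode"])
```

Validating the mode and temperature up front surfaces a typo in the config file at startup rather than on the first API call.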

Highlighted Details

Supports web page crawling and extraction from PDF, DOCX, TXT files.
Offers multiple operational modes: console, API, and web UI.
Includes optional PostgreSQL integration with pgvector for enhanced data management.
Features configurable OpenAI API proxy settings and response temperature.
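For the optional PostgreSQL integration, a nearest-neighbor lookup with pgvector might look like the sketch below. The table name `segments` and the columns `content` and `embedding` are hypothetical; chatWeb's actual schema is not described here. The `<=>` operator is pgvector's cosine-distance operator, and the `%s` placeholders follow the psycopg parameter style.

```python
def to_vector_literal(vec):
    """Format a sequence of numbers as pgvector's text form, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def nearest_segments_query(query_vec, k=5):
    """Build a parameterized query returning the k segments closest
    to query_vec by cosine distance. Table and column names are
    hypothetical."""
    sql = (
        "SELECT content FROM segments "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (to_vector_literal(query_vec), k)


sql, params = nearest_segments_query([1, 0.5, 0.25], k=3)
print(sql)
print(params)
```

The returned pair can be passed to a database cursor as `cursor.execute(sql, params)`; keeping the vector as a bound parameter avoids building SQL by string concatenation.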

Maintenance & Community

The project is actively maintained by SkywalkerDarren. Further community or roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The project relies primarily on OpenAI's GPT-3.5 API, so usage incurs the associated API costs. The README lists many features as implemented, but some items, such as "Other features that have not been thought of yet", remain open.

Explore Similar Projects

yacy_expert by yacy

ask.py by pengfeng

embedchainjs by mem0ai

wait-but-why-gpt by mckaywrigley

ai-template by Jordan-Gilliam

DataChad by gustavz

web-explorer by langchain-ai

semantic-search-nextjs-pinecone-langchain-chatgpt by dabit3

orama by oramasearch

LangChain-ChatGLM-Webui by X-D-Lab

chatgpt-retrieval by techleadhd

WeKnora by Tencent