CLI tool for question answering and summarization over web pages and documents
Top 40.9% on sourcepulse
ChatWeb is a tool for extracting and summarizing content from web pages, PDFs, DOCX, and TXT files, enabling users to ask questions based on the provided text. It targets users who need to process and query large documents or web content, offering a way to overcome token limits by leveraging embeddings and a vector database.
How It Works
The system crawls web pages or extracts text from documents, then uses OpenAI's embedding API to create a vector representation of each text segment. These vectors are stored in a vector database, which supports nearest-neighbor search for retrieving relevant segments. A key innovation is that the search vector is generated from keywords derived from the user's input rather than from the entire query, which improves search accuracy. GPT-3.5's chat API then formulates an answer from the retrieved segments.
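The retrieval flow described above can be illustrated with a small, self-contained sketch. It is not ChatWeb's actual code: the names `embed`, `extract_keywords`, `VectorStore`, and `build_prompt` are hypothetical, a sparse bag-of-words vector stands in for the embedding API, a stopword filter stands in for keyword extraction, and an in-memory list stands in for the vector database.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list stands in for real keyword extraction.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "for", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding API: a sparse token-count vector."""
    return Counter(tokenize(text))


def extract_keywords(question):
    """Keep only the non-stopword tokens of the question, in order."""
    return [tok for tok in tokenize(question) if tok not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (vector, segment text) pairs

    def add(self, segment):
        self.rows.append((embed(segment), segment))

    def search(self, query_vector, k=1):
        """Return the k segments most similar to the query vector."""
        ranked = sorted(self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [segment for _, segment in ranked[:k]]


def retrieve(segments, question, k=1):
    """Index the segments, then search with a vector built from keywords only."""
    store = VectorStore()
    for segment in segments:
        store.add(segment)
    query_vector = embed(" ".join(extract_keywords(question)))
    return store.search(query_vector, k)


def build_prompt(question, context_segments):
    """Assemble the text that would be sent to a chat model (hypothetical format)."""
    context = "\n".join(context_segments)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = [
        "ChatWeb crawls web pages and extracts text from PDF, DOCX, and TXT files.",
        "Each text segment is converted to a vector with an embedding API.",
        "Vectors can be kept in PostgreSQL with the pgvector extension for persistent storage.",
    ]
    question = "What is used for persistent storage?"
    print(build_prompt(question, retrieve(docs, question)))
```

Only the retrieved segments, not the whole document, go into the prompt, which is how this approach sidesteps the model's token limit. In a real deployment, `embed` would call an embedding endpoint and `VectorStore.search` would be a nearest-neighbor query against the database.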
Quick Start & Requirements
Install dependencies (pip3 install -r requirements.txt), and run python3 main.py. Docker is also supported via docker-compose up.
The pgvector extension provides persistent storage.
Edit config.json to set API keys, language, mode (console, api, webui), streaming, temperature, and proxy settings.
Highlighted Details
pgvector support for enhanced data management.
Maintenance & Community
The project is actively maintained by SkywalkerDarren. Further community or roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The project relies primarily on OpenAI's GPT-3.5 API, which incurs usage costs. Most items on the README's feature list are marked as implemented, though some, such as "Other features that have not been thought of yet", remain open.