wiseflow by TeamWiseFlow

AI-powered web crawler for wide information gathering

Created 1 year ago

7,945 stars

Top 6.5% on SourcePulse

1 Expert Loves This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Project Summary

Wiseflow is an AI-powered information extraction tool designed to help users sift through vast amounts of data from diverse sources to find what's relevant. It targets users who need to monitor industries, gather background information, or collect customer intelligence without the need for specific, deep-dive queries. The primary benefit is saving time and filtering out noise by focusing on "wide search" scenarios.

How It Works

Wiseflow employs a "crawl-and-search integration" strategy, departing from traditional filter-extractor pipelines. Instead of treating each page as a unit, it segments HTML content into "main text blocks" and "external link blocks." Different LLM extraction strategies are applied to each block type to optimize token usage and relevance. Main text blocks are summarized based on user-defined focus points, while external link blocks are analyzed to intelligently decide which links warrant further exploration, eliminating the need for manual configuration of crawl depth or quantity.
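The block-splitting idea can be illustrated with a minimal sketch. This is not Wiseflow's actual implementation: the link-density heuristic, the `BLOCK_TAGS` set, the `LINK_DENSITY_THRESHOLD` value, and the `segment` function are all assumptions made for illustration. It splits a page at block-level tags and labels each block "main" or "link" depending on how much of its text sits inside anchors.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; an assumption for this sketch.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "nav", "section", "article", "h1", "h2", "h3"}
# Hypothetical cut-off: blocks whose text is at least half anchor text count as link blocks.
LINK_DENSITY_THRESHOLD = 0.5


class BlockSegmenter(HTMLParser):
    """Collects text blocks and labels each one 'main' or 'link'."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks: {"kind", "text", "links"}
        self._parts = []        # text fragments of the current block
        self._links = []        # hrefs seen in the current block
        self._link_chars = 0    # characters of text that sit inside <a> tags
        self._in_link = False

    def _flush(self):
        # Close the current block, classify it by link density, and start a new one.
        text = " ".join(self._parts)
        if text:
            density = self._link_chars / len(text)
            kind = "link" if density >= LINK_DENSITY_THRESHOLD else "main"
            self.blocks.append({"kind": kind, "text": text, "links": list(self._links)})
        self._parts, self._links, self._link_chars = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = True
            href = dict(attrs).get("href")
            if href:
                self._links.append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        piece = " ".join(data.split())  # normalize whitespace
        if piece:
            self._parts.append(piece)
            if self._in_link:
                self._link_chars += len(piece)

    def close(self):
        super().close()
        self._flush()


def segment(html):
    """Return the list of classified blocks for an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

In a pipeline like the one described, "main" blocks would be passed to a summarization prompt built from the user's focus points, while "link" blocks would be passed to a separate prompt that decides which URLs to follow. Both prompts are outside the scope of this sketch.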

Quick Start & Requirements

Installation:
1. Clone the repository.
2. Run the install_pocketbase script (Linux/macOS: chmod +x install_pocketbase && ./install_pocketbase; Windows: install_pocketbase.ps1).
3. Configure core/.env.
4. Create a virtual environment (conda create -n wiseflow python=3.12 && conda activate wiseflow).
5. Install dependencies (cd wiseflow/core && pip install -r requirements.txt).
6. Install Playwright (python -m playwright install --with-deps chromium).
7. Run the application (Linux/macOS: chmod +x run.sh && ./run.sh; Windows: python windows_run.py).
Prerequisites: Python 3.12, PocketBase (version 0.23.4 recommended), and Playwright with Chromium. An LLM service compatible with the OpenAI SDK is required (e.g., Siliconflow, AiHubMix, Ollama, Xinference), along with a Jina API key for search.
Setup: Configuration involves setting LLM API keys, base URLs, model names, and PocketBase credentials in .env.
Links: Online experience: https://www.aihubmix.com/. Docker deployment is also supported.
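Because the setup depends on several values in core/.env, a quick pre-flight check can catch missing settings before the first run. The sketch below is a minimal example using only the standard library. The key names in `REQUIRED_KEYS` are hypothetical placeholders, so check the project's own env sample for the real names. The helper functions are likewise illustrative and not part of Wiseflow.

```python
from pathlib import Path

# Hypothetical key names for this sketch; the project's env sample defines the real ones.
REQUIRED_KEYS = ("LLM_API_KEY", "LLM_API_BASE", "PRIMARY_MODEL", "PB_API_AUTH")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]


def check_env_file(path="core/.env"):
    """Read an env file from disk and report which required keys still need values."""
    return missing_keys(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file()` from the repository root returns an empty list when every required key has a value, and otherwise lists the keys that still need to be filled in.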

Highlighted Details

Specializes in "wide search" for broad information gathering, contrasting with "deep search" tools.
Features a unique "crawl-and-search integration" pipeline for efficient content block analysis.
Offers specialized parsing modules for specific content types, including WeChat articles.
Supports various LLM providers and local deployments that are compatible with the OpenAI SDK.

Maintenance & Community

Recent contributors include @zhudongwork and @cdxiaodong.
The project is open source under Apache 2.0. Contact the maintainers about commercial partnerships.

Licensing & Compatibility

Licensed under Apache 2.0.
Permissive for commercial use and closed-source linking.

Limitations & Caveats

The online service is limited for users outside mainland China and does not support WeChat official accounts. For these scenarios, self-deploying the open-source version is recommended. The 4.x plan aims to introduce an "insight module" for analyzing "hidden information" within fetched data.

Health Check

Last Commit

5 days ago

Responsiveness

1 day

Star History

47 stars in the last 30 days