wiseflow  by TeamWiseFlow

AI-powered web crawler for wide information gathering

created 1 year ago
7,719 stars

Top 6.9% on sourcepulse

GitHubView on GitHub
Project Summary

Wiseflow is an AI-powered information extraction tool designed to help users sift through vast amounts of data from diverse sources to find what's relevant. It targets users who need to monitor industries, gather background information, or collect customer intelligence without the need for specific, deep-dive queries. The primary benefit is saving time and filtering out noise by focusing on "wide search" scenarios.

How It Works

Wiseflow employs a "crawl-and-search integration" strategy, departing from traditional filter-extractor pipelines. Instead of treating each page as a unit, it segments HTML content into "main text blocks" and "external link blocks." Different LLM extraction strategies are applied to each block type to optimize token usage and relevance. Main text blocks are summarized based on user-defined focus points, while external link blocks are analyzed to intelligently decide which links warrant further exploration, eliminating the need for manual configuration of crawl depth or quantity.

Quick Start & Requirements

  • Installation: Clone the repository, run install_pocketbase script (Linux/macOS: chmod +x install_pocketbase && ./install_pocketbase; Windows: install_pocketbase.ps1), configure core/.env, create a virtual environment (conda create -n wiseflow python=3.12 && conda activate wiseflow), install dependencies (cd wiseflow/core && pip install -r requirements.txt), install Playwright (python -m playwright install --with-deps chromium), and run (chmod +x run.sh && ./run.sh or python windows_run.py).
  • Prerequisites: Python 3.12, PocketBase (version 0.23.4 recommended), Playwright with Chromium. LLM services compatible with OpenAI SDK are required (e.g., Siliconflow, AiHubMix, Ollama, Xinference). Jina API key for search.
  • Setup: Configuration involves setting LLM API keys, base URLs, model names, and PocketBase credentials in .env.
  • Links: Online experience: https://www.aihubmix.com/. Docker deployment is also supported.

Highlighted Details

  • Specializes in "wide search" for broad information gathering, contrasting with "deep search" tools.
  • Features a unique "crawl-and-search integration" pipeline for efficient content block analysis.
  • Offers specialized parsing modules for specific content types, including WeChat articles.
  • Supports various LLM providers and local deployments compatible with OpenAI SDK.

Maintenance & Community

  • Recent contributors include @zhudongwork and @cdxiaodong.
  • Project is open-source under Apache 2.0. Contact for commercial partnerships.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive for commercial use and closed-source linking.

Limitations & Caveats

The online service has limitations for non-mainland China users and does not support WeChat official accounts. For these scenarios, self-deployment of the open-source version is recommended. The 4.x plan aims to introduce an "insight module" for analyzing "hidden information" within fetched data.

Health Check
Last commit

3 days ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
2
Star History
360 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Alex Cheema Alex Cheema(Cofounder of EXO Labs), and
3 more.

Perplexica by ItzCrazyKns

0.3%
23k
AI-powered search engine alternative
created 1 year ago
updated 1 day ago
Feedback? Help us improve.