harvester by wzdnzd

Intelligent data acquisition framework for GitHub and web sources

Created 5 months ago

537 stars

Top 59.1% on SourcePulse

Project Summary

Harvester is an adaptive data acquisition framework for gathering information from diverse sources such as GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. It targets developers and researchers who need to automate data collection, and it offers an extensible plugin-based architecture and a multi-stage processing pipeline for tasks such as discovering API keys for AI service providers.

How It Works

Harvester uses a layered architecture in which a core pipeline engine orchestrates data acquisition through four stages: Search, Gather, Check, and Inspect. The Search stage uses a Query Optimization Engine that applies mathematical modeling and enumeration strategies to query data sources such as GitHub efficiently. The subsequent stages acquire detailed information, validate credentials against AI service endpoints, and inspect API capabilities. New data sources and processors can be added as plugins, and processing is asynchronous and multi-threaded, with adaptive rate limiting and error handling.
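The stage sequence described above can be illustrated with a small, generic sketch. It is not Harvester's actual code: the `Item` class, the four stage functions, and `run_pipeline` are hypothetical names chosen for illustration, and each stage only manipulates placeholder data. The one assumption about behavior is that a stage may drop an item by returning `None`, which stops processing for that item.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional, Sequence


@dataclass
class Item:
    """Hypothetical unit of work passed from stage to stage."""
    source: str
    data: dict = field(default_factory=dict)


# A stage takes an item and returns it (possibly enriched) or None to drop it.
Stage = Callable[[Item], Awaitable[Optional[Item]]]


async def search(item: Item) -> Optional[Item]:
    # Placeholder: a real Search stage would query a data source.
    item.data["hits"] = [f"{item.source}/result-1"]
    return item


async def gather(item: Item) -> Optional[Item]:
    # Placeholder: a real Gather stage would fetch details for each hit.
    item.data["details"] = {hit: "raw content" for hit in item.data["hits"]}
    return item


async def check(item: Item) -> Optional[Item]:
    # Placeholder: drop the item when nothing was gathered.
    return item if item.data["details"] else None


async def inspect(item: Item) -> Optional[Item]:
    # Placeholder: mark the item as fully processed.
    item.data["inspected"] = True
    return item


STAGES: Sequence[Stage] = (search, gather, check, inspect)


async def run_pipeline(item: Item, stages: Sequence[Stage] = STAGES) -> Optional[Item]:
    """Run the stages in order, stopping early if a stage drops the item."""
    current: Optional[Item] = item
    for stage in stages:
        current = await stage(current)
        if current is None:
            return None
    return current


result = asyncio.run(run_pipeline(Item("example-source")))
print(result)
```

Modeling each stage as an independent coroutine keeps the stages swappable, which is one plausible way to support the plugin-style extension the summary mentions.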

Quick Start & Requirements

Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
Requirements: Python 3.10+. Optional: uvloop for performance on Linux/macOS.
Configuration: Generate a default configuration with python main.py --create-config, or copy one of the examples (examples/config-simple.yaml, examples/config-full.yaml), then edit config.yaml. Key settings include GitHub credentials, provider settings, rate limits, and pipeline threads.
Running: Execute python main.py with optional configuration flags.
Documentation: Refer to the project's README and DeepWiki for detailed guides.

Highlighted Details

Advanced Query Optimization Engine: Employs mathematical modeling, splittability analysis, and enumeration strategies to optimize search queries.
Multi-Stage Pipeline: Provides a configurable four-stage processing flow (Search, Gather, Check, Inspect) executed as a directed acyclic graph (DAG).
Extensible Architecture: Supports plugin-style providers and custom pipeline stages for easy integration of new data sources and processors.
Enterprise-Ready Features: Includes fault tolerance, state persistence, credential management, rate limiting, and real-time monitoring.
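As a rough illustration of the DAG execution mentioned in the list above, the sketch below derives a valid run order from stage dependencies using Python's standard `graphlib` module. The dependency map and the `execution_order` helper are assumptions made for this example; the source does not describe how Harvester represents or schedules its stage graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each stage maps to the set of stages it depends on.
STAGE_DEPENDENCIES = {
    "search": set(),
    "gather": {"search"},
    "check": {"gather"},
    "inspect": {"check"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every stage runs after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(STAGE_DEPENDENCIES))
```

With a strictly linear chain like this one there is only one valid order. A graph with independent branches would allow several, and a scheduler could then run the independent stages concurrently.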

Maintenance & Community

The project is actively maintained, with contributions welcome. Users can contact maintainers via GitHub Issues.

Licensing & Compatibility

Licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Commercial use is strictly prohibited without explicit written permission.

Limitations & Caveats

This project is intended solely for educational and technical research purposes. Users must comply with all applicable laws, regulations, and third-party terms of service. The authors disclaim responsibility for any misuse, legal issues, or damages. Respect for intellectual property and privacy is paramount.

