harvester by wzdnzd

Intelligent data acquisition framework for GitHub and web sources

Created 5 months ago
537 stars

Top 59.1% on SourcePulse

Project Summary

Harvester is a universal, adaptive data acquisition framework designed for comprehensive information gathering from diverse sources like GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. It targets developers and researchers needing to automate data collection, offering an extensible plugin-based architecture and an intelligent, multi-stage processing pipeline for tasks such as AI service provider key discovery.

How It Works

Harvester employs a layered architecture with a core pipeline engine that orchestrates data acquisition through a series of stages: Search, Gather, Check, and Inspect. The Search stage utilizes an advanced Query Optimization Engine, leveraging mathematical modeling and enumeration strategies to efficiently query data sources like GitHub. Subsequent stages acquire detailed information, validate credentials against AI service endpoints, and inspect API capabilities. The framework is built for extensibility, supporting a plugin architecture for new data sources and processors, and features asynchronous, multi-threaded processing with adaptive rate limiting and robust error handling.
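The staged flow described above can be sketched as a minimal pipeline object that chains stage functions. This is an illustrative sketch with hypothetical names, not harvester's actual API; the real engine adds DAG execution, async workers, and rate limiting.

```python
# Minimal sketch of a staged pipeline (hypothetical names, not the actual
# harvester API). Each stage consumes a stream of items and yields the
# survivors to the next stage: Search -> Gather -> Check -> Inspect.
from dataclasses import dataclass, field
from typing import Callable, Iterable

Stage = Callable[[Iterable[dict]], Iterable[dict]]

@dataclass
class Pipeline:
    stages: list[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> "Pipeline":
        # Plugin-style registration: any callable with the Stage shape fits.
        self.stages.append(stage)
        return self

    def run(self, seed: Iterable[dict]) -> list[dict]:
        items = seed
        for stage in self.stages:
            items = stage(items)
        return list(items)

def search(queries):
    # Stub Search stage: emit one candidate hit per query.
    for q in queries:
        yield {"query": q["query"], "candidate": f"hit-for-{q['query']}"}

def check(items):
    # Stub Check stage: drop candidates that fail validation.
    for item in items:
        if item["candidate"]:
            yield item

pipeline = Pipeline().add_stage(search).add_stage(check)
results = pipeline.run([{"query": "api_key"}])
print(results)  # [{'query': 'api_key', 'candidate': 'hit-for-api_key'}]
```

Because stages share one calling convention, adding a new data source or processor is just another function appended to the chain, which is the essence of the plugin architecture the summary describes.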

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
  • Requirements: Python 3.10+. Optional: uvloop for performance on Linux/macOS.
  • Configuration: Generate a default configuration (python main.py --create-config) or copy from examples (examples/config-simple.yaml, examples/config-full.yaml) and edit config.yaml. Key configuration includes GitHub credentials, provider settings, rate limits, and pipeline threads.
  • Running: Execute python main.py with optional configuration flags.
  • Documentation: Refer to the project's README and DeepWiki for detailed guides.
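The installation and configuration steps above condense to a few commands. The repository URL is assumed from the author and project name shown on this page; the flags are those quoted in the bullets.

```shell
# Clone and install dependencies (repository URL assumed from author/name).
git clone https://github.com/wzdnzd/harvester.git
cd harvester
pip install -r requirements.txt

# Generate a default config.yaml, then edit GitHub credentials, provider
# settings, rate limits, and pipeline threads. Alternatively, copy
# examples/config-simple.yaml or examples/config-full.yaml to config.yaml.
python main.py --create-config

# Run with the edited configuration.
python main.py
```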

Highlighted Details

  • Advanced Query Optimization Engine: Employs mathematical modeling, splittability analysis, and enumeration strategies to optimize search queries.
  • Multi-Stage Pipeline: Features a configurable 4-stage (Search, Gather, Check, Inspect) processing flow with DAG execution.
  • Extensible Architecture: Supports plugin-style providers and custom pipeline stages for easy integration of new data sources and processors.
  • Enterprise-Ready Features: Includes fault tolerance, state persistence, credential management, rate limiting, and real-time monitoring.
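The splittability idea behind the Query Optimization Engine can be illustrated with a toy recursion. Search APIs such as GitHub's cap the results returned per query, so a broad query is recursively partitioned into narrower sub-queries (here by an appended file-size range) until each slice fits under the cap. The helper names and the size-based split are hypothetical, chosen only to demonstrate the technique; harvester's actual engine uses its own modeling and enumeration strategies.

```python
# Illustrative query-splitting sketch (hypothetical helper, not harvester's
# actual engine). A query whose hit count exceeds the per-query cap is
# bisected by size range until every partition fits under the cap.
def split_query(base: str, lo: int, hi: int, count_hits, cap: int = 1000) -> list[str]:
    """Partition `base` by a size:lo..hi qualifier until each slice is <= cap hits."""
    query = f"{base} size:{lo}..{hi}"
    if count_hits(query) <= cap or lo >= hi:
        return [query]
    mid = (lo + hi) // 2
    return (split_query(base, lo, mid, count_hits, cap)
            + split_query(base, mid + 1, hi, count_hits, cap))

# Fake hit counter for demonstration: hits proportional to range width.
def fake_counts(query: str) -> int:
    lo, hi = query.rsplit("size:", 1)[1].split("..")
    return (int(hi) - int(lo)) * 10

queries = split_query("token language:python", 0, 400, fake_counts)
print(len(queries))  # 4 sub-queries, each under the 1000-hit cap
```

Enumerating the resulting sub-queries recovers the full result set that a single capped query would have truncated, at the cost of extra API calls.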

Maintenance & Community

The project is actively maintained, with contributions welcome. Users can contact maintainers via GitHub Issues.

Licensing & Compatibility

Licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Commercial use is strictly prohibited without explicit written permission.

Limitations & Caveats

This project is intended solely for educational and technical research purposes. Users must comply with all applicable laws, regulations, and third-party terms of service. The authors disclaim responsibility for any misuse, legal issues, or damages. Respect for intellectual property and privacy is paramount.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
