harvester by wzdnzd

Intelligent data acquisition framework for GitHub and web sources

Created 1 month ago
413 stars

Top 70.8% on SourcePulse

Project Summary

Harvester is an adaptive data acquisition framework for gathering information from sources such as GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. It targets developers and researchers who need to automate data collection, and it offers an extensible plugin-based architecture and a multi-stage processing pipeline for tasks such as AI service provider key discovery.

How It Works

Harvester employs a layered architecture with a core pipeline engine that orchestrates data acquisition through a series of stages: Search, Gather, Check, and Inspect. The Search stage utilizes an advanced Query Optimization Engine, leveraging mathematical modeling and enumeration strategies to efficiently query data sources like GitHub. Subsequent stages acquire detailed information, validate credentials against AI service endpoints, and inspect API capabilities. The framework is built for extensibility, supporting a plugin architecture for new data sources and processors, and features asynchronous, multi-threaded processing with adaptive rate limiting and robust error handling.
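The stage ordering described above can be sketched as a small dependency graph. This is an illustrative sketch only, not Harvester's actual code: the names `STAGE_DEPENDENCIES`, `run_pipeline`, and the placeholder stage functions are assumptions made for this example, and the stages simply transform strings rather than contact any data source.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical stage graph: each stage maps to the stages it depends on.
STAGE_DEPENDENCIES = {
    "search": set(),
    "gather": {"search"},
    "check": {"gather"},
    "inspect": {"check"},
}

Stage = Callable[[list[str]], list[str]]


def run_pipeline(stages: dict[str, Stage], seed: list[str]) -> tuple[list[str], list[str]]:
    """Run every stage once in dependency order, feeding each output to the next stage."""
    order = list(TopologicalSorter(STAGE_DEPENDENCIES).static_order())
    items = seed
    for name in order:
        items = stages[name](items)
    return order, items


# Placeholder stages (assumptions): they only rewrite or filter strings.
demo_stages: dict[str, Stage] = {
    "search": lambda queries: [f"result-for:{q}" for q in queries],
    "gather": lambda hits: [h.upper() for h in hits],
    "check": lambda records: [r for r in records if "EXAMPLE" in r],
    "inspect": lambda records: [f"{r} (inspected)" for r in records],
}

order, output = run_pipeline(demo_stages, ["example", "other"])
print(order)   # ['search', 'gather', 'check', 'inspect']
print(output)  # ['RESULT-FOR:EXAMPLE (inspected)']
```

Because the graph here is a straight chain, the topological order is fixed. A real DAG with branching stages would need to route each stage's output to its specific dependents instead of passing one list along.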

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
  • Requirements: Python 3.10+. Optional: uvloop for performance on Linux/macOS.
  • Configuration: Generate a default configuration (python main.py --create-config) or copy from examples (examples/config-simple.yaml, examples/config-full.yaml) and edit config.yaml. Key configuration includes GitHub credentials, provider settings, rate limits, and pipeline threads.
  • Running: Execute python main.py with optional configuration flags.
  • Documentation: Refer to the project's README and DeepWiki for detailed guides.

Highlighted Details

  • Advanced Query Optimization Engine: Employs mathematical modeling, splittability analysis, and enumeration strategies to optimize search queries.
  • Multi-Stage Pipeline: Features a configurable 4-stage (Search, Gather, Check, Inspect) processing flow with DAG execution.
  • Extensible Architecture: Supports plugin-style providers and custom pipeline stages for easy integration of new data sources and processors.
  • Enterprise-Ready Features: Includes fault tolerance, state persistence, credential management, rate limiting, and real-time monitoring.
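The source does not say how the adaptive rate limiting is implemented. One common approach is additive increase with multiplicative decrease, sketched below as an assumption: the class `AdaptiveRateLimiter`, its fields, and its default values are hypothetical and do not reflect Harvester's configuration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveRateLimiter:
    """Additive-increase, multiplicative-decrease request rate (requests per second)."""

    rate: float = 10.0
    min_rate: float = 0.5
    max_rate: float = 10.0
    increase_step: float = 0.5
    backoff_factor: float = 0.5

    def on_success(self) -> None:
        # Recover slowly after a successful request, capped at max_rate.
        self.rate = min(self.max_rate, self.rate + self.increase_step)

    def on_throttled(self) -> None:
        # Back off sharply when the remote side signals throttling, floored at min_rate.
        self.rate = max(self.min_rate, self.rate * self.backoff_factor)

    @property
    def delay(self) -> float:
        # Seconds to wait between consecutive requests at the current rate.
        return 1.0 / self.rate


limiter = AdaptiveRateLimiter()
limiter.on_throttled()  # 10.0 -> 5.0
limiter.on_throttled()  # 5.0 -> 2.5
limiter.on_success()    # 2.5 -> 3.0
print(limiter.rate)     # 3.0
```

A caller would sleep for `limiter.delay` between requests and report each outcome back, so the rate settles near whatever the remote service tolerates.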

Maintenance & Community

The project is actively maintained, with contributions welcome. Users can contact maintainers via GitHub Issues.

Licensing & Compatibility

Licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Commercial use is strictly prohibited without explicit written permission.

Limitations & Caveats

This project is intended solely for educational and technical research purposes. Users must comply with all applicable laws, regulations, and third-party terms of service. The authors disclaim responsibility for any misuse, legal issues, or damages. Respect for intellectual property and privacy is paramount.

Health Check

  • Last Commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 317 stars in the last 30 days
