Intelligent data acquisition framework for GitHub and web sources
Top 70.8% on SourcePulse
Harvester is a universal, adaptive data acquisition framework designed for comprehensive information gathering from diverse sources like GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. It targets developers and researchers needing to automate data collection, offering an extensible plugin-based architecture and an intelligent, multi-stage processing pipeline for tasks such as AI service provider key discovery.
How It Works
Harvester uses a layered architecture in which a core pipeline engine orchestrates data acquisition through four stages: Search, Gather, Check, and Inspect. The Search stage relies on a Query Optimization Engine that applies mathematical modeling and enumeration strategies to query data sources such as GitHub efficiently. The Gather stage acquires detailed information, the Check stage validates credentials against AI service endpoints, and the Inspect stage examines API capabilities. The framework is built for extensibility: a plugin architecture supports new data sources and processors, and processing is asynchronous and multi-threaded, with adaptive rate limiting and robust error handling.
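The staged design described above can be illustrated with a minimal sketch. This is not Harvester's actual code or API: the names Item, AdaptiveLimiter, run_stage, and run_pipeline are hypothetical, the stage bodies are placeholders that perform no network access, and the "double the delay on failure, halve it on success" policy is an assumed stand-in for whatever adaptive rate limiting the project implements. The sketch also runs one worker per stage for simplicity, whereas the framework is described as multi-threaded.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical unit of work passed between stages."""
    source: str
    notes: List[str] = field(default_factory=list)


StageFn = Callable[[Item], Awaitable[List[Item]]]


class AdaptiveLimiter:
    """Assumed policy: double the delay after a failure, halve it after a success."""

    def __init__(self, base: float = 0.01, max_delay: float = 1.0) -> None:
        self.base = base
        self.max_delay = max_delay
        self.delay = base

    def record(self, ok: bool) -> None:
        if ok:
            self.delay = max(self.base, self.delay / 2)
        else:
            self.delay = min(self.max_delay, self.delay * 2)

    async def wait(self) -> None:
        await asyncio.sleep(self.delay)


async def run_stage(name: str, fn: StageFn, inbox: asyncio.Queue,
                    outbox: asyncio.Queue, limiter: AdaptiveLimiter) -> None:
    """Consume items until the None sentinel arrives, then forward the sentinel."""
    while True:
        item: Optional[Item] = await inbox.get()
        if item is None:
            await outbox.put(None)
            return
        await limiter.wait()
        try:
            results = await fn(item)
            limiter.record(True)
        except Exception:
            # A failing item slows the stage down but does not stop the pipeline.
            limiter.record(False)
            results = []
        for result in results:
            result.notes.append(name)  # record which stages the item passed
            await outbox.put(result)


async def run_pipeline(seeds: List[Item],
                       stages: List[Tuple[str, StageFn]]) -> List[Item]:
    """Chain the stages with queues and return whatever leaves the last stage."""
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    tasks = [
        asyncio.create_task(
            run_stage(name, fn, queues[i], queues[i + 1], AdaptiveLimiter()))
        for i, (name, fn) in enumerate(stages)
    ]
    for seed in seeds:
        await queues[0].put(seed)
    await queues[0].put(None)
    await asyncio.gather(*tasks)
    finished: List[Item] = []
    while (item := queues[-1].get_nowait()) is not None:
        finished.append(item)
    return finished


# Placeholder stage bodies: they only manipulate strings.
async def search(item: Item) -> List[Item]:
    return [Item(f"{item.source}/a"), Item(f"{item.source}/b")]  # fan out


async def gather(item: Item) -> List[Item]:
    return [item]


async def check(item: Item) -> List[Item]:
    return [item] if item.source.endswith("/a") else []  # stand-in filter


async def inspect(item: Item) -> List[Item]:
    return [item]


STAGES = [("search", search), ("gather", gather),
          ("check", check), ("inspect", inspect)]
```

Calling asyncio.run(run_pipeline([Item("q1")], STAGES)) pushes one seed through all four stages. Connecting stages with queues keeps each stage independent, which is one plausible way a plugin could swap in a new source or processor without touching the others.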
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Optionally install uvloop for performance on Linux/macOS.
Generate a configuration file (python main.py --create-config) or copy from examples (examples/config-simple.yaml, examples/config-full.yaml) and edit config.yaml. Key configuration includes GitHub credentials, provider settings, rate limits, and pipeline threads.
Run python main.py with optional configuration flags.
Highlighted Details
Maintenance & Community
The project is actively maintained, with contributions welcome. Users can contact maintainers via GitHub Issues.
Licensing & Compatibility
Licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Commercial use is strictly prohibited without explicit written permission.
Limitations & Caveats
This project is intended solely for educational and technical research purposes. Users must comply with all applicable laws, regulations, and third-party terms of service. The authors disclaim responsibility for any misuse, legal issues, or damages. Respect for intellectual property and privacy is paramount.