CLI tool for web scraping and AI indexing
This project provides a command-line interface (CLI) web scraper designed for AI and simplicity. It extracts high-quality markdown content, metadata, and images from web pages, with an optional AI indexing capability that chunks content, generates embeddings using OpenAI, and stores them in Azure AI Search. The tool is suitable for researchers, developers, and data scientists needing to process web content for AI applications.
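To make the chunking step concrete, here is a minimal sketch of splitting markdown into heading-aligned chunks before embedding. It is an illustration only, not the tool's actual implementation: the function name `chunk_markdown` and the `max_chars` limit are hypothetical, and the real indexer may split on tokens or use a different strategy.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown into chunks that start at headings (hypothetical helper).

    Sections longer than max_chars are split again on blank lines.
    A single paragraph longer than max_chars is kept whole.
    """
    # Zero-width split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                # Adding this paragraph would overflow: close the current chunk.
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each returned chunk would then be sent to an embedding model and stored alongside its vector. Note that this sketch treats any line beginning with `# ` as a heading, so comment lines inside fenced code blocks would also trigger a split.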
How It Works
The scraper employs a decoupled architecture, leveraging Azure Queue Storage or local SQLite for task management and Azure Blob Storage or local disk for data persistence. It features idempotent operations for parallel execution, avoids re-scraping unchanged pages, and blocks ads. Dynamic content is handled via Playwright and Chromium. The indexer automatically chunks markdown content, generates embeddings via OpenAI, and indexes them into Azure AI Search for semantic search.
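The "avoids re-scraping unchanged pages" and idempotency properties can be sketched with a content digest stored per URL. This is an assumption about one way to get that behavior, not the project's code: the table layout and the names `open_store` and `record_if_changed` are hypothetical, and a local SQLite database stands in for whichever backend is configured.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small SQLite store mapping URL -> content digest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def record_if_changed(conn: sqlite3.Connection, url: str, body: str) -> bool:
    """Return True if the page is new or changed, False if it is unchanged.

    Calling this twice with the same input leaves the store in the same
    state, so a retried or duplicated task does no harm.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    row = conn.execute("SELECT digest FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None and row[0] == digest:
        return False  # Same content as last time: skip downstream work.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, digest) VALUES (?, ?)", (url, digest)
    )
    conn.commit()
    return True
```

A worker would call `record_if_changed` after fetching a page and only persist or index the result when it returns `True`.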
Quick Start & Requirements
python3 -m pip install scrape-it-now
Highlighted Details
Maintenance & Community
The project appears to be actively maintained by a single primary contributor. Community interaction channels are not explicitly mentioned in the README.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The scraper is tied to Chromium; the browser engine is not configurable. Local disk storage is available for testing but is not recommended for production because of its scalability and fault-tolerance limitations. Proxy support is not built in, so network anonymity requires external configuration.