scrape-it-now by clemlesne

CLI tool for web scraping and AI indexing

Created 1 year ago

540 stars

Top 58.8% on SourcePulse

Project Summary

This project provides a command-line interface (CLI) web scraper built for AI workloads and ease of use. It extracts high-quality markdown content, metadata, and images from web pages. An optional AI indexing step chunks the content, generates embeddings using OpenAI, and stores them in Azure AI Search. The tool is suitable for researchers, developers, and data scientists who need to process web content for AI applications.
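To make the chunking step concrete, here is a minimal sketch of splitting markdown into size-bounded chunks at paragraph boundaries. The function name `chunk_markdown` and the `max_chars` limit are hypothetical, and the project's own chunking strategy may differ (for example, it may count tokens rather than characters).

```python
import re


def chunk_markdown(text: str, max_chars: int = 200) -> list[str]:
    """Split markdown into chunks of at most max_chars characters.

    Illustrative only: paragraphs are packed greedily, and a paragraph
    longer than max_chars is hard-split into fixed-size pieces.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Every piece is guaranteed to fit in a chunk on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The piece does not fit: close the current chunk, start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph."
sample_chunks = chunk_markdown(sample, max_chars=30)
```

Each resulting chunk would then be sent to an embedding model and stored alongside its vector; that part is omitted here because it depends on service credentials.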

How It Works

The scraper employs a decoupled architecture, leveraging Azure Queue Storage or local SQLite for task management and Azure Blob Storage or local disk for data persistence. It features idempotent operations for parallel execution, avoids re-scraping unchanged pages, and blocks ads. Dynamic content is handled via Playwright and Chromium. The indexer automatically chunks markdown content, generates embeddings via OpenAI, and indexes them into Azure AI Search for semantic search.
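The queue-plus-storage design described above can be illustrated with a small local sketch. It assumes a SQLite table as the queue and a second table standing in for page storage, and it detects unchanged pages by comparing a SHA-256 digest of the fetched body. All names (`open_store`, `enqueue`, `process_next`) are hypothetical, and the project's real schema and change-detection logic are not shown in this summary.

```python
import hashlib
import sqlite3
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) queue and page tables if they are missing."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, digest TEXT, body TEXT)"
    )
    return db


def enqueue(db: sqlite3.Connection, url: str) -> None:
    # INSERT OR IGNORE makes enqueueing idempotent: a duplicate URL is a no-op.
    db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    db.commit()


def process_next(
    db: sqlite3.Connection, fetch: Callable[[str], str]
) -> Optional[str]:
    """Handle one queued URL; return 'stored', 'unchanged', or None if empty."""
    row = db.execute("SELECT url FROM queue LIMIT 1").fetchone()
    if row is None:
        return None
    url = row[0]
    body = fetch(url)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    stored = db.execute(
        "SELECT digest FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if stored and stored[0] == digest:
        status = "unchanged"  # same content as last time: skip the write
    else:
        db.execute(
            "INSERT OR REPLACE INTO pages (url, digest, body) VALUES (?, ?, ?)",
            (url, digest, body),
        )
        status = "stored"
    db.execute("DELETE FROM queue WHERE url = ?", (url,))
    db.commit()
    return status
```

Because both the enqueue and the store step are safe to repeat, several workers could run the same loop without corrupting state. Note that this sketch still fetches the page before deciding to skip the write; a real scraper might avoid the fetch itself, for instance by checking when the page was last scraped.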

Quick Start & Requirements

Installation: python3 -m pip install scrape-it-now
Requirements: Python 3.13+ for source installation. The Chromium browser (approx. 450MB) is downloaded automatically. Scraping requires either Azure Storage or the local disk/SQLite backend; indexing additionally requires OpenAI and Azure AI Search.
Documentation: scrape-it-now scrape run --help

Highlighted Details

Outputs clean markdown content using Pandoc.
Supports parallel scraping and indexing jobs.
Includes options for saving images and page screenshots.
Provides anonymity features like random user agents and viewport sizes.
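As an illustration of the last point, the sketch below picks a random user agent and viewport size for each browsing session. The `USER_AGENTS` and `VIEWPORTS` lists and the `random_profile` function are hypothetical examples, not values taken from the project.

```python
import random
from typing import Optional

# Illustrative values only; the project ships its own lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
VIEWPORTS = [(1280, 720), (1366, 768), (1920, 1080)]


def random_profile(rng: Optional[random.Random] = None) -> dict:
    """Return a randomly chosen user agent and viewport size."""
    rng = rng or random.Random()
    width, height = rng.choice(VIEWPORTS)
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
    }


profile = random_profile(random.Random(0))
```

The returned dictionary uses the same key names as the `user_agent` and `viewport` options of Playwright's `browser.new_context`, so it could be unpacked into that call; whether the project wires it up that way is an assumption.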

Maintenance & Community

The project appears to be actively maintained by a single primary contributor. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project relies on Chromium, and the browser is not configurable. Local disk storage is available for testing but is not recommended for production because of its scalability and fault-tolerance limitations. Proxy support is not built in, so network-level anonymity requires external configuration.
