scrape-it-now by clemlesne

CLI tool for web scraping and AI indexing

created 11 months ago
524 stars

Top 61.1% on sourcepulse

Project Summary

This project provides a command-line interface (CLI) web scraper designed for AI and simplicity. It extracts high-quality markdown content, metadata, and images from web pages, with an optional AI indexing capability that chunks content, generates embeddings using OpenAI, and stores them in Azure AI Search. The tool is suitable for researchers, developers, and data scientists needing to process web content for AI applications.
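The chunking step mentioned above can be pictured with a minimal sketch. This is not the project's actual chunking logic: the function name `chunk_markdown` and the `max_chars` limit are hypothetical, and the embedding and Azure AI Search calls are left out entirely. It only shows one plausible way to split markdown on blank lines and pack paragraphs into size-bounded chunks.

```python
def chunk_markdown(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Hypothetical helper for illustration; the real indexer may split differently.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Try to append the paragraph to the chunk being built.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # Hard-split a paragraph that is longer than the limit on its own.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be sent to an embedding model and stored with its vector; those steps depend on the configured services and are not shown here.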

How It Works

The scraper employs a decoupled architecture, leveraging Azure Queue Storage or local SQLite for task management and Azure Blob Storage or local disk for data persistence. It features idempotent operations for parallel execution, avoids re-scraping unchanged pages, and blocks ads. Dynamic content is handled via Playwright and Chromium. The indexer automatically chunks markdown content, generates embeddings via OpenAI, and indexes them into Azure AI Search for semantic search.
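As a rough illustration of how idempotent storage and skipping unchanged pages could work with a local SQLite backend, here is a minimal sketch. The table layout and the names `open_state` and `store_if_changed` are assumptions made for this example, not the project's schema or API. The idea is that writing the same page twice is a no-op, so parallel workers can safely process the same URL.

```python
import hashlib
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small state table mapping URL -> content digest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def store_if_changed(conn: sqlite3.Connection, url: str, content: str) -> bool:
    """Record the page and return True only if its content is new or changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT digest FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and row[0] == digest:
        # Same content as last time: nothing to write, safe to call again.
        return False
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, digest) VALUES (?, ?)",
        (url, digest),
    )
    conn.commit()
    return True
```

A worker would call `store_if_changed` after fetching a page and only persist the markdown (and enqueue it for indexing) when the function returns `True`.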

Quick Start & Requirements

  • Installation: python3 -m pip install scrape-it-now
  • Requirements: Python 3.13+ for source installation. Chromium browser (approx. 450MB) is downloaded automatically. Azure services (Storage, OpenAI, AI Search) or local disk/SQLite configurations are required for full functionality.
  • Documentation: scrape-it-now scrape run --help

Highlighted Details

  • Outputs clean markdown content using Pandoc.
  • Supports parallel scraping and indexing jobs.
  • Includes options for saving images and page screenshots.
  • Provides anonymity features like random user agents and viewport sizes.

Maintenance & Community

The project appears to be actively maintained by a single primary contributor. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project relies on Chromium, which is not configurable. While local disk storage is available for testing, it is not recommended for production due to scalability and fault-tolerance limitations. Proxy support is not built-in, requiring external configuration for network anonymity.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 22
  • Issues (30d): 1
  • Star history: 11 stars in the last 90 days
