scraperai
AI-powered web scraping simplified
Top 78.3% on SourcePulse
ScraperAI is an open-source, AI-powered tool that simplifies web scraping for users of all skill levels. Using Large Language Models (LLMs) such as ChatGPT and GPT-4 Vision, it automates data extraction from web pages and generates reusable, shareable scraping recipes, which lowers the barrier to entry for complex web data collection tasks.
How It Works
ScraperAI uses LLMs and GPT-4 Vision for data extraction and configuration: it automatically detects page types, catalog items, and pagination, and generates XPaths for static and dynamic fields. The default web crawler is Selenium-based and simulates human actions to bypass blocks; Playwright and requests are supported as alternatives.
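To make the idea of an XPath-based recipe concrete, here is a minimal sketch of how XPaths for catalog items, fields, and pagination could be applied to a page. It does not use ScraperAI's own API: the RECIPE layout, the apply_recipe function, and the sample page are all hypothetical, and the standard-library ElementTree parser is used only because it needs no installation. ElementTree supports a limited XPath subset and requires well-formed markup, so real pages would typically need lxml or a browser-based crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed catalog page used only for illustration.
PAGE = """
<html><body>
  <div class="card"><h2>Alpha</h2><span class="price">10</span></div>
  <div class="card"><h2>Beta</h2><span class="price">25</span></div>
  <a class="next" href="/page/2">Next</a>
</body></html>
"""

# Hypothetical recipe layout (not ScraperAI's format): one XPath selects the
# catalog items, relative XPaths select fields, and one selects the next page.
RECIPE = {
    "item": ".//div[@class='card']",
    "fields": {"name": "./h2", "price": "./span[@class='price']"},
    "next_page": ".//a[@class='next']",
}


def apply_recipe(page, recipe):
    """Return (rows, next_url) extracted from page using the recipe's XPaths."""
    root = ET.fromstring(page.strip())
    rows = []
    for item in root.findall(recipe["item"]):
        row = {}
        for field, xpath in recipe["fields"].items():
            node = item.find(xpath)
            # A missing node or empty text yields None rather than an error.
            row[field] = node.text.strip() if node is not None and node.text else None
        rows.append(row)
    next_link = root.find(recipe["next_page"])
    next_url = next_link.get("href") if next_link is not None else None
    return rows, next_url


rows, next_url = apply_recipe(PAGE, RECIPE)
print(rows)
print(next_url)
```

Because the recipe is plain data, it can be saved and reused on other pages of the same site without calling an LLM again, which is the appeal of reusable recipes described above.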
Quick Start & Requirements
Install via pip: pip install scraperai. Alternatively, clone the repository and install from source. The CLI application requires an OpenAI API key, which can be set through an environment variable, a .env file, or supplied directly. Examples are available in the /examples folder; the YCombinator notebook is the recommended starting point.
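The three ways of supplying the key can be sketched as a simple lookup chain. This is an illustration, not ScraperAI's actual code: the function names, the precedence order (direct value, then environment, then .env file), and the variable name OPENAI_API_KEY are assumptions. The small .env reader stands in for a library such as python-dotenv so that the sketch stays dependency-free.

```python
import os
from pathlib import Path


def read_dotenv(path):
    """Parse simple KEY=VALUE lines from a .env file; return {} if it is missing."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(explicit=None, env_path=".env", name="OPENAI_API_KEY"):
    """Assumed precedence: explicit argument, environment variable, .env file."""
    if explicit:
        return explicit
    if os.environ.get(name):
        return os.environ[name]
    return read_dotenv(env_path).get(name)


key = resolve_api_key()
print("API key found" if key else "No API key configured")
```

Keeping the key in the environment or a .env file, rather than in source code, avoids committing it to version control by accident.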
Maintenance & Community
The project roadmap includes expanding crawler support to httpx and aiohttp, improving recipe and prompt management, releasing a SaaS web app, and integrating more LLMs like gpt4all. Contributions are welcomed via pull requests and issues.
Licensing & Compatibility
The provided README does not specify the project's license. Users should verify licensing terms before commercial use or integration into closed-source projects.
Limitations & Caveats
ScraperAI does not currently offer a way to circumvent captchas. The interactive CLI application is limited to OpenAI chat models. The requests package, while supported, has inherent limitations compared to browser automation tools; for example, it does not execute JavaScript, so content rendered client-side is not visible to it.