scraperai by scraperai

AI-powered web scraping simplified

Created 2 years ago
360 stars

Top 78.3% on SourcePulse

Project Summary

ScraperAI is an open-source, AI-powered tool designed to simplify web scraping for users of all skill levels. By leveraging Large Language Models (LLMs) like ChatGPT and GPT-4 Vision, it automates the extraction of data from web pages and generates reusable, shareable scraping recipes. This significantly lowers the barrier to entry for complex web data collection tasks.

How It Works

ScraperAI uses LLMs, including GPT-4 Vision, to configure extraction: it automatically detects page types, catalog items, and pagination, and generates XPaths for both static and dynamic fields. The default web crawler is Selenium-based and simulates human actions to avoid blocks; Playwright and requests are supported as alternatives.
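To make the idea of XPath-based field extraction concrete, here is a minimal sketch using only the Python standard library. It is not ScraperAI's recipe format or API: the RECIPE structure, the field names, and the sample markup are all hypothetical, and ElementTree supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe: one XPath that selects catalog items, plus one XPath
# per field, relative to each item. Illustrative only, not ScraperAI's format.
RECIPE = {
    "item_xpath": ".//div[@class='card']",
    "fields": {
        "name": ".//h2",
        "price": ".//span[@class='price']",
    },
}

# Well-formed sample markup standing in for a fetched catalog page.
PAGE = """
<html><body>
  <div class="card"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="card"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>
"""


def apply_recipe(page_source, recipe):
    """Return one dict per catalog item, resolving each field by its XPath."""
    root = ET.fromstring(page_source)
    rows = []
    for item in root.findall(recipe["item_xpath"]):
        row = {}
        for field, xpath in recipe["fields"].items():
            node = item.find(xpath)
            # A field that is missing or empty becomes None instead of raising.
            row[field] = node.text.strip() if node is not None and node.text else None
        rows.append(row)
    return rows


rows = apply_recipe(PAGE, RECIPE)
```

Once the XPaths are known, the same recipe can be reapplied to any page with the same layout, which is what makes a recipe reusable and shareable.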

Quick Start & Requirements

Install with pip (pip install scraperai), or clone the repository and install from source. The CLI application requires an OpenAI API key, which can be supplied through an environment variable, a .env file, or directly. Examples are in the /examples folder; the YCombinator notebook is the recommended starting point.
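The key lookup order described above (environment variable first, then a .env file) can be sketched as follows. This is an illustration rather than ScraperAI's own loader: the function name load_api_key is hypothetical, and the variable name OPENAI_API_KEY is an assumption based on the convention used by OpenAI's tooling.

```python
import os
from pathlib import Path


def load_api_key(env_file=".env", var_name="OPENAI_API_KEY"):
    """Return the key from the environment, falling back to a .env file.

    Hypothetical helper; var_name is an assumed convention.
    """
    key = os.environ.get(var_name)
    if key:
        return key

    path = Path(env_file)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines that are not NAME=value pairs.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == var_name:
                # Drop surrounding whitespace and optional quotes.
                return value.strip().strip("'\"")
    return None
```

A caller would typically check for None and prompt the user for the key, which corresponds to the third, direct option.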

Highlighted Details

  • AI-driven automatic detection of page types, catalog items, data fields, and XPaths.
  • Automated pagination handling for catalog-style pages (XPath, scroll, URLs).
  • Selenium-based web crawler designed to mimic human interaction and avoid detection.
  • Interactive CLI application for guided scraping sessions.
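Of the pagination styles listed above, the URL-based one is the simplest to illustrate. The sketch below generates successive page URLs by setting a query parameter; the function name page_urls, the parameter name page, and the example.com address are all assumptions for illustration and do not reflect how ScraperAI implements pagination.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def page_urls(base_url, param="page", start=1, count=3):
    """Yield base_url with the page parameter set to start .. start+count-1."""
    parts = urlparse(base_url)
    # Keep any existing query parameters and overwrite only the page number.
    query = dict(parse_qsl(parts.query))
    for n in range(start, start + count):
        query[param] = str(n)
        yield urlunparse(parts._replace(query=urlencode(query)))


urls = list(page_urls("https://example.com/catalog?sort=new", count=2))
```

XPath-based and scroll-based pagination need a browser-driven crawler, since they depend on clicking a "next" element or triggering lazy loading.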

Maintenance & Community

The project roadmap includes expanding crawler support to httpx and aiohttp, improving recipe and prompt management, releasing a SaaS web app, and integrating more LLMs like gpt4all. Contributions are welcomed via pull requests and issues.

Licensing & Compatibility

The provided README does not specify the project's license. Users should verify licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

ScraperAI does not currently offer a way to get past captchas. The interactive CLI application is limited to OpenAI chat models. The requests package is supported but, unlike browser automation tools, it does not execute JavaScript, so pages that render their content client-side may come back incomplete.

Health Check

Last Commit: 5 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

121 stars in the last 30 days

Explore Similar Projects

Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

trafilatura by adbar

0.5%
5k
Python package for web text extraction
Created 6 years ago
Updated 5 months ago