scrapeghost  by jamesturk

Experimental library for web scraping using OpenAI's GPT API

created 2 years ago
1,441 stars

Top 28.9% on sourcepulse

Project Summary

scrapeghost is an experimental Python library for web scraping with OpenAI's GPT API, aimed at developers and researchers who want to use LLMs for data extraction. It simplifies defining data schemas, cleaning HTML, and validating scraped output, which may reduce manual effort in complex scraping tasks.

How It Works

Scrapeghost leverages GPT models to extract structured data from web pages. Users define the desired data shape using Python objects, which are then used to prompt the LLM. The library includes features for preprocessing HTML (cleaning, CSS/XPath filtering) and postprocessing the LLM's output (JSON validation, Pydantic schema validation, hallucination checks) to improve accuracy and reliability.
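The preprocessing and postprocessing stages described above can be illustrated with a minimal, standard-library sketch. This is not scrapeghost's implementation or API: the function names (clean_html, validate_output, find_unsupported) are hypothetical, and the hallucination check is assumed here to be a simple "does the extracted value appear in the page text" test. The LLM call itself is left out; the reply string stands in for whatever the model returned.

```python
import json
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Hypothetical preprocessing: reduce a page to whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def validate_output(raw: str, schema: dict) -> dict:
    """Parse the model's reply as JSON and check every schema key is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(schema) - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def find_unsupported(data: dict, page_text: str) -> list:
    """Assumed hallucination check: keys whose string values are absent from the page."""
    haystack = page_text.lower()
    return [
        key
        for key, value in data.items()
        if isinstance(value, str) and value.lower() not in haystack
    ]


# Usage: the reply below stands in for an LLM response.
schema = {"name": "string", "role": "string"}
page = "<h1>Ada Lovelace</h1><p>Role: Mathematician</p>"
reply = '{"name": "Ada Lovelace", "role": "Astronaut"}'
print(find_unsupported(validate_output(reply, schema), clean_html(page)))  # ['role']
```

A substring test like this only catches values copied verbatim from the page; values the model is asked to normalize or reformat would need a looser comparison.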

Quick Start & Requirements

Highlighted Details

  • Python-based schema definition for flexible data extraction.
  • Automated HTML cleaning and pre-filtering with CSS/XPath.
  • Postprocessing includes JSON and Pydantic validation, plus hallucination checks.
  • Built-in cost controls, including token tracking, fallbacks (e.g., GPT-3.5-Turbo to GPT-4), and budget limits.
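The cost controls in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the library's code: CostTracker, BudgetExceeded, and scrape_with_fallback are hypothetical names, the per-1K-token prices are placeholder values rather than current OpenAI pricing, and call_model and validate are caller-supplied functions standing in for the API request and the output validation step.

```python
from dataclasses import dataclass

# Assumed (prompt, completion) prices in USD per 1,000 tokens; placeholders only.
ASSUMED_PRICES = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4": (0.03, 0.06),
}


class BudgetExceeded(RuntimeError):
    """Raised once recorded spend passes the configured limit."""


@dataclass
class CostTracker:
    max_cost: float
    total_cost: float = 0.0

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Add the cost of one call to the running total and enforce the budget."""
        prompt_rate, completion_rate = ASSUMED_PRICES[model]
        cost = (prompt_tokens / 1000 * prompt_rate
                + completion_tokens / 1000 * completion_rate)
        self.total_cost += cost
        if self.total_cost > self.max_cost:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.4f}, limit ${self.max_cost:.4f}"
            )
        return cost


def scrape_with_fallback(call_model, validate, models, tracker):
    """Try models in order (cheapest first); fall back when validation fails.

    call_model(model) must return (reply, prompt_tokens, completion_tokens);
    validate(reply) must return parsed data or raise ValueError.
    """
    last_error = None
    for model in models:
        reply, prompt_tokens, completion_tokens = call_model(model)
        # Failed attempts are still billed, so record before validating.
        tracker.record(model, prompt_tokens, completion_tokens)
        try:
            return model, validate(reply)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError("no model produced valid output") from last_error
```

Because the spend is recorded after each call returns, this sketch can overshoot the limit by one call; a stricter variant would estimate the prompt cost up front and refuse to send the request.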

Maintenance & Community

Licensing & Compatibility

  • License: MIT. Permissive for commercial use and closed-source integration.

Limitations & Caveats

This library is explicitly experimental and can incur significant costs due to API calls (e.g., $0.36 per GPT-4 call on moderately sized pages). Cost estimates are not guaranteed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days
