scrapeghost by jamesturk

Experimental library for web scraping using OpenAI's GPT API

Created 2 years ago

1,444 stars

Top 28.0% on SourcePulse

3 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This library provides an experimental Python interface for web scraping with OpenAI's GPT API. It is aimed at developers and researchers who want to use LLMs for data extraction. It simplifies defining data schemas, cleaning HTML, and validating scraped output, which can reduce manual effort in complex scraping tasks.

How It Works

scrapeghost uses GPT models to extract structured data from web pages. Users describe the desired data shape with Python objects, and that description is used to prompt the LLM. The HTML can be preprocessed before the call (cleaning, CSS/XPath filtering), and the LLM's output is postprocessed afterwards (JSON validation, Pydantic schema validation, hallucination checks) to improve accuracy and reliability.
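To make the postprocessing idea concrete, here is a minimal sketch of two of the steps named above: JSON validation and a naive hallucination check. It uses only the standard library and is not scrapeghost's implementation. The function names (parse_json, find_unsupported_values) and the sample data are hypothetical, and the check assumes a value is "supported" only if it appears verbatim in the page source.

```python
import json
from typing import Any


def parse_json(raw: str) -> Any:
    """Step 1: reject model output that is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc


def find_unsupported_values(data: Any, html: str) -> list[str]:
    """Step 2: collect string values that never appear verbatim in the HTML.

    Walks nested dicts and lists; non-string leaves are ignored.
    """
    missing: list[str] = []
    if isinstance(data, dict):
        for value in data.values():
            missing.extend(find_unsupported_values(value, html))
    elif isinstance(data, list):
        for item in data:
            missing.extend(find_unsupported_values(item, html))
    elif isinstance(data, str) and data not in html:
        missing.append(data)
    return missing


# Hypothetical page and model output, for illustration only.
html = "<div><h1>Ada Lovelace</h1><span>District 7</span></div>"
raw_output = (
    '{"name": "Ada Lovelace", "district": "District 7", "party": "Independent"}'
)

extracted = parse_json(raw_output)
unsupported = find_unsupported_values(extracted, html)
print(unsupported)  # ['Independent']
```

A verbatim substring test is deliberately strict: it flags the invented "party" value, but it would also flag values the model legitimately normalized (for example, reformatted phone numbers), so a real check would need a looser comparison.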

Quick Start & Requirements

Install via pip: pip install scrapeghost
Requires an OpenAI API key.
Python 3.x.
Documentation: https://jamesturk.github.io/scrapeghost/

Highlighted Details

Python-based schema definition for flexible data extraction.
Automated HTML cleaning and pre-filtering with CSS/XPath.
Postprocessing includes JSON and Pydantic validation, plus hallucination checks.
Built-in cost controls, including token tracking, fallbacks (e.g., GPT-3.5-Turbo to GPT-4), and budget limits.
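The cost controls in the last item can be pictured with a small sketch: track spend per call, stop once a budget is reached, and fall back from a cheaper model to a stronger one when the cheaper output fails validation. Everything here is an assumption made for illustration, not the library's API: the class and function names, the model names, the per-1,000-token prices, and the fake_call stub that stands in for a real API request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class BudgetExceeded(Exception):
    """Raised when the running total has reached the configured limit."""


@dataclass
class Model:
    name: str
    prompt_price: float      # USD per 1,000 prompt tokens (placeholder)
    completion_price: float  # USD per 1,000 completion tokens (placeholder)


@dataclass
class Result:
    data: Optional[dict]     # None means the output failed validation
    prompt_tokens: int
    completion_tokens: int


@dataclass
class CostTracker:
    max_cost: float
    total_cost: float = 0.0

    def charge(self, model: Model, result: Result) -> None:
        # Failed calls still consume tokens, so every call is charged.
        self.total_cost += (
            result.prompt_tokens / 1000 * model.prompt_price
            + result.completion_tokens / 1000 * model.completion_price
        )


def scrape_with_fallback(
    html: str,
    models: list[Model],
    call_model: Callable[[str, str], Result],
    tracker: CostTracker,
) -> tuple[str, dict]:
    """Try each model in order until one returns usable data."""
    for model in models:
        if tracker.total_cost >= tracker.max_cost:
            raise BudgetExceeded(f"spent ${tracker.total_cost:.4f}")
        result = call_model(model.name, html)
        tracker.charge(model, result)
        if result.data is not None:
            return model.name, result.data
    raise RuntimeError("every model in the fallback chain failed")


def fake_call(model_name: str, html: str) -> Result:
    # Stub: the cheap model "fails", the strong model "succeeds".
    if model_name == "cheap-model":
        return Result(None, 1000, 100)
    return Result({"name": "Ada Lovelace"}, 1000, 100)


models = [Model("cheap-model", 0.002, 0.002), Model("strong-model", 0.03, 0.06)]
tracker = CostTracker(max_cost=1.00)
used, data = scrape_with_fallback("<h1>Ada Lovelace</h1>", models, fake_call, tracker)
print(used, round(tracker.total_cost, 4))  # strong-model 0.0382
```

Checking the budget before each call, rather than after, means a single call can still overshoot the limit; a stricter variant would estimate the next call's cost first.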

Maintenance & Community

Issues and development tracked on GitHub: https://github.com/jamesturk/scrapeghost/issues

Licensing & Compatibility

License: MIT. Permissive for commercial use and closed-source integration.

Limitations & Caveats

This library is explicitly experimental. Every scrape is a paid API call, so costs can add up quickly (e.g., $0.36 per GPT-4 call on moderately sized pages). Cost estimates are not guaranteed.
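A rough pre-flight estimate shows how page size translates into cost. The sketch below is an assumption-laden illustration, not the library's estimator: estimate_call_cost is a hypothetical helper, the 4-characters-per-token ratio is a coarse heuristic (HTML often tokenizes less efficiently), and the prices are placeholders that should be replaced with the provider's current rates. The inputs were chosen so the total matches the $0.36 example; the source does not state the token counts behind that figure.

```python
def estimate_call_cost(
    html: str,
    expected_completion_tokens: int,
    prompt_price: float,       # USD per 1,000 prompt tokens (placeholder)
    completion_price: float,   # USD per 1,000 completion tokens (placeholder)
    chars_per_token: float = 4.0,  # coarse heuristic, not a real tokenizer
) -> float:
    """Estimate the USD cost of sending one page to a model."""
    prompt_tokens = len(html) / chars_per_token
    return (
        prompt_tokens / 1000 * prompt_price
        + expected_completion_tokens / 1000 * completion_price
    )


# Hypothetical page: 32,000 characters, roughly 8,000 tokens by the heuristic.
page = "x" * 32_000
cost = estimate_call_cost(page, 2_000, prompt_price=0.03, completion_price=0.06)
print(f"${cost:.2f}")  # $0.36
```

Because the prompt dominates the total here, trimming the HTML before the call is the most direct way to bring the per-page cost down.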

Health Check

Last Commit: 6 months ago
Responsiveness: 1 day

Star History

3 stars in the last 30 days