scrapeghost by jamesturk

Experimental library for web scraping using OpenAI's GPT API

Created 2 years ago
1,442 stars

Top 28.3% on SourcePulse

Project Summary

This library provides an experimental Python interface for web scraping using OpenAI's GPT API, targeting developers and researchers interested in leveraging LLMs for data extraction. It aims to simplify the process of defining data schemas, cleaning HTML, and validating scraped output, potentially reducing manual effort in complex scraping tasks.

How It Works

Scrapeghost leverages GPT models to extract structured data from web pages. Users define the desired data shape using Python objects, which are then used to prompt the LLM. The library includes features for preprocessing HTML (cleaning, CSS/XPath filtering) and postprocessing the LLM's output (JSON validation, Pydantic schema validation, hallucination checks) to improve accuracy and reliability.
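The pre- and postprocessing stages described above can be pictured with a small standard-library sketch. This is not scrapeghost's implementation or API: the names `clean_html` and `parse_and_check` are hypothetical, dropping all attributes is a simplifying assumption, and the hallucination check is approximated as "every extracted string must appear verbatim in the cleaned page".

```python
import json
from html.parser import HTMLParser


class _Cleaner(HTMLParser):
    """Rebuilds HTML without <script>/<style> bodies, comments, or attributes."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif not self._skip_depth:
            self.parts.append(f"<{tag}>")  # attributes are dropped (assumption)

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str) -> str:
    """Hypothetical preprocessing step: shrink the page before prompting."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


def parse_and_check(raw_output: str, source_html: str, required: set) -> dict:
    """Hypothetical postprocessing step: JSON, field, and hallucination checks."""
    data = json.loads(raw_output)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for key, value in data.items():
        # Naive check: a string value must appear verbatim in the source.
        if isinstance(value, str) and value not in source_html:
            raise ValueError(f"possible hallucination in {key!r}: {value!r}")
    return data
```

In this sketch, a model reply such as `{"name": "Ada Lovelace"}` passes only if that name occurs in the cleaned page; a name that is absent from the page raises `ValueError` instead of being returned silently.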

Quick Start & Requirements

Highlighted Details

  • Python-based schema definition for flexible data extraction.
  • Automated HTML cleaning and pre-filtering with CSS/XPath.
  • Postprocessing includes JSON and Pydantic validation, plus hallucination checks.
  • Built-in cost controls, including token tracking, fallbacks (e.g., GPT-3.5-Turbo to GPT-4), and budget limits.
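The cost controls in the last bullet can be sketched as a fallback chain with a running budget. The code below is an illustration under stated assumptions, not the library's interface: `BudgetedScraper`, `BudgetExceeded`, and the idea that each model is a callable returning `(output, cost_in_usd)` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable returning (output_text, cost_in_usd).
ModelCall = Callable[[str], Tuple[str, float]]


class BudgetExceeded(Exception):
    """Raised when the running cost has reached the configured limit."""


@dataclass
class BudgetedScraper:
    models: List[Tuple[str, ModelCall]]  # tried in order, cheapest first
    is_valid: Callable[[str], bool]      # e.g. a JSON or schema check
    max_cost: float                      # budget limit in USD
    total_cost: float = 0.0              # running total across calls
    tried: List[str] = field(default_factory=list)

    def scrape(self, prompt: str) -> str:
        for name, call in self.models:
            # Refuse to start another call once the budget is used up.
            if self.total_cost >= self.max_cost:
                raise BudgetExceeded(
                    f"spent ${self.total_cost:.2f} of ${self.max_cost:.2f}"
                )
            output, cost = call(prompt)
            self.total_cost += cost
            self.tried.append(name)
            if self.is_valid(output):
                return output
            # Invalid output: fall through to the next (stronger) model.
        raise ValueError("no model produced valid output")
```

Ordering the models from cheapest to most expensive means the costlier model is only paid for when the cheaper one fails validation, and the budget check runs before every call, so a single failed attempt cannot silently trigger an expensive retry.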

Maintenance & Community

Licensing & Compatibility

  • License: MIT. Permissive for commercial use and closed-source integration.

Limitations & Caveats

This library is explicitly experimental and can incur significant costs due to API calls (e.g., $0.36 per GPT-4 call on moderately sized pages). Cost estimates are not guaranteed.
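To see how a per-call figure of this size can arise, the sketch below estimates cost from token counts. The token counts and per-1,000-token prices are illustrative assumptions chosen to land near the quoted $0.36; they are not the source's breakdown, and real prices change over time.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Estimated cost in USD of one API call, billed per 1,000 tokens."""
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


# Assumed figures for illustration only: 10,000 prompt tokens at $0.03 per 1K
# and 1,000 completion tokens at $0.06 per 1K, i.e. $0.30 + $0.06.
example = estimate_cost(10_000, 1_000, 0.03, 0.06)
print(f"estimated cost: ${example:.2f}")
```

Because the prompt contains the page's HTML, prompt tokens dominate the total in this example, which is why trimming the page before sending it has a direct effect on cost.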

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
