Experimental library for web scraping using OpenAI's GPT API
This library provides an experimental Python interface for web scraping using OpenAI's GPT API, targeting developers and researchers interested in leveraging LLMs for data extraction. It aims to simplify the process of defining data schemas, cleaning HTML, and validating scraped output, potentially reducing manual effort in complex scraping tasks.
How It Works
Scrapeghost leverages GPT models to extract structured data from web pages. Users define the desired data shape using Python objects, which are then used to prompt the LLM. The library includes features for preprocessing HTML (cleaning, CSS/XPath filtering) and postprocessing the LLM's output (JSON validation, Pydantic schema validation, hallucination checks) to improve accuracy and reliability.
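The flow described above (clean the HTML, prompt with a schema, validate the JSON, check for hallucinated values) can be sketched with the standard library alone. This is a minimal illustration of the idea, not scrapeghost's actual API: the helper names `clean_html`, `build_prompt`, `validate_output`, and `find_unsupported_values` are hypothetical, the schema is assumed to be a plain dict of field names to type descriptions, and the model call itself is left out.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean_html(html: str) -> str:
    """Preprocessing (hypothetical helper): reduce a page to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_prompt(schema: dict, page_text: str) -> str:
    """Turn the schema and cleaned page into an instruction for the model."""
    return (
        "Extract data from the page as JSON matching this schema:\n"
        f"{json.dumps(schema)}\n\nPage:\n{page_text}"
    )


def validate_output(raw: str, schema: dict) -> dict:
    """Postprocessing: the reply must be a JSON object with every schema key."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = sorted(set(schema) - set(data))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def find_unsupported_values(data: dict, page_text: str) -> list:
    """Naive hallucination check: string values that never occur in the page."""
    return [
        key
        for key, value in data.items()
        if isinstance(value, str) and value not in page_text
    ]
```

The substring check is deliberately crude: it flags any string the model reformatted or paraphrased, so a real check would need normalization and would still produce false positives.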
Quick Start & Requirements
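Install from PyPI:
pip install scrapeghost
An OpenAI API key is required, because every scrape is a call to OpenAI's GPT API.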
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This library is explicitly experimental. Because every scrape is a paid API call, costs can add up quickly (e.g., $0.36 per GPT-4 call on moderately sized pages). Any cost estimate is approximate and not guaranteed.
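To get a feel for how per-call costs accumulate, the arithmetic can be sketched as below. This assumes the common pricing model of separate per-1,000-token rates for prompt and completion tokens. The function names, token counts, and rates are illustrative placeholders rather than current OpenAI prices, and the sketch does not claim to reproduce how the $0.36 figure above was derived.

```python
import math


def estimate_call_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_rate_per_1k: float,
    completion_rate_per_1k: float,
) -> float:
    """Estimated cost in dollars of one API call, given per-1K-token rates."""
    return (
        prompt_tokens / 1000 * prompt_rate_per_1k
        + completion_tokens / 1000 * completion_rate_per_1k
    )


def max_calls_within_budget(budget: float, cost_per_call: float) -> int:
    """How many calls of a given cost fit in a budget (rounded down)."""
    if cost_per_call <= 0:
        raise ValueError("cost_per_call must be positive")
    return math.floor(budget / cost_per_call)


# Placeholder inputs: 6,000 prompt tokens and 1,000 completion tokens
# at assumed rates of $0.03 and $0.06 per 1K tokens.
cost = estimate_call_cost(6000, 1000, 0.03, 0.06)
print(f"estimated cost per call: ${cost:.2f}")
print("calls within a $10 budget at $0.36 each:", max_calls_within_budget(10.0, 0.36))
```

Since most of the tokens in a scrape come from the page itself, trimming the HTML before sending it is the main lever for keeping this number down.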