gpt-automated-web-scraper by djb-gt

GPT-based tool for automated web scraping

Created 2 years ago

270 stars

Top 95.4% on SourcePulse

Project Summary

This project provides an AI-powered web scraper that simplifies data extraction from HTML by generating and executing custom scraping code. It is aimed at users who need specific information from websites but do not want to write scrapers by hand. Its main benefit is that GPT-4 creates the scraper automatically, which makes web scraping more accessible.

How It Works

The scraper uses OpenAI's GPT-4 model to interpret user-defined requirements and analyze HTML content. Because a full page can exceed GPT-4's token limit, the user supplies a target-string, which the tool uses to select the relevant subset of the HTML. From that subset, GPT-4 generates Python code that uses libraries such as BeautifulSoup to extract the data, and the tool then executes this code to retrieve the desired information.
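The target-string step can be pictured with a minimal sketch. This is not the project's actual implementation: the function names html_window and build_prompt, the character budget, and the prompt wording are all hypothetical, and characters are used as a rough stand-in for tokens.

```python
def html_window(html: str, target: str, max_chars: int = 4000) -> str:
    """Return a slice of `html` centred on the first occurrence of `target`.

    Hypothetical helper: the real tool may select the subset differently.
    """
    idx = html.find(target)
    if idx == -1:
        raise ValueError(f"target string {target!r} not found in HTML")
    # Split the remaining character budget evenly before and after the match.
    half = max(0, (max_chars - len(target)) // 2)
    start = max(0, idx - half)
    end = min(len(html), idx + len(target) + half)
    return html[start:end]


def build_prompt(snippet: str, requirements: str) -> str:
    """Assemble an illustrative prompt; the wording here is an assumption."""
    return (
        "Write Python code using BeautifulSoup that extracts the following "
        "from the HTML below.\n"
        f"Requirements: {requirements}\n\n"
        f"HTML:\n{snippet}"
    )


# Example: a long page where only the area around the price matters.
sample = (
    "<html>"
    + "<p>filler</p>" * 500
    + '<span class="price">$19.99</span>'
    + "<p>filler</p>" * 500
    + "</html>"
)
snippet = html_window(sample, "$19.99", max_chars=200)
prompt = build_prompt(snippet, "the product price")
```

The prompt would then be sent to GPT-4, and the code it returns would be executed against the full HTML. Those two steps are left out here because they depend on the API client and on how the project runs generated code.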

Quick Start & Requirements

Primary install / run command: pip install -r requirements.txt followed by python3 gpt-scraper.py ...
Non-default prerequisites: Python 3.x, OpenAI GPT-4 API key.
Setup: Clone repo, install requirements, configure .env with API key.
Links: Project Repository
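As a rough illustration of the .env setup step, the sketch below reads an API key from the environment or from a .env file using only the standard library. The variable name OPENAI_API_KEY and both helper functions are assumptions rather than details taken from the README; the project itself may rely on a package such as python-dotenv.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: str = ".env", name: str = "OPENAI_API_KEY") -> str:
    """Prefer the process environment, then fall back to the .env file.

    The default variable name is an assumption, not confirmed by the README.
    """
    key = os.environ.get(name) or read_env_file(env_path).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in the environment or {env_path}")
    return key
```

Calling get_api_key() before any request makes a missing key fail early with a clear message instead of surfacing as an authentication error later.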

Highlighted Details

AI-driven scraper code generation using GPT-4.
Handles both URL and local HTML file sources.
Requires a target-string to focus GPT-4 on relevant HTML sections due to token limits.
Outputs extracted data based on user-defined requirements.
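Supporting both URL and local file sources could look roughly like the sketch below. The helper names is_url and load_html are hypothetical, and the standard library's urllib is used here only to keep the example dependency-free; the project may fetch pages with a different library.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source: str) -> bool:
    """Treat only http(s) sources as URLs; everything else is a file path."""
    return urlparse(source).scheme in ("http", "https")


def load_html(source: str, timeout: float = 10.0) -> str:
    """Return HTML text from a URL or from a local file (hypothetical helper)."""
    if is_url(source):
        with urlopen(source, timeout=timeout) as resp:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")
```

Either kind of source then yields a plain string, so the later steps (selecting the subset around the target-string and prompting GPT-4) do not need to know where the HTML came from.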

Maintenance & Community

No specific information on maintainers, community channels, or roadmap is provided in the README.

Licensing & Compatibility

License type: MIT License.
Compatibility: Permissive for commercial and closed-source use.

Limitations & Caveats

The project's effectiveness depends on GPT-4's ability to interpret requirements accurately and generate correct scraping code. The reliance on a target-string suggests potential limitations with highly dynamic or complex websites, where a single string might not reliably isolate the target data.
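One way to see the target-string caveat is to check how many times a candidate string occurs in a page before relying on it. The helper below is a hypothetical diagnostic, not part of the project: zero matches means the string cannot locate anything, and more than one match means it does not isolate a single region.

```python
def find_target_offsets(html: str, target: str) -> list:
    """Return the start offset of every non-overlapping match of `target`."""
    if not target:
        raise ValueError("target must be a non-empty string")
    offsets = []
    start = html.find(target)
    while start != -1:
        offsets.append(start)
        # Continue searching after the end of the current match.
        start = html.find(target, start + len(target))
    return offsets


page = "<li>Widget $5</li><li>Gadget $5</li><li>Gizmo $7</li>"

ambiguous = find_target_offsets(page, "$5")    # matches two list items
unique = find_target_offsets(page, "Gizmo")    # matches exactly one
missing = find_target_offsets(page, "Doohickey")  # matches nothing
```

A string that matches exactly once is the safest choice. On pages whose content is rendered by JavaScript, the string may be absent from the fetched HTML altogether, which would show up here as an empty list.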

Explore Similar Projects

mcp by hyperbrowserai

parsera by raznem

llm-api-engine by developersdigest

scrapeghost by jamesturk

fetch-mcp by zcaceres

entities-extraction-web-scraper by trancethehuman

web-scraping by je-suis-tm

trafilatura by adbar

llm-scraper by mishushakov

crawlee-python by apify

crawlee by apify

Scrapegraph-ai by ScrapeGraphAI