gpt-automated-web-scraper  by dirkjbreeuwer

GPT-based tool for automated web scraping

created 2 years ago
266 stars

Top 96.9% on sourcepulse

Project Summary

This project provides an AI-powered web scraper that simplifies data extraction from HTML by generating and executing custom scraping code. It's designed for users who need to extract specific information from websites but want to avoid manual coding of scrapers. The primary benefit is the automation of scraper creation through GPT-4, making web scraping more accessible.

How It Works

The scraper uses OpenAI's GPT-4 model to interpret user-defined requirements and analyze HTML content. Because a full page can exceed GPT-4's token limit, the user supplies a target-string, which the tool uses to narrow the HTML to the relevant subset. From that analysis it generates Python code, using libraries such as BeautifulSoup, to perform the extraction, then executes the code to retrieve the requested data.
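The idea of narrowing the HTML around a target-string can be sketched as follows. This is a minimal illustration and not the project's actual implementation: the function names html_window and build_prompt, the radius parameter, and the prompt wording are all hypothetical, and a plain character window stands in for whatever subset logic the tool really uses.

```python
def html_window(html: str, target: str, radius: int = 2000) -> str:
    """Return a slice of `html` around the first occurrence of `target`.

    Hypothetical helper: a fixed character window is an assumption made
    for illustration, not the project's documented behavior.
    """
    idx = html.find(target)
    if idx == -1:
        raise ValueError(f"target string {target!r} not found in HTML")
    start = max(0, idx - radius)
    end = min(len(html), idx + len(target) + radius)
    return html[start:end]


def build_prompt(requirements: str, snippet: str) -> str:
    """Combine the user's requirements and the HTML snippet into one prompt.

    The wording here is illustrative only.
    """
    return (
        "Write Python code using BeautifulSoup that extracts the following "
        f"from the HTML below.\n\nRequirements: {requirements}\n\nHTML:\n{snippet}"
    )
```

A smaller radius keeps the prompt well under the token limit but risks cutting off enclosing tags, so the right value depends on the page.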

Quick Start & Requirements

  • Primary install / run command: pip install -r requirements.txt followed by python3 gpt-scraper.py ...
  • Non-default prerequisites: Python 3.x, OpenAI GPT-4 API key.
  • Setup: Clone repo, install requirements, configure .env with API key.
  • Links: Project Repository
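As a rough sketch of the .env step, the snippet below reads a key from a .env file using only the standard library. The variable name OPENAI_API_KEY and the helpers load_env_file and get_api_key are assumptions for illustration; the summary does not say which variable name or loading mechanism the project expects, so check the repository's README before relying on this.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env", name: str = "OPENAI_API_KEY") -> str:
    """Prefer the process environment, then fall back to the .env file.

    The default variable name is an assumption, not taken from the project.
    """
    key = os.environ.get(name) or load_env_file(path).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in the environment or in {path}")
    return key
```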

Highlighted Details

  • AI-driven scraper code generation using GPT-4.
  • Handles both URL and local HTML file sources.
  • Requires a target-string to focus GPT-4 on relevant HTML sections due to token limits.
  • Outputs extracted data based on user-defined requirements.
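Supporting both a URL and a local HTML file could look roughly like the sketch below. The names is_url and load_html are hypothetical, and the standard-library urllib is used here only to keep the example self-contained; the project may fetch pages differently.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source: str) -> bool:
    """Treat the source as a URL only if it has an http(s) scheme."""
    return urlparse(source).scheme in ("http", "https")


def load_html(source: str, timeout: float = 10.0) -> str:
    """Return HTML text from a URL or from a local file path.

    Hypothetical helper for illustration, not the project's own loader.
    """
    if is_url(source):
        with urlopen(source, timeout=timeout) as resp:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")
```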

Maintenance & Community

No specific information on maintainers, community channels, or roadmap is provided in the README.

Licensing & Compatibility

  • License type: MIT License.
  • Compatibility: Permissive for commercial and closed-source use.

Limitations & Caveats

The project's effectiveness depends on GPT-4's ability to interpret the requirements accurately and generate correct scraping code. The reliance on a target-string also suggests it may struggle with highly dynamic or complex pages, where a single string might not reliably isolate the target data.
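One way to spot the target-string problem before spending an API call is to check how often the string occurs in the page. The check below is a hypothetical pre-flight step, not a feature of the project; the function name classify_target and its three labels are illustrative.

```python
def classify_target(html: str, target: str) -> str:
    """Report whether `target` is missing, unique, or ambiguous in `html`."""
    if not target:
        raise ValueError("target string must not be empty")
    count = html.count(target)
    if count == 0:
        return "missing"      # e.g. content rendered later by JavaScript
    if count == 1:
        return "unique"       # the string pins down a single location
    return "ambiguous"        # several matches; pick a more specific string
```

A "missing" result on a page that shows the text in a browser is a hint that the content is loaded dynamically and is absent from the raw HTML.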

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
