gpt-automated-web-scraper by dirkjbreeuwer

GPT-based tool for automated web scraping

Created 2 years ago
268 stars

Top 95.7% on SourcePulse

Project Summary

This project is an AI-powered web scraper that uses GPT-4 to generate and execute custom scraping code for a given HTML source. It is aimed at users who need specific information from websites but do not want to write scrapers by hand. Its main benefit is that GPT-4 automates scraper creation, which makes web scraping more accessible.

How It Works

The scraper uses OpenAI's GPT-4 model to interpret user-defined requirements and analyze HTML content. Because a full page can exceed GPT-4's token limit, the user supplies a target-string that the tool uses to identify the relevant subset of the HTML. Based on this analysis, GPT-4 generates Python code, using libraries such as BeautifulSoup, that performs the extraction. The tool then executes that code to retrieve the requested data.
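One way to picture the target-string step is as cutting a window of HTML around the first occurrence of the string, so that only that slice is sent to the model. The sketch below illustrates the idea under stated assumptions and is not the project's code: the name `snippet_around`, the character-based window (standing in for a token budget), and the default size are all hypothetical, and the real tool may select the subset differently.

```python
def snippet_around(html: str, target: str, window_chars: int = 2000) -> str:
    """Return a slice of `html` centred on the first occurrence of `target`.

    Hypothetical helper: a character window stands in for a token budget.
    """
    index = html.find(target)
    if index == -1:
        raise ValueError(f"target string {target!r} not found in HTML")
    half = window_chars // 2
    # Clamp the window to the bounds of the document.
    start = max(0, index - half)
    end = min(len(html), index + len(target) + half)
    return html[start:end]


# A large page with one interesting element buried in filler markup.
page = (
    "<html><body>"
    + "<p>filler</p>" * 500
    + "<span class='price'>$19.99</span>"
    + "<p>filler</p>" * 500
    + "</body></html>"
)
snippet = snippet_around(page, "$19.99", window_chars=200)
print(len(page), len(snippet))
```

The slice is much smaller than the page but still contains the markup surrounding the target, which is what a model would need to propose a selector.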

Quick Start & Requirements

  • Primary install / run command: pip install -r requirements.txt followed by python3 gpt-scraper.py ...
  • Non-default prerequisites: Python 3.x, OpenAI GPT-4 API key.
  • Setup: Clone repo, install requirements, configure .env with API key.
  • Links: Project Repository
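The setup step stores the API key in a .env file. As a rough illustration of what reading such a file involves, the sketch below parses simple KEY=value lines using only the standard library. The variable name `OPENAI_API_KEY` and the helpers `read_env_file` and `get_api_key` are assumptions made for this example; the project may use a different key name or a dedicated loader.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines; skip blanks, comments, malformed lines."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env", name: str = "OPENAI_API_KEY"):
    """Prefer the process environment, then fall back to the .env file.

    `name` is an assumed variable name, not confirmed by the project.
    """
    return os.environ.get(name) or read_env_file(path).get(name)
```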

Highlighted Details

  • AI-driven scraper code generation using GPT-4.
  • Handles both URL and local HTML file sources.
  • Requires a target-string to focus GPT-4 on relevant HTML sections due to token limits.
  • Outputs extracted data based on user-defined requirements.
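Supporting both URL and local-file sources comes down to deciding how to read the HTML before any analysis happens. A minimal sketch using only the standard library follows; `is_url` and `load_html` are hypothetical names chosen for this example, and the project may fetch pages with a different library.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source: str) -> bool:
    """Treat only http(s) sources as URLs; everything else is a file path."""
    return urlparse(source).scheme in ("http", "https")


def load_html(source: str) -> str:
    """Return HTML text from a URL or from a local file."""
    if is_url(source):
        with urlopen(source, timeout=30) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")
```

Either branch returns a plain string, so later steps do not need to know where the HTML came from.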

Maintenance & Community

No specific information on maintainers, community channels, or roadmap is provided in the README.

Licensing & Compatibility

  • License type: MIT License.
  • Compatibility: Permissive for commercial and closed-source use.

Limitations & Caveats

The project's effectiveness depends on GPT-4's ability to interpret the requirements accurately and generate correct scraping code. The reliance on a target-string suggests potential limitations on highly dynamic or complex pages, where a single string may not reliably isolate the target data.
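The target-string concern can be checked before any API call by counting how often the string appears in the page. The sketch below is an illustrative pre-check and not part of the project; `classify_target` and its three labels are hypothetical.

```python
def classify_target(html: str, target: str) -> str:
    """Report whether `target` can isolate one region of `html`."""
    count = html.count(target)
    if count == 0:
        return "missing"    # e.g. content rendered later by JavaScript
    if count == 1:
        return "unique"     # the string points at exactly one place
    return "ambiguous"      # several matches; the wrong one may be chosen


listing = "<li>Widget $5</li><li>Gadget $5</li><li>Gizmo $9</li>"
print(classify_target(listing, "Gizmo"))
print(classify_target(listing, "$5"))
print(classify_target(listing, "Sprocket"))
```

A "missing" or "ambiguous" result suggests picking a longer or more distinctive string before asking the model to generate code.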

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
