Workshop on web scraping techniques
This repository provides materials for a data journalism workshop on cutting-edge web scraping techniques, targeting journalists with existing scraping experience. It demonstrates advanced methods including video-to-data extraction using LLMs, browser automation with Playwright, and leveraging GitHub Actions for automated scraping.
How It Works
The workshop covers four main areas: Git scraping, which uses GitHub Actions to track changes automatically; in-browser JavaScript and shot-scraper for complex site interaction and data extraction; LLM-based scraping with OpenAI and Google Gemini, which turns unstructured data (including PDFs and images) into structured output; and video scraping, which feeds screen recordings into LLMs to extract data from websites that resist conventional scraping.
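Git scraping depends on each saved snapshot producing a clean diff when the source changes. The following is a minimal Python sketch of that step, assuming the scraped source returns JSON. The function names normalize_snapshot and write_if_changed are hypothetical and are not taken from the workshop materials.

```python
import json
import tempfile
from pathlib import Path


def normalize_snapshot(raw_json: str) -> str:
    """Re-serialize JSON with sorted keys and fixed indentation.

    Stable formatting keeps version-control diffs limited to real data changes.
    """
    data = json.loads(raw_json)
    return json.dumps(data, indent=2, sort_keys=True) + "\n"


def write_if_changed(path: Path, raw_json: str) -> bool:
    """Write the normalized snapshot; return True only if the file content changed."""
    new_text = normalize_snapshot(raw_json)
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "data.json"
        print(write_if_changed(snapshot, '{"b": 1, "a": 2}'))  # True: first write
        print(write_if_changed(snapshot, '{"a": 2, "b": 1}'))  # False: same data, different key order
        print(write_if_changed(snapshot, '{"a": 3, "b": 1}'))  # True: a value changed
```

In a scheduled workflow, a commit would presumably be made only when write_if_changed returns True, so the repository history records real changes rather than formatting noise.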
Quick Start & Requirements
pip install -r requirements.txt
shot-scraper: shot-scraper install
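Once installed, shot-scraper can run JavaScript against a page from the command line and print the result as JSON. The following is a minimal sketch of driving that from Python, assuming the shot-scraper executable is on the PATH. The helper names build_command and run_javascript are hypothetical.

```python
import json
import subprocess


def build_command(url: str, javascript: str) -> list[str]:
    """Assemble the argument list for `shot-scraper javascript URL JS`."""
    return ["shot-scraper", "javascript", url, javascript]


def run_javascript(url: str, javascript: str):
    """Run the expression in a headless browser and parse the JSON printed to stdout.

    Requires network access and a working shot-scraper install.
    """
    result = subprocess.run(
        build_command(url, javascript),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if shot-scraper exits with an error
    )
    return json.loads(result.stdout)
```

For example, run_javascript("https://example.com/", "document.title") would return the page title as a Python string. The URL here is only a placeholder.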
Highlighted Details
shot-scraper for browser automation, JavaScript execution, and full-page screenshots.
Maintenance & Community
This repository is associated with Simon Willison, a prominent figure in data journalism and tool development. Further collaboration and tool development are encouraged via email.
Licensing & Compatibility
The repository itself does not specify a license. The included tools and techniques may have their own licensing terms. Commercial use of LLM APIs will incur costs.
Limitations & Caveats
The workshop assumes familiarity with web scraping and Python. Some LLMs have rate limits or usage costs. Video scraping with Google AI Studio is subject to data usage policies, which should be reviewed before uploading confidential information.