nicar-2025-scraping by simonw

Workshop on web scraping techniques

Created 11 months ago

370 stars

Top 76.4% on SourcePulse


1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This repository provides materials for a data journalism workshop on cutting-edge web scraping techniques, targeting journalists with existing scraping experience. It demonstrates advanced methods including video-to-data extraction using LLMs, browser automation with Playwright, and leveraging GitHub Actions for automated scraping.

How It Works

The workshop covers four main areas:
Git scraping: using GitHub Actions for automated change tracking.
In-browser JavaScript and shot-scraper: interacting with complex sites and extracting data from them.
LLM-based scraping: using OpenAI and Google Gemini to turn unstructured data (including PDFs and images) into structured output.
Video scraping: feeding screen recordings into LLMs to extract data from websites that resist conventional scraping.
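
As a rough illustration of the change-tracking idea behind Git scraping, the sketch below writes a normalized JSON snapshot to disk only when it differs from the previous one, so a later commit step records a meaningful diff. The function names and the file layout are hypothetical and not taken from the workshop materials. The fetch and the Git commit, which a scheduled GitHub Actions workflow would handle, are deliberately left out.

```python
import json
from pathlib import Path


def normalize(raw_json: str) -> str:
    """Re-serialize JSON with sorted keys and fixed indentation so diffs stay small."""
    return json.dumps(json.loads(raw_json), indent=2, sort_keys=True) + "\n"


def save_if_changed(path: Path, raw_json: str) -> bool:
    """Write the normalized snapshot only when it differs from the file on disk.

    Returns True when the file was written, meaning there is something to commit.
    """
    snapshot = normalize(raw_json)
    if path.exists() and path.read_text(encoding="utf-8") == snapshot:
        return False  # nothing changed, so no commit is needed
    path.write_text(snapshot, encoding="utf-8")
    return True
```

In a scheduled job, a commit would run only when save_if_changed returns True. Sorting keys first keeps a reordered but otherwise identical response from showing up as a change.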

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Install the browser that shot-scraper needs: shot-scraper install
Requires a GitHub account, a Python environment (GitHub Codespaces recommended), and a Google account for AI Studio.
Links: git-scraper-template, shot-scraper-template
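
Once the setup steps above are done, shot-scraper can be driven from Python through its command line. The sketch below is a minimal example, assuming the shot-scraper javascript subcommand, which runs a JavaScript expression in the page and prints the result as JSON. The helper names, the URL, and the JavaScript expression are illustrative assumptions, not part of the workshop.

```python
import json
import subprocess


def build_command(url: str, javascript: str) -> list[str]:
    """Build the argument list for `shot-scraper javascript URL EXPRESSION`."""
    return ["shot-scraper", "javascript", url, javascript]


def scrape(url: str, javascript: str):
    """Run the expression in a headless browser and parse the JSON it prints.

    Requires shot-scraper and its browser to be installed, plus network access.
    """
    result = subprocess.run(
        build_command(url, javascript),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if shot-scraper exits non-zero
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hypothetical target: collect the text of every <h2> on a page.
    headings = scrape(
        "https://example.com/",
        "Array.from(document.querySelectorAll('h2'), el => el.textContent.trim())",
    )
    print(headings)
```

Passing the command as a list rather than a shell string avoids quoting problems when the JavaScript expression contains quotes.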

Highlighted Details

Video scraping with Gemini models for extracting structured data from screen recordings.
LLM schema extraction for parsing unstructured text, PDFs, and images into structured formats.
shot-scraper for browser automation, JavaScript execution, and full-page screenshots.
Git scraping via GitHub Actions for continuous monitoring of web resources.
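
To make the schema-extraction idea more concrete, the sketch below shows one possible shape for it: a JSON schema that could be supplied to a model that supports structured output, and a parser that converts the model's JSON reply into typed records. The Incident fields, the SCHEMA layout, and the sample reply are invented for illustration. The actual model call is omitted because the summary does not describe the workshop's own code.

```python
import json
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record type; the fields are assumptions for this example."""
    date: str
    location: str
    count: int


# A JSON schema describing the reply we would ask the model to produce.
SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "location": {"type": "string"},
                    "count": {"type": "integer"},
                },
                "required": ["date", "location", "count"],
            },
        }
    },
    "required": ["items"],
}


def parse_incidents(response_text: str) -> list[Incident]:
    """Parse a model's JSON reply into Incident records.

    Raises KeyError or ValueError if the reply does not match the expected shape.
    """
    payload = json.loads(response_text)
    return [
        Incident(
            date=str(item["date"]),
            location=str(item["location"]),
            count=int(item["count"]),
        )
        for item in payload["items"]
    ]
```

Parsing into a dataclass surfaces missing or malformed fields immediately, which is useful because model output should be checked before it is treated as data.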

Maintenance & Community

This repository is maintained by Simon Willison (simonw), a prominent figure in data journalism and tool development. He invites further collaboration and tool-development inquiries by email.

Licensing & Compatibility

The repository itself does not specify a license. The included tools and techniques may have their own licensing terms. Commercial use of LLM APIs will incur costs.

Limitations & Caveats

The workshop assumes familiarity with web scraping and Python. Some LLMs have rate limits or usage costs. Before using Google AI Studio for video scraping, review its data usage policies, especially if the recordings contain confidential information.

