nicar-2025-scraping  by simonw

Workshop on web scraping techniques

created 5 months ago
361 stars

Top 78.8% on sourcepulse

1 Expert Loves This Project
Project Summary

This repository provides materials for a data journalism workshop on cutting-edge web scraping techniques, targeting journalists with existing scraping experience. It demonstrates advanced methods including video-to-data extraction using LLMs, browser automation with Playwright, and leveraging GitHub Actions for automated scraping.

How It Works

The workshop covers four main areas:

  • Git scraping: using GitHub Actions to automate change tracking.
  • In-browser JavaScript and shot-scraper: interacting with complex sites and extracting data from them.
  • LLM-based scraping: using OpenAI and Google Gemini to turn unstructured data (including PDFs and images) into structured output.
  • Video scraping: feeding screen recordings into LLMs to extract data from websites that resist conventional scraping.
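The core of Git scraping can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's own code: the helper names `normalize_json` and `write_if_changed` are hypothetical, and the sketch assumes the scraped resource is JSON. The idea is to store each fetch in a stable, diff-friendly form and to report whether anything changed, so that an automated job only needs to commit when the data differs.

```python
import json
from pathlib import Path


def normalize_json(raw_text: str) -> str:
    """Re-serialize JSON with sorted keys and fixed indentation.

    A stable layout keeps Git diffs limited to real data changes.
    """
    data = json.loads(raw_text)
    return json.dumps(data, indent=2, sort_keys=True) + "\n"


def write_if_changed(path: Path, new_content: str) -> bool:
    """Write new_content to path only when it differs from what is stored.

    Returns True if the file was created or updated, False otherwise.
    """
    if path.exists() and path.read_text(encoding="utf-8") == new_content:
        return False
    path.write_text(new_content, encoding="utf-8")
    return True
```

In a GitHub Actions workflow, a step would typically fetch the resource, pass the response body through these helpers, and run `git commit` only when `write_if_changed` returns True. The commit history then becomes the change log.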

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Install the browser that shot-scraper uses: shot-scraper install
  • Requires a GitHub account, a Python environment (GitHub Codespaces recommended), and a Google account for AI Studio.
  • Links: git-scraper-template, shot-scraper-template
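Once the setup above is done, shot-scraper can be driven from Python as well as from the shell. The sketch below wraps the `shot-scraper javascript` command, which loads a page in a headless browser, runs a JavaScript expression, and prints the result as JSON. The function names `build_command` and `run_javascript` are hypothetical helpers for illustration, and the example assumes `shot-scraper` is on the PATH.

```python
import json
import subprocess


def build_command(url: str, javascript: str) -> list[str]:
    """Build the argument list for `shot-scraper javascript URL JAVASCRIPT`."""
    return ["shot-scraper", "javascript", url, javascript]


def run_javascript(url: str, javascript: str):
    """Run JavaScript against a page and return the decoded JSON result.

    Assumes shot-scraper and its browser are already installed.
    Raises subprocess.CalledProcessError if the command fails.
    """
    completed = subprocess.run(
        build_command(url, javascript),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

For example, `run_javascript("https://example.com/", "document.title")` would return the page title as a Python string. Passing the arguments as a list avoids shell quoting problems when the JavaScript contains quotes.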

Highlighted Details

  • Video scraping with Gemini models for extracting structured data from screen recordings.
  • LLM schema extraction for parsing unstructured text, PDFs, and images into structured formats.
  • shot-scraper for browser automation, JavaScript execution, and full-page screenshots.
  • Git scraping via GitHub Actions for continuous monitoring of web resources.
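The schema extraction idea can be illustrated without calling any particular model. The sketch below is an assumption-laden example, not the workshop's code: `SCHEMA`, `build_prompt`, and `parse_reply` are hypothetical names, the fields are invented for illustration, and the actual model call is left out. It shows the two pieces that surround the call: a prompt that names the desired fields, and a check that the reply really matches them. Provider APIs may offer their own schema-enforcement features, which this sketch does not use.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"name": str, "date": str, "amount": float}


def build_prompt(text: str) -> str:
    """Describe the wanted fields and append the unstructured source text."""
    fields = ", ".join(f"{name} ({typ.__name__})" for name, typ in SCHEMA.items())
    return (
        "Extract the following fields from the text and reply with a single "
        f"JSON object containing exactly these keys: {fields}.\n\nText:\n{text}"
    )


def parse_reply(reply: str) -> dict:
    """Decode the model's reply and check it against SCHEMA.

    Raises ValueError on invalid JSON, a missing field, or a wrong type.
    """
    record = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("reply is not a JSON object")
    for name, typ in SCHEMA.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        # Accept whole numbers where a float is expected (but not booleans).
        if typ is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, typ):
            raise ValueError(f"field {name} should be {typ.__name__}")
        record[name] = value
    return record
```

Validating the reply matters because a model can return well-formed JSON that still omits a field or uses the wrong type. Rejecting such replies early keeps bad rows out of the resulting dataset.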

Maintenance & Community

This repository is associated with Simon Willison, a prominent figure in data journalism and tool development. Further collaboration and tool development are encouraged via email.

Licensing & Compatibility

The repository itself does not specify a license. The included tools and techniques may have their own licensing terms. Commercial use of LLM APIs will incur costs.

Limitations & Caveats

The workshop assumes prior familiarity with web scraping and Python. Some LLMs have rate limits or usage costs. Video scraping with Google AI Studio is subject to data usage policies, which should be reviewed before submitting confidential information.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 25 stars in the last 90 days
