Scraperr by jaypyles

Self-hosted web scraper for data extraction via XPath

created 1 year ago
4,175 stars

Top 11.9% on sourcepulse

Project Summary

Scraperr is a self-hosted web application for extracting data from websites with XPath selectors. Its interface lets users submit URLs, define scrape targets, manage past jobs, and download results, and optional AI integration supports context-aware analysis of the scraped data.
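To make the idea of an XPath scrape target concrete, here is a minimal sketch using only the Python standard library. It is not Scraperr's own code: the sample page, the `extract` helper, and the selectors are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and needs well-formed markup, so a real scraper would likely rely on a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
    <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
  </body>
</html>
"""


def extract(page: str, xpath: str) -> list[str]:
    """Return the stripped text of every element matching `xpath`."""
    root = ET.fromstring(page.strip())
    return [(el.text or "").strip() for el in root.findall(xpath)]


# Each selector plays the role of one "scrape target".
names = extract(SAMPLE_PAGE, ".//div[@class='product']/h2")
prices = extract(SAMPLE_PAGE, ".//span[@class='price']")
print(list(zip(names, prices)))
```

Each selector yields one list of matches, which maps naturally onto one column of a results table.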

How It Works

Scraperr uses a queue-based system to manage scraping tasks, so users can submit multiple URLs and XPath queries at once. It can scrape every page within the same domain and accepts custom request headers supplied as JSON. Results are displayed in a sortable table, with options to download them as CSV and to rerun jobs. The application also includes user management for organizing scraping activities and an API powered by FastAPI.
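The flow described above (queued jobs, JSON headers, CSV export) can be sketched with the standard library alone. Everything named here is an assumption made for illustration: `ScrapeJob`, `parse_headers`, `fake_scrape`, `run_queue`, and `to_csv` are hypothetical and do not reflect Scraperr's internal data model, and the scrape step is a stand-in that performs no network requests.

```python
import csv
import io
import json
import queue
from dataclasses import dataclass, field


@dataclass
class ScrapeJob:
    """Hypothetical job record: one URL plus named XPath queries."""
    url: str
    xpaths: dict[str, str]
    headers: dict[str, str] = field(default_factory=dict)


def parse_headers(raw: str) -> dict[str, str]:
    """Parse user-supplied JSON headers; accept only a JSON object."""
    data = json.loads(raw) if raw.strip() else {}
    if not isinstance(data, dict):
        raise ValueError("headers must be a JSON object")
    return {str(k): str(v) for k, v in data.items()}


def fake_scrape(job: ScrapeJob) -> list[dict[str, str]]:
    """Stand-in for the real fetch-and-extract step (no network access)."""
    return [{"url": job.url, "name": name, "xpath": xp}
            for name, xp in job.xpaths.items()]


def run_queue(jobs: "queue.Queue[ScrapeJob]") -> list[dict[str, str]]:
    """Drain the queue in FIFO order and collect one row per XPath query."""
    rows: list[dict[str, str]] = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        rows.extend(fake_scrape(job))
        jobs.task_done()
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize result rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "name", "xpath"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


jobs: "queue.Queue[ScrapeJob]" = queue.Queue()
headers = parse_headers('{"User-Agent": "example-bot/0.1"}')
jobs.put(ScrapeJob("https://example.com/a", {"title": ".//h1"}, headers))
jobs.put(ScrapeJob("https://example.com/b", {"price": ".//span"}, headers))
rows = run_queue(jobs)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the queue, the scrape step, and the export as separate functions means a job can be rerun by putting the same `ScrapeJob` back on the queue.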

Quick Start & Requirements

  • Install/run: make deps build up-dev
  • Prerequisites: MongoDB (v5.0+ requires a CPU with AVX support), Python.
  • AI Integration: Ollama or OpenAI API endpoints.
  • Documentation: See the project docs for a quickstart guide.
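Because MongoDB 5.0+ needs AVX, it can help to check the host CPU before installing. The sketch below is an assumption-laden helper, not part of Scraperr: it only works on Linux, where `/proc/cpuinfo` lists CPU flags, and the function names `has_avx` and `host_has_avx` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def has_avx(cpuinfo_text: str) -> bool:
    """Return True if the first `flags` line of cpuinfo text lists `avx`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "avx" in value.split()
    return False


def host_has_avx(path: str = "/proc/cpuinfo") -> Optional[bool]:
    """Return True/False on Linux, or None when the file cannot be read."""
    try:
        return has_avx(Path(path).read_text())
    except OSError:
        return None


print("AVX available:", host_has_avx())
```

A result of `None` means the check could not run (for example on macOS or Windows), not that AVX is missing.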

Highlighted Details

  • Self-hosted web application for data scraping.
  • XPath-based element selection.
  • Job management: queueing, rerunning, downloading results (CSV).
  • Optional AI integration with Ollama and OpenAI.
  • FastAPI-powered API with documentation at /docs.

Maintenance & Community

  • Development is facilitated by a webapp template.
  • Contributions are welcome.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permits commercial use and linking from closed-source software.

Limitations & Caveats

MongoDB 5.0+ requires AVX CPU support, which may cause issues in certain virtual machine configurations. Users must ensure compliance with target websites' robots.txt and Terms of Service.
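One way to approach the robots.txt part of that responsibility is to test each URL against the site's rules before scraping it. The sketch below uses Python's standard `urllib.robotparser`; the sample rules, the `is_allowed` helper, and the `example-bot` user agent are hypothetical, and Terms of Service still have to be reviewed by a person.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download the file
# from the target site first.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/products",
            "https://example.com/private/report"):
    print(url, is_allowed(SAMPLE_ROBOTS, "example-bot", url))
```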

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

  • 2,763 stars in the last 90 days
