Scraperr by jaypyles

Self-hosted web scraper for data extraction via XPath

Created 1 year ago

4,796 stars

Top 10.3% on SourcePulse

Project Summary

Scraperr is a self-hosted web application for extracting data from websites with XPath selectors. Its interface lets users submit URLs, define scrape targets, manage past jobs, and download results, with optional AI integration for context-aware data analysis.
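To make the idea of an XPath scrape target concrete, here is a minimal sketch using only Python's standard library. It is not Scraperr's own code: the page fragment, the `select_text` helper, and the selector are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup, so a real scraper would likely rely on a fuller HTML/XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <h1>Products</h1>
    <ul>
      <li class="item"><a href="/p/1">Widget</a></li>
      <li class="item"><a href="/p/2">Gadget</a></li>
      <li class="ad"><a href="/promo">Sale</a></li>
    </ul>
  </body>
</html>
"""


def select_text(markup: str, xpath: str) -> list[str]:
    """Return the stripped text of every element matching the XPath expression."""
    root = ET.fromstring(markup.strip())
    return [(el.text or "").strip() for el in root.findall(xpath)]


# Select the link text inside list items whose class is exactly "item".
names = select_text(PAGE, ".//li[@class='item']/a")
print(names)
```

The selector skips the `ad` list item because the predicate `[@class='item']` matches on the attribute value, which is the kind of targeting an XPath-based scrape definition allows.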

How It Works

Scraperr uses a queue-based system to manage scraping tasks, so users can submit multiple URLs and XPath queries. It can scrape all pages within the same domain and accepts custom request headers supplied as JSON. Results are displayed in a sortable table, with options to download them as CSV and to rerun jobs. The application also includes user management for organizing scraping activities and an API powered by FastAPI.
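The queueing and JSON-header behaviour described above can be pictured with a small sketch. Everything here is an assumption made for illustration: `ScrapeJob`, `parse_headers`, and `drain` are hypothetical names, the queue is an in-memory `queue.Queue`, and the handler is a stand-in that performs no network access. Scraperr's actual job model and storage are not described in this summary.

```python
import json
import queue
from dataclasses import dataclass, field


@dataclass
class ScrapeJob:
    """Hypothetical job record: one URL, its XPath queries, optional headers."""
    url: str
    xpaths: list[str]
    headers: dict[str, str] = field(default_factory=dict)


def parse_headers(raw: str) -> dict[str, str]:
    """Parse user-supplied JSON headers, accepting only a JSON object."""
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("headers must be a JSON object")
    return {str(key): str(value) for key, value in parsed.items()}


def drain(jobs: queue.Queue, handler) -> list:
    """Process queued jobs in FIFO order until the queue is empty."""
    results = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        results.append(handler(job))
        jobs.task_done()
    return results


jobs = queue.Queue()
jobs.put(ScrapeJob("https://example.com/a", ["//h1"],
                   parse_headers('{"User-Agent": "demo"}')))
jobs.put(ScrapeJob("https://example.com/b", ["//h2", "//p"]))

# Stand-in handler: summarises each job instead of fetching anything.
summary = drain(jobs, lambda job: (job.url, len(job.xpaths), job.headers))
print(summary)
```

Validating that the headers decode to a JSON object before a job is queued means a malformed value fails at submission time rather than partway through a run.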

Quick Start & Requirements

Install/run: make deps build up-dev
Prerequisites: MongoDB (requires CPU with AVX support for v5.0+), Python.
AI Integration: Ollama or OpenAI API endpoints.
Documentation: View the docs for a quickstart guide.
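Because the MongoDB prerequisite depends on AVX, it can help to check the host CPU before installing. The following is a sketch that assumes a Linux host exposing `/proc/cpuinfo`; the helper names `cpu_flags` and `has_avx` are hypothetical and not part of Scraperr or MongoDB.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text: str) -> bool:
    """True only if the exact 'avx' flag is present (not just 'avx2')."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Assumption: Linux host.
    if cpuinfo.exists():
        print("AVX supported:", has_avx(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; check CPU features another way.")
```

Inside a virtual machine the result reflects the CPU features exposed to the guest, which may differ from those of the physical host.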

Highlighted Details

Self-hosted web application for data scraping.
XPath-based element selection.
Job management: queueing, rerunning, downloading results (CSV).
Optional AI integration with Ollama and OpenAI.
FastAPI-powered API with documentation at /docs.
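As an illustration of the CSV download mentioned above, this sketch serialises a list of result rows with Python's `csv` module. The column names (`url`, `xpath`, `text`) and the `rows_to_csv` helper are hypothetical; the actual layout of Scraperr's CSV export is not specified here.

```python
import csv
import io


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialise result rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scrape results; the comma in the last value forces quoting.
rows = [
    {"url": "https://example.com/a", "xpath": "//h1", "text": "Products"},
    {"url": "https://example.com/a", "xpath": "//li/a", "text": "Widget, large"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Using `csv.DictWriter` rather than joining strings by hand keeps values that contain commas or quotes intact when the file is read back.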

Maintenance & Community

Development is facilitated by a webapp template.
Contributions are welcome.

Licensing & Compatibility

Licensed under the MIT License.
The license permits commercial use and linking from closed-source software.

Limitations & Caveats

MongoDB 5.0+ requires a CPU with AVX support, which may cause issues in certain virtual machine configurations.
Users must ensure compliance with target websites' robots.txt and Terms of Service.
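One way to check the robots.txt part of that obligation before submitting a job is Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the robots.txt content, the `demo-bot` user agent, and the `is_allowed` helper are hypothetical, and nothing here indicates that Scraperr performs this check itself. It also does not cover Terms of Service, which have to be reviewed separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would fetch the site's own file.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.strip().splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "demo-bot", "https://example.com/products")
private_ok = is_allowed(ROBOTS_TXT, "demo-bot", "https://example.com/private/data")
print(public_ok, private_ok)
```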

Explore Similar Projects

oxylabs-ai-studio-py by oxylabs

dendrite-python-sdk by dendrite-systems

llm-api-engine by developersdigest

linkedinscraper by cwwmbm

entities-extraction-web-scraper by trancethehuman

AI-Web-Scraper by techwithtim

twitter-api-client by trevorhobenshield

ash by ash-project

google-maps-scraper by gosom

Auto_job_applier_linkedIn by GodsScion

scraping-apis-for-devs by cporter202

Scrapegraph-ai by ScrapeGraphAI