CyberScraper-2077  by itsOwen

Web scraper for AI-powered data extraction

created 11 months ago
1,735 stars

Top 25.2% on sourcepulse

Project Summary

CyberScraper 2077 is an AI-powered web scraping tool designed for extracting data from the internet, including .onion sites via Tor. It targets data analysts, netrunners, and researchers, offering intelligent parsing, multi-format exports, and a user-friendly Streamlit interface.

How It Works

This tool uses Large Language Models (LLMs) from OpenAI and Gemini, or local models served through Ollama, to understand and parse web content. It runs operations asynchronously for speed and caches results (content-based and query-based LRU caches) to reduce redundant API calls. Stealth mode parameters and the option to use the current browser instance are intended to help bypass bot detection.
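
The caching idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the names LRUCache, cache_key, extract, and call_llm are hypothetical, and combining the page content and the query into a single key is an assumption made for brevity.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional


class LRUCache:
    """Small least-recently-used cache with a fixed capacity (hypothetical helper)."""

    def __init__(self, capacity: int = 128) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def cache_key(page_content: str, query: str) -> str:
    """Hash the page content and the query so identical requests share one key."""
    digest = hashlib.sha256()
    digest.update(page_content.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(query.encode("utf-8"))
    return digest.hexdigest()


def extract(page_content: str, query: str, cache: LRUCache,
            call_llm: Callable[[str, str], str]) -> str:
    """Return a cached answer when available; otherwise call the model once."""
    key = cache_key(page_content, query)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = call_llm(page_content, query)  # call_llm stands in for any LLM client
    cache.put(key, result)
    return result
```

With this shape, asking the same question about unchanged page content a second time is served from memory and costs no additional API call.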

Quick Start & Requirements

  • Install: Clone the repository, create a virtual environment, run pip install -r requirements.txt, then run playwright install.
  • Prerequisites: Python 3.10+, OpenAI API key, Gemini API key (optional), Ollama (optional, requires pip install ollama and a downloaded model).
  • Docker: Available for easier setup.
  • Docs: YouTube demos available.
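
Before launching, the prerequisites listed above could be checked with a short script along these lines. This is a sketch under assumptions: missing_prerequisites is a hypothetical helper, and reading the key from an OPENAI_API_KEY environment variable is the common convention for OpenAI clients rather than something this summary confirms for the project.

```python
import os
import sys


def missing_prerequisites(version_info=sys.version_info, environ=os.environ):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The summary states Python 3.10+ as a requirement.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # Assumption: the OpenAI key is supplied through this environment variable.
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing prerequisite:", problem)
```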

Highlighted Details

  • AI-powered extraction using LLMs.
  • Sleek Streamlit GUI.
  • Multi-format export (JSON, CSV, HTML, SQL, Excel).
  • Tor network support for .onion sites.
  • Stealth mode and current browser instance to evade bot detection.
  • Ollama support for local LLMs.
  • Multi-page scraping (BETA) with flexible URL pattern detection.
  • Google Sheets integration.
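
To illustrate what multi-format export involves, the sketch below serializes the same extracted records to JSON and to CSV using only the Python standard library. The function names to_json and to_csv and the sample record fields are hypothetical; the project's own exporters may work differently.

```python
import csv
import io
import json


def to_json(records):
    """Serialize a list of dicts as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialize a list of dicts as CSV, filling missing fields with blanks."""
    if not records:
        return ""
    # Collect every key that appears, in first-seen order, for the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [{"title": "A", "price": "1"}, {"title": "B"}]
    print(to_json(rows))
    print(to_csv(rows))
```

Because LLM-extracted records do not always share the same fields, the CSV writer here takes the union of all keys instead of trusting the first record.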

Maintenance & Community

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

  • Multi-page scraping is in BETA and may have issues.
  • Captcha bypass currently works only when running natively, not in Docker.
  • The "Current Browser" feature should be used only when necessary.
  • Tor scraping requires careful legal and ethical consideration.
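
For context on the multi-page caveat, URL pattern detection can be as simple as finding a page number in a URL and substituting a range of values. The sketch below is an assumption-laden illustration, not the project's algorithm: expand_pages and PAGE_PATTERN are hypothetical names, and only two common URL shapes (?page=N and /page/N) are recognized.

```python
import re

# Matches "?page=3", "&page=3", or "/page/3" and captures the number.
PAGE_PATTERN = re.compile(r"(?:[?&]page=|/page/)(?P<number>\d+)")


def expand_pages(url, first, last):
    """Return one URL per page in [first, last], reusing the URL's page slot."""
    match = PAGE_PATTERN.search(url)
    if match is None:
        raise ValueError("no page number found in URL")
    start, end = match.span("number")
    return [f"{url[:start]}{page}{url[end:]}" for page in range(first, last + 1)]


if __name__ == "__main__":
    for page_url in expand_pages("https://example.com/products?page=1", 1, 3):
        print(page_url)
```

Sites that paginate in other ways (offsets, cursors, infinite scroll) would not match this pattern, which is one reason such a feature can be fragile.
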

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 62 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

2.1%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 15 hours ago