CyberScraper-2077 by itsOwen

Web scraper for AI-powered data extraction

Created 1 year ago
1,770 stars

Top 24.2% on SourcePulse

Project Summary

CyberScraper 2077 is an AI-powered web scraping tool designed for extracting data from the internet, including .onion sites via Tor. It targets data analysts, netrunners, and researchers, offering intelligent parsing, multi-format exports, and a user-friendly Streamlit interface.

How It Works

This tool uses Large Language Models (LLMs) from OpenAI and Gemini, or local models served through Ollama, to understand and parse web content. It runs operations asynchronously for speed and caches results (content-based and query-based LRU caches) to reduce redundant API calls. Stealth mode parameters and the option to use the current browser instance are intended to help bypass bot detection.
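The caching idea can be illustrated with a minimal sketch. It assumes the cache key is derived from both the page content and the query, and that the least recently used entry is evicted once a size limit is reached. The names `LRUCache`, `cache_key`, `extract_with_cache`, and `call_llm` are hypothetical and are not taken from the project's code.

```python
import hashlib
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache built on OrderedDict (illustrative only)."""

    def __init__(self, max_size=128):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict the least recently used entries once over capacity.
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)


def cache_key(page_content, query):
    """Hash the page content and the query together (assumed key scheme)."""
    digest = hashlib.sha256()
    digest.update(page_content.encode("utf-8"))
    digest.update(b"\x00")  # separator so content/query boundaries stay distinct
    digest.update(query.encode("utf-8"))
    return digest.hexdigest()


def extract_with_cache(cache, page_content, query, call_llm):
    """Return a cached answer if present; otherwise call the LLM and store it.

    `call_llm` is a hypothetical callable standing in for any LLM client.
    """
    key = cache_key(page_content, query)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = call_llm(page_content, query)
    cache.put(key, result)
    return result
```

With this shape, repeating the same query against unchanged page content returns the stored result without another API call, while a changed page or a different query produces a new key and a fresh call.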

Quick Start & Requirements

  • Install: Clone the repository, create a virtual environment, run pip install -r requirements.txt, then run playwright install.
  • Prerequisites: Python 3.10+, OpenAI API key, Gemini API key (optional), Ollama (optional, requires pip install ollama and model download).
  • Docker: Available for easier setup.
  • Docs: YouTube demos available.

Highlighted Details

  • AI-powered extraction using LLMs.
  • Sleek Streamlit GUI.
  • Multi-format export (JSON, CSV, HTML, SQL, Excel).
  • Tor network support for .onion sites.
  • Stealth mode and use of the current browser instance to evade bot detection.
  • Ollama support for local LLMs.
  • Multi-page scraping (BETA) with flexible URL pattern detection.
  • Google Sheets integration.
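
As an illustration of the multi-format export listed above, the following sketch writes the same extracted rows to JSON, CSV, and a SQL table using only the Python standard library. The summary does not describe how the project implements its exports, so the function names (`to_json`, `to_csv`, `to_sqlite`), the `scraped_items` table name, and the sample rows are all assumptions made for this example.

```python
import csv
import io
import json
import sqlite3


def to_json(rows):
    """Serialize a list of dicts as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize a list of dicts as CSV, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows, connection, table="scraped_items"):
    """Insert rows into a SQLite table (hypothetical table name); return the row count."""
    if not rows:
        return 0
    columns = list(rows[0].keys())
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    # Values go through placeholders; only identifiers are interpolated.
    connection.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [tuple(row.get(name) for name in columns) for row in rows],
    )
    connection.commit()
    return len(rows)


# Example rows, invented for illustration.
rows = [
    {"title": "A", "price": "1.50"},
    {"title": "B", "price": "2.00"},
]
connection = sqlite3.connect(":memory:")
inserted = to_sqlite(rows, connection)
```

Keeping the extracted data as a plain list of dictionaries means each output format is a thin serializer over the same structure. Table and column names here come from trusted code; they would need validation if they ever came from user input.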

Maintenance & Community

Licensing & Compatibility

  • MIT License, which permits commercial use and linking from closed-source software.

Limitations & Caveats

  • Multi-page scraping is in BETA and may have issues.
  • CAPTCHA bypass currently works only in native installs, not in Docker.
  • The "Current Browser" feature is intended for use only when necessary.
  • Tor scraping requires careful legal and ethical consideration.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 30 days
