tap4-ai-crawler  by 6677-ai

Open-source web crawler for AI tool detail extraction

created 1 year ago
279 stars

Top 94.1% on sourcepulse

Project Summary

This project provides a web crawler that extracts website information, generates screenshots, and uses LLMs to summarize content and create SEO-friendly descriptions. It's designed for individual developers managing AI tool directories and learners interested in Python web scraping and AI integration.

How It Works

The crawler is a lightweight Python application. It fetches the title, description, and introduction from each specified URL. It also generates web page screenshots and uses LLMs (like Llama 3 or ChatGPT via Groq) to turn each website introduction into a summarized, SEO-optimized Markdown description.
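
To make the extraction step concrete, here is a minimal sketch of pulling a page title and meta description out of raw HTML using only the Python standard library. It is an illustration under stated assumptions, not the project's actual code: the names PageInfoParser and extract_page_info are hypothetical, and the real crawler may rely on a different parsing approach or a headless browser.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content.

    Hypothetical helper for illustration; not taken from the project.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is accumulated.
        if self._in_title:
            self.title += data


def extract_page_info(html: str) -> dict:
    """Return the title and meta description found in an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "description": parser.description}
```

A dictionary like this could then be handed to an LLM prompt for summarization; that step is omitted here because the prompt and model settings are not described in this summary.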

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/6677-ai/tap4-ai-crawler.git) and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.x, a Groq API key, and S3-compatible object storage (e.g., Cloudflare R2) with credentials (Endpoint URL, Bucket Name, Access Key ID, Secret Access Key, Custom Domain).
  • Running: Configure environment variables in .env and run python main_api.py.
  • Deployment: Instructions are provided for deployment on Zeabur.
  • API Usage: A REST API is exposed; use curl to send POST requests with JSON payloads containing the URL and optional tags.
  • Docs: tap4.ai
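
The API usage step above can be sketched in Python as an alternative to curl. This is a hedged example: the endpoint path /site/crawl, the local port 8040, and the JSON field names url and tags are assumptions made for illustration, as is the expectation that the service replies with JSON. Check the project's README for the real route and any required authorization header.

```python
import json
import urllib.request


def build_crawl_request(base_url, site_url, tags=None):
    """Build a POST request with a JSON payload holding the URL and optional tags.

    The "/site/crawl" path and the payload field names are assumptions.
    """
    payload = {"url": site_url}
    if tags:
        payload["tags"] = list(tags)
    return urllib.request.Request(
        base_url.rstrip("/") + "/site/crawl",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_crawl_request(request, timeout=60):
    """Send the request and decode the response body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumes python main_api.py is already running locally on this port.
    req = build_crawl_request("http://127.0.0.1:8040", "https://example.com", ["ai-tools"])
    print(send_crawl_request(req))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.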

Highlighted Details

  • Supports LLM integration for content summarization and description generation.
  • Captures web page screenshots and thumbnails.
  • Exposes a REST API for programmatic access.
  • Offers quick configuration and fast deployment options.

Maintenance & Community

The project is associated with tap4.ai. Contact information includes a Twitter handle (https://x.com/tap4ai) and WeChat contact for inquiries.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

Crawling may fail due to anti-scraping measures, in which case the target site needs to be checked manually. LLM output may not always meet expectations and might require prompt optimization or manual review. Web scraping requires specific server configurations; paid services like Zeabur with U.S. nodes are recommended for optimal performance.
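
One way to act on these caveats is to flag suspicious results for a human to check. The sketch below is purely illustrative: the function name needs_manual_review, the set of status codes treated as "blocked", and the result fields are all assumptions, not part of the project.

```python
# Status codes that commonly indicate a blocked or throttled request (an assumption).
BLOCKED_STATUS_CODES = {401, 403, 429, 503}


def needs_manual_review(status_code, title, summary):
    """Return a list of reasons why a crawl result should be checked by hand.

    An empty list means no problem was detected by these simple checks.
    """
    reasons = []
    if status_code in BLOCKED_STATUS_CODES:
        reasons.append(f"HTTP {status_code} suggests the request was blocked")
    if not (title or "").strip():
        reasons.append("no page title was extracted")
    if not (summary or "").strip():
        reasons.append("the LLM summary is empty")
    return reasons
```

These checks only catch obvious failures; a summary that is present but low quality would still need prompt tuning or a manual read.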

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

19 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

2.1%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 15 hours ago