tap4-ai-crawler by 6677-ai

Open-source web crawler for AI tool detail extraction

Created 1 year ago

283 stars

Top 92.4% on SourcePulse

Project Summary

This project provides a web crawler that extracts website information, generates screenshots, and uses LLMs to summarize content and create SEO-friendly descriptions. It's designed for individual developers managing AI tool directories and learners interested in Python web scraping and AI integration.

How It Works

The crawler is written in Python and kept lightweight. Given a URL, it fetches the page's title, description, and introduction, and captures a screenshot of the page. It then passes the introduction to an LLM (like Llama 3 or ChatGPT via Groq), which returns a summarized, SEO-optimized description in Markdown.
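The title-and-description step can be illustrated with a minimal sketch that uses only the Python standard library. This is not the project's own code: the names `PageInfoParser` and `extract_page_info` are hypothetical, and the sketch assumes the description is taken from a `<meta name="description">` tag, which the summary above does not confirm.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Hypothetical helper: collects <title> text and the meta description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # Assumption: the description lives in <meta name="description">.
            if (attr_map.get("name") or "").lower() == "description":
                self.description = (attr_map.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element.
        if self._in_title:
            self.title += data


def extract_page_info(html: str) -> dict:
    """Return the page title and meta description found in an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "description": parser.description}
```

In a real crawl the HTML string would come from an HTTP fetch or a headless browser; the parsing step is kept separate here so it can be exercised without network access.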

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/6677-ai/tap4-ai-crawler.git) and install dependencies (pip install -r requirements.txt).
Prerequisites: Python 3.x, a Groq API key, and S3-compatible object storage (e.g., Cloudflare R2) with credentials (Endpoint URL, Bucket Name, Access Key ID, Secret Access Key, Custom Domain).
Running: Configure environment variables in .env and run python main_api.py.
Deployment: Instructions are provided for deployment on Zeabur.
API Usage: A REST API is exposed; use curl to send POST requests with JSON payloads containing the URL and optional tags.
Docs: tap4.ai
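As a rough illustration of the API usage described above, the following sketch builds an equivalent POST request in Python instead of curl. The endpoint address `API_URL` and the JSON field names `url` and `tags` are placeholders assumed for this example; check the project's README for the actual route, port, field names, and any authentication header.

```python
import json
import urllib.request

# Placeholder address: the real host, port, and path are not given in this summary.
API_URL = "http://localhost:8000/crawl"


def build_crawl_request(site_url, tags=None, api_url=API_URL):
    """Build a JSON POST request carrying a URL and optional tags."""
    # Assumed field names; the summary only says the payload contains
    # "the URL and optional tags".
    payload = {"url": site_url}
    if tags:
        payload["tags"] = list(tags)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request object can be sent with `urllib.request.urlopen(build_crawl_request("https://example.com", ["ai"]))` once the service is running. Building the request separately from sending it makes the payload easy to inspect first.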

Highlighted Details

Supports LLM integration for content summarization and description generation.
Captures web page screenshots and thumbnails.
Exposes a REST API for programmatic access.
Offers quick configuration and fast deployment options.

Maintenance & Community

The project is associated with tap4.ai. Contact information includes a Twitter handle (https://x.com/tap4ai) and a WeChat contact for inquiries.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

Crawling may fail due to anti-scraping measures, in which case the affected sites need to be checked manually. LLM output may not always meet expectations and might require prompt optimization or manual review. Web scraping also requires specific server configurations; paid services like Zeabur with U.S. nodes are recommended for optimal performance.
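One way to handle blocked crawls is to retry a few times and then flag the URL for manual review. The sketch below is illustrative only and does not reflect how the project itself handles failures; `fetch_or_flag`, its parameters, and the `needs_manual_check` field are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def _default_fetch(url, timeout=10):
    """Fetch a page body as text using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_or_flag(url, fetch=None, attempts=3, delay_seconds=1.0):
    """Try to fetch a URL; after repeated failures, flag it for manual review."""
    if fetch is None:
        fetch = _default_fetch
    last_error = None
    for attempt in range(attempts):
        try:
            return {"url": url, "html": fetch(url), "needs_manual_check": False}
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc
            # Back off exponentially between attempts, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds * (2 ** attempt))
    return {
        "url": url,
        "html": None,
        "needs_manual_check": True,
        "error": str(last_error),
    }
```

Records returned with `needs_manual_check` set to `True` can be collected into a list for a person to review, rather than being passed on to the LLM step with empty content.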
