deepseek-ai-web-crawler by bhancockio

Web crawler for venue data extraction

Created 9 months ago

475 stars

Top 64.3% on SourcePulse

Project Summary

This project provides a Python-based web crawler that extracts wedding venue data from websites. It is aimed at users interested in web scraping and in using language models for information extraction, and it exports the collected venue details to a CSV file.

How It Works

The crawler uses the Crawl4AI library to fetch pages asynchronously. Rather than relying on CSS selectors alone, it hands page content to a large language model (LLM) that extracts the venue information, which makes parsing more tolerant of differences in page layout. Extracted data is structured using Pydantic models and saved to a CSV file.
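The structuring and export steps described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses a standard-library dataclass as a stand-in for the Pydantic model, and the Venue fields and the function names parse_venues and venues_to_csv are hypothetical.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Venue:
    # Hypothetical fields for illustration; the project's Pydantic model may differ.
    name: str
    location: str
    price: str


def parse_venues(raw_json: str) -> list[Venue]:
    """Turn an LLM's JSON array into Venue records, skipping incomplete items."""
    required = [f.name for f in fields(Venue)]
    venues = []
    for item in json.loads(raw_json):
        # Keep only objects that carry every required field.
        if isinstance(item, dict) and set(required).issubset(item):
            venues.append(Venue(**{key: str(item[key]) for key in required}))
    return venues


def venues_to_csv(venues: list[Venue]) -> str:
    """Serialize venue records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Venue)])
    writer.writeheader()
    for venue in venues:
        writer.writerow(asdict(venue))
    return buffer.getvalue()
```

In a real run, the JSON text would come from the LLM extraction step and the CSV text would be written to a file; a Pydantic model would additionally validate field types.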

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Requires Python 3.12.
A Groq API key must be set as GROQ_API_KEY in a .env file.
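As a rough sketch of how the key might be read at startup, the snippet below parses KEY=VALUE lines and fails early when GROQ_API_KEY is missing. It assumes a simple .env format and uses only the standard library; the helper names parse_env_text, load_env_file, and get_groq_api_key are hypothetical, and the project itself may rely on a dedicated .env loader instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(path: str = ".env") -> None:
    """Copy values from a .env file into os.environ; existing variables win."""
    env_path = Path(path)
    if env_path.is_file():
        for key, value in parse_env_text(env_path.read_text(encoding="utf-8")).items():
            os.environ.setdefault(key, value)


def get_groq_api_key() -> str:
    """Return GROQ_API_KEY or raise a clear error if it is not set."""
    key = os.environ.get("GROQ_API_KEY", "")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
    return key
```

Calling load_env_file() once before get_groq_api_key() surfaces a missing key immediately rather than partway through a crawl.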

Highlighted Details

Asynchronous web crawling with Crawl4AI.
LLM-powered data extraction strategy.
Pydantic models for data structuring.
Modular code structure for extensibility.

Maintenance & Community

No specific information on contributors, community channels, or roadmap is provided in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

The project currently uses print statements for logging, which may not be suitable for production environments. The effectiveness of the LLM extraction strategy depends on the quality of the prompts and on the LLM's capabilities.
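One way to address the logging caveat is Python's standard logging module. The sketch below is illustrative only: the logger name "crawler" and the functions configure_logging and report_page are hypothetical and do not come from the project.

```python
import logging

# Hypothetical logger name; any module-level name works.
logger = logging.getLogger("crawler")


def configure_logging(level: int = logging.INFO) -> None:
    """Send timestamped log lines to stderr at the given level."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def report_page(page_number: int, venue_count: int) -> None:
    """Log the outcome of one crawled page instead of printing it."""
    if venue_count == 0:
        logger.warning("No venues extracted from page %d", page_number)
    else:
        logger.info("Extracted %d venues from page %d", venue_count, page_number)
```

Unlike print, log calls carry a severity level, so output can be filtered or redirected to a file without touching the crawling code.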
