deepseek-ai-web-crawler  by bhancockio

Web crawler for venue data extraction

created 6 months ago
421 stars

Top 71.0% on sourcepulse

Project Summary

This project provides a Python-based web crawler that extracts wedding venue data from websites and exports the results to a CSV file. It is aimed at users interested in web scraping and in using language models for information extraction.

How It Works

The crawler uses asynchronous programming via the Crawl4AI library to traverse pages efficiently. Its main distinguishing feature is the use of a large language model (LLM) for data extraction, which allows more flexible parsing of venue information than CSS selectors alone. Extracted data is structured with Pydantic models and saved to a CSV file.
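The structuring and export step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without third-party packages, and the Venue fields, the function names, and the JSON shape of the LLM output are all assumptions.

```python
import csv
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class Venue:
    # Hypothetical fields; the project's real schema may differ.
    name: str
    location: str
    price: str


def parse_llm_output(raw_json: str) -> list[Venue]:
    """Turn a JSON array (assumed LLM output) into Venue records.

    Records missing any required field are skipped rather than guessed at.
    """
    required = [f.name for f in fields(Venue)]
    venues = []
    for record in json.loads(raw_json):
        if all(key in record for key in required):
            venues.append(Venue(**{key: str(record[key]) for key in required}))
    return venues


def save_venues_to_csv(venues: list[Venue], path: Path) -> int:
    """Write a header plus one row per venue; return the number of rows."""
    column_names = [f.name for f in fields(Venue)]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        writer.writeheader()
        for venue in venues:
            writer.writerow(asdict(venue))
    return len(venues)
```

With a Pydantic model in place of the dataclass, the field check in parse_llm_output would normally be handled by the model's own validation.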

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Requires Python 3.12.
  • A Groq API key must be set as GROQ_API_KEY in a .env file.
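As a rough sketch of the last requirement, the snippet below reads GROQ_API_KEY from a .env file and fails early if it is missing. The small parser is a standard-library stand-in for a dedicated .env loader; how the project itself loads the file is not stated here, and the function names are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_groq_key(env_path: Path = Path(".env")) -> str:
    """Return GROQ_API_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("GROQ_API_KEY")
    if not key and env_path.exists():
        key = load_env_file(env_path).get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to the .env file")
    return key
```

Checking for the key at startup surfaces a missing configuration before any crawling or LLM calls begin.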

Highlighted Details

  • Asynchronous web crawling with Crawl4AI.
  • LLM-powered data extraction strategy.
  • Pydantic models for data structuring.
  • Modular code structure for extensibility.
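To illustrate the asynchronous crawling pattern in general terms, here is a minimal asyncio sketch. It does not use Crawl4AI: fetch_page is a hypothetical stand-in that only simulates network latency, and the concurrency limit and example URLs are assumptions made for the illustration.

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for a real crawl call; it only simulates latency.
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"


async def crawl_all(urls: list[str], max_concurrency: int = 3) -> dict[str, str]:
    """Fetch all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> tuple[str, str]:
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return url, await fetch_page(url)

    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["https://example.com/venues?page=1"]))
    print(len(pages))
```

Bounding concurrency with a semaphore keeps the crawler from opening too many connections to one site at the same time.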

Maintenance & Community

No specific information on contributors, community channels, or roadmap is provided in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

The project currently logs with print statements, which may not be suitable for production environments. The effectiveness of the LLM extraction strategy depends on the quality of the prompts and on the LLM's capabilities.
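One possible way to address the logging caveat is to route messages through Python's standard logging module instead of print. The sketch below is an assumption about how that could look; the logger name venue_crawler, the helper names, and the message wording are hypothetical and not taken from the project.

```python
import logging


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger with one stream handler; safe to call repeatedly."""
    logger = logging.getLogger("venue_crawler")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def report_page(logger: logging.Logger, page_number: int, venue_count: int) -> None:
    """Log a per-page result at a severity that matches the outcome."""
    if venue_count == 0:
        logger.warning("No venues found on page %d", page_number)
    else:
        logger.info("Extracted %d venues from page %d", venue_count, page_number)
```

Unlike print, this lets operators filter by severity and redirect output to files or log collectors without touching the crawl code.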

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

78 stars in the last 90 days
