Web crawler for venue data extraction
This project provides a Python-based web crawler designed to extract wedding venue data from websites. It targets users interested in data scraping and leveraging language models for information extraction, offering a structured approach to collecting and exporting venue details to a CSV file.
How It Works
The crawler uses the Crawl4AI library's asynchronous API to traverse pages efficiently. Instead of relying only on CSS selectors, it passes page content to a large language model (LLM) for extraction, which makes parsing more tolerant of differences in page layout. Extracted records are structured with Pydantic models and saved to a CSV file.
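The structure-then-export step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without extra dependencies, the field names (name, location, price) are hypothetical, and the hardcoded list stands in for whatever the LLM extraction step returns.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Venue:
    # Hypothetical fields; the project's actual schema is not given here.
    name: str
    location: str
    price: str


def is_complete(record: dict) -> bool:
    """Return True if every Venue field is present and non-empty."""
    return all(record.get(f.name) for f in fields(Venue))


def venues_to_csv(records: list[dict]) -> str:
    """Keep only complete records and render them as CSV text."""
    field_names = [f.name for f in fields(Venue)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    for record in records:
        if not is_complete(record):
            continue  # skip records the extractor left partially filled
        venue = Venue(**{name: record[name] for name in field_names})
        writer.writerow(asdict(venue))
    return out.getvalue()


# Stand-in for records returned by an LLM extraction step (assumed shape).
raw_records = [
    {"name": "Rose Hall", "location": "Austin", "price": "$5000"},
    {"name": "Oak Barn", "location": "Dallas"},  # missing price, dropped
]
csv_text = venues_to_csv(raw_records)
```

Validating each record before writing keeps partially extracted rows out of the CSV, which matters when the extractor is an LLM whose output may omit fields.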
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
The GROQ_API_KEY environment variable must be set in a .env file.
Highlighted Details
Maintenance & Community
No specific information on contributors, community channels, or roadmap is provided in the README.
Licensing & Compatibility
The README does not specify a license.
Limitations & Caveats
The project currently uses print statements for logging, which may not be suitable for production environments. The effectiveness of the LLM extraction strategy depends on the quality of the prompts and on the LLM's capabilities.
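As an illustration of the logging caveat, print calls could be swapped for Python's standard logging module along these lines. This is a sketch under assumptions: the logger name venue_crawler and the helper report_page are hypothetical and do not come from the project.

```python
import logging

# Hypothetical logger name; the project itself only uses print().
logger = logging.getLogger("venue_crawler")


def configure_logging(level: int = logging.INFO) -> None:
    """Send timestamped log records to stderr at the given level."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def report_page(page_number: int, venue_count: int) -> None:
    """Log what a print() call might otherwise have reported."""
    logger.info("Extracted %d venues from page %d", venue_count, page_number)


configure_logging()
report_page(1, 12)
```

Unlike print, this lets the level, format, and destination be changed in one place without touching the crawling code.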