Web crawler for Vietnamese news sites
This project provides a specialized Python crawler for collecting news and articles from Vietnamese online newspapers and websites. It's designed for data projects and educational purposes, enabling users to understand Python techniques like OOP and multiprocessing without deep HTML parsing knowledge.
How It Works
The crawler runs on a recurring schedule (e.g., every 15 minutes) driven by crontab. During each cycle, it iterates through the configured sites, identifies article links, and extracts data. Collected data is stored in PostgreSQL and can be pushed to Elasticsearch, RabbitMQ, or exported as JSON. Its key advantage is that multi-site crawling and site structure analysis are handled through shareable plain-text configuration files (YAML files holding XPath expressions), making it accessible to users without extensive parsing experience.
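One crawl cycle as described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `SITE_CONFIGS`, `extract_articles`, and `run_cycle`, the config keys, and the example URL are all hypothetical. The dictionary stands in for a parsed YAML file, and the standard library's limited XPath support stands in for a full HTML parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration. In the real project this kind of
# information lives in a YAML file; a dict stands in for the parsed result.
SITE_CONFIGS = [
    {
        "name": "example-news",
        "base_url": "https://news.example.vn",
        "link_xpath": ".//h3[@class='title']/a",
    },
]

# Stand-in for a downloaded home page. It is well-formed on purpose so that
# ElementTree can parse it; real pages usually need a tolerant HTML parser.
SAMPLE_PAGE = """
<html><body>
  <h3 class="title"><a href="/a/1.html">First article</a></h3>
  <h3 class="title"><a href="/a/2.html">Second article</a></h3>
  <h3 class="other"><a href="/ads/9.html">Advert</a></h3>
</body></html>
"""


def extract_articles(page, config, seen):
    """Return articles matched by the site's XPath, skipping URLs already seen."""
    root = ET.fromstring(page.strip())
    articles = []
    for link in root.findall(config["link_xpath"]):
        url = config["base_url"] + link.get("href")
        if url in seen:
            continue  # collected in an earlier cycle
        seen.add(url)
        articles.append(
            {"site": config["name"], "url": url, "title": (link.text or "").strip()}
        )
    return articles


def run_cycle(configs, fetch, seen):
    """One crawl cycle: visit every configured site and gather new articles."""
    collected = []
    for config in configs:
        page = fetch(config["base_url"])
        collected.extend(extract_articles(page, config, seen))
    return collected


if __name__ == "__main__":
    seen_urls = set()
    # The fetch function is injected so the sketch runs without network access.
    new_articles = run_cycle(SITE_CONFIGS, lambda url: SAMPLE_PAGE, seen_urls)
    print(json.dumps(new_articles, ensure_ascii=False, indent=2))
```

Keeping the XPath in the configuration rather than in the code is what lets a new site be added without touching the crawler itself. A scheduler such as crontab would simply invoke a script like this at each interval, with the set of seen URLs persisted between runs.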
Quick Start & Requirements
Run bash install.sh after setting DOCBAO_BASE_DIR in SETTINGS.env.
Highlighted Details
Uses the underthesea library.
Maintenance & Community
The project is actively developed by Đặng Hải Lộc (hailoc12). It has received community support and contributions, leading to its open-sourcing and use in commercial projects like VnAlert. The author aims to foster an educational and open-source community around the project.
Licensing & Compatibility
The repository does not explicitly state a license in the README. This requires clarification for commercial use or closed-source linking.
Limitations & Caveats
The README mentions that documentation for the config manager tool and the API server is still under development. The project's license is not specified, which could be a barrier for commercial adoption.