docbao by hailoc12

Web crawler for Vietnamese news sites

created 7 years ago
268 stars

Top 96.5% on sourcepulse

Project Summary

This project provides a specialized Python crawler for collecting news and articles from Vietnamese online newspapers and websites. It is designed for data projects and for education: users can study Python techniques such as OOP and multiprocessing, and can run the crawler without deep knowledge of HTML parsing.

How It Works

The crawler runs on a repeating schedule driven by crontab (e.g., every 15 minutes). In each cycle it iterates through the configured sites, identifies article links, and extracts data. Collected data is stored in PostgreSQL and can be pushed to Elasticsearch or RabbitMQ, or exported as JSON. Its key advantage is that each site's structure is described in a shareable plain-text configuration file (YAML with XPath expressions), which simplifies multi-site crawling and makes the tool accessible to users without extensive parsing experience.
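As a rough illustration of a configuration-driven crawl cycle, the sketch below extracts article links from one page using an XPath expression taken from a per-site config. It is not docbao's actual code: the config keys (`name`, `base_url`, `link_xpath`), the function names, and the sample page are all hypothetical, a plain dict stands in for the YAML file, and the standard library's limited XPath subset stands in for a full HTML/XPath engine.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical per-site configuration. docbao's real files are YAML and
# their keys may differ; a dict is used here to stay within the stdlib.
SITE_CONFIG = {
    "name": "example-news",
    "base_url": "https://news.example.com",
    "link_xpath": ".//a[@class='article']",
}

# Stand-in for a downloaded home page. It is well-formed markup so that
# ElementTree can parse it; real-world HTML usually needs a lenient parser.
SAMPLE_PAGE = """
<html><body>
  <a class="article" href="/a/1">First story</a>
  <a class="nav" href="/about">About</a>
  <a class="article" href="/a/2">Second story</a>
</body></html>
"""


def extract_links(page_source, config):
    """Return (title, absolute_url) pairs matched by the configured XPath."""
    root = ET.fromstring(page_source.strip())
    links = []
    for node in root.findall(config["link_xpath"]):
        href = node.get("href")
        if href:
            title = (node.text or "").strip()
            links.append((title, urljoin(config["base_url"], href)))
    return links


def run_cycle(configs, fetch):
    """One crawl cycle: visit every configured site and collect its links."""
    return {cfg["name"]: extract_links(fetch(cfg), cfg) for cfg in configs}


# The fetch callable is injected so the sketch runs without network access.
result = run_cycle([SITE_CONFIG], fetch=lambda cfg: SAMPLE_PAGE)
print(result)
```

Because the site-specific knowledge lives entirely in the config, adding a site in this sketch means adding a config entry rather than writing new parsing code, which is the idea the description above attributes to the project.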

Quick Start & Requirements

  • Install: Clone the repository and run bash install.sh after setting DOCBAO_BASE_DIR in SETTINGS.env.
  • Prerequisites: Ubuntu (16.04+), Windows 10 with WSL, or Raspberry Pi. Requires admin privileges for installation.
  • Setup: Detailed installation steps are provided for Ubuntu and Raspberry Pi.
  • Docs: https://github.com/hailoc12/docbao

Highlighted Details

  • No-code configuration for adding/removing crawl sources via a dedicated tool.
  • Unified configuration language and algorithm for crawling diverse sites, including those using Ajax or requiring login.
  • Integrated anti-blocking techniques and solutions for common crawling issues, developed over 3 years of use on hundreds of Vietnamese sites.
  • Structured data output down to paragraph level, preserving original article formatting.
  • Multiple data export options: PostgreSQL, Elasticsearch, RabbitMQ, API Server, JSON, and a built-in frontend.
  • Includes keyword analysis and trend detection using the underthesea library.
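To make "structured data output down to paragraph level" concrete, here is a minimal sketch of an article record whose body is kept as a list of typed paragraphs and then exported as JSON. The class names, field names, and paragraph kinds are assumptions for illustration only; the project's real schema is not described in this summary.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Paragraph:
    kind: str      # hypothetical labels, e.g. "heading" or "text"
    content: str


@dataclass
class Article:
    url: str
    title: str
    paragraphs: list = field(default_factory=list)


def to_json(article):
    """Serialize an article, keeping Vietnamese characters readable."""
    return json.dumps(asdict(article), ensure_ascii=False, indent=2)


article = Article(
    url="https://news.example.com/a/1",  # placeholder URL
    title="Tin mẫu",
    paragraphs=[
        Paragraph("heading", "Tiêu đề phụ"),
        Paragraph("text", "Đoạn văn đầu tiên."),
    ],
)

exported = to_json(article)
print(exported)
```

Keeping paragraphs as separate typed items, rather than one text blob, is what lets a downstream consumer rebuild headings and body text in their original order.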

Maintenance & Community

The project is developed by Đặng Hải Lộc (hailoc12). Community support and contributions led to its being open-sourced, and it has been used in commercial projects such as VnAlert. The author aims to foster an educational, open-source community around the project.

Licensing & Compatibility

The README does not state a license. Anyone planning commercial use or closed-source linking should clarify the terms with the author first.

Limitations & Caveats

The README mentions that documentation for the config manager tool and the API server is still under development. The project's license is not specified, which could be a barrier for commercial adoption.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Stars: 1 in the last 90 days