docbao by hailoc12

Web crawler for Vietnamese news sites

Created 7 years ago

269 stars

Top 95.6% on SourcePulse

Project Summary

This project provides a specialized Python crawler for collecting news and articles from Vietnamese online newspapers and websites. It is intended for data projects and for teaching: the codebase demonstrates Python techniques such as OOP and multiprocessing, and users can run it without deep knowledge of HTML parsing.

How It Works

The crawler runs on a fixed cycle (e.g., every 15 minutes) scheduled with crontab. In each cycle it iterates through the configured sites, identifies article links, and extracts data. Collected data is stored in PostgreSQL and can be pushed to Elasticsearch or RabbitMQ, or exported as JSON. Its key advantage is that each site's structure is described in a shareable plain-text configuration file (YAML and XPath), which simplifies multi-site crawling and makes the tool accessible to users without extensive parsing experience.
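The cycle described above can be illustrated with a minimal sketch. It is not docbao's actual code or configuration schema: the SITES dictionary, its keys, and the extract_links and run_cycle helpers are hypothetical names chosen for illustration. The standard library's xml.etree.ElementTree stands in for a real HTML parser, so the sketch only handles well-formed markup and a limited XPath subset, and downloading pages is left out.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration (docbao's real config format is not shown here).
SITES = {
    "example-news": {
        "home_url": "https://news.example.com/",
        "link_xpath": ".//a[@class='title']",
    },
}

def extract_links(markup, link_xpath):
    """Return the href of every element matched by link_xpath."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(link_xpath) if a.get("href")]

def run_cycle(pages, seen):
    """One crawl cycle: visit each configured site, keep links not seen before.

    `pages` maps a site name to its already-downloaded markup.
    `seen` is a set of URLs collected in earlier cycles; it is updated in place.
    """
    new_items = []
    for name, cfg in SITES.items():
        for href in extract_links(pages[name], cfg["link_xpath"]):
            if href not in seen:
                seen.add(href)
                new_items.append({"site": name, "url": href})
    return new_items

# Well-formed sample page used in place of a real download.
SAMPLE = """<html><body>
<a class="title" href="/a1.html">First</a>
<a class="ad" href="/promo.html">Ad</a>
<a class="title" href="/a2.html">Second</a>
</body></html>"""

seen_urls = set()
first = run_cycle({"example-news": SAMPLE}, seen_urls)   # two new article links
second = run_cycle({"example-news": SAMPLE}, seen_urls)  # nothing new on the next cycle
print(json.dumps(first, indent=2))
```

Keeping the XPath in data rather than in code is what lets a new site be added by editing configuration only; the set of seen URLs is one simple way to avoid re-collecting the same article on every cycle.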

Quick Start & Requirements

Install: Clone the repository, set DOCBAO_BASE_DIR in SETTINGS.env, then run bash install.sh.
Prerequisites: Ubuntu (16.04+), Windows 10 with WSL, or Raspberry Pi. Requires admin privileges for installation.
Setup: Detailed installation steps are provided for Ubuntu and Raspberry Pi.
Docs: https://github.com/hailoc12/docbao

Highlighted Details

No-code configuration for adding/removing crawl sources via a dedicated tool.
Unified configuration language and algorithm for crawling diverse sites, including those using Ajax or requiring login.
Integrated anti-blocking techniques and solutions for common crawling issues, developed over 3 years of use on hundreds of Vietnamese sites.
Structured data output down to paragraph level, preserving original article formatting.
Multiple data export options: PostgreSQL, Elasticsearch, RabbitMQ, API Server, JSON, and a built-in frontend.
Includes keyword analysis and trend detection using the underthesea library.
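To make the paragraph-level output and JSON export mentioned in the list above more concrete, here is a minimal sketch. The record layout (url, title, content with one entry per paragraph) and the to_record helper are assumptions made for illustration, not docbao's actual schema. The only library behavior relied on is json.dumps with ensure_ascii=False, which keeps Vietnamese characters readable in the exported file.

```python
import json

def to_record(url, title, raw_text):
    """Split article text on blank lines and store one entry per paragraph.

    Hypothetical layout: docbao's real output structure may differ.
    """
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    return {
        "url": url,
        "title": title,
        "content": [{"type": "text", "text": p} for p in paragraphs],
    }

record = to_record(
    "https://news.example.com/a1.html",
    "Tin tức Hà Nội",
    "Đoạn thứ nhất.\n\nĐoạn thứ hai.",
)

# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes.
exported = json.dumps(record, ensure_ascii=False, indent=2)
print(exported)
```

Storing paragraphs as separate entries, rather than one text blob, lets downstream consumers index or analyze individual paragraphs without re-parsing the article.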

Maintenance & Community

The project is actively developed by Đặng Hải Lộc (hailoc12). It has received community support and contributions, leading to its open-sourcing and use in commercial projects like VnAlert. The author aims to foster an educational and open-source community around the project.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires clarification for commercial use or closed-source linking.

Limitations & Caveats

The README notes that documentation for the config manager tool and the API server is still under development. The unspecified license may also be a barrier to commercial adoption.
