kordoc  by chrisryugj

Korean document parsing and conversion toolkit

Created 2 weeks ago

679 stars

Top 49.7% on SourcePulse

Project Summary

Kordoc addresses the challenge of programmatically accessing content within Korean document formats, primarily HWP, HWPX, and PDF. It provides developers and power users with tools to parse, compare, and generate these documents, overcoming the difficulties posed by the proprietary HWP format common in South Korean institutions. The primary benefit is enabling efficient data extraction, analysis, and integration from a historically inaccessible document ecosystem.

How It Works

Kordoc employs a multi-pronged approach for document parsing. For PDFs, it utilizes pdfjs-dist and implements advanced techniques like line-based and cluster-based table detection (analyzing text alignment and graphics commands), XY-Cut reading order, and Korean word-break recovery to handle character-level rendering and cell artifacts. HWP and HWPX files are parsed using their underlying structures: OLE2/CFB for legacy HWP and ZIP/XML for HWPX. The output is standardized into an Intermediate Representation (IRBlock) format, which includes structured data like bounding boxes, styles, and page numbers, facilitating programmatic access beyond simple text extraction.
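
To make the XY-Cut reading-order step more concrete, here is a minimal sketch of the general technique. It is not taken from kordoc's source: the `Box` shape, the `widestGap` and `xyCut` names, and the `minGap` threshold are all assumptions made for illustration. The idea is to project the text boxes onto each axis, split the page along the widest empty band, and recurse into each half.

```typescript
// Hypothetical bounding box of a text run (not kordoc's IRBlock).
// Page coordinates: y grows downward.
interface Box {
  id: string;
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface Cut {
  gap: number;      // width of the empty band
  position: number; // coordinate in the middle of the band
}

// Find the widest empty band in the projection of the boxes onto one axis.
function widestGap(boxes: Box[], axis: "x" | "y"): Cut | null {
  const spans = boxes
    .map((b): [number, number] => (axis === "x" ? [b.x0, b.x1] : [b.y0, b.y1]))
    .sort((a, b) => a[0] - b[0]);
  let best: Cut | null = null;
  let covered = Number.NEGATIVE_INFINITY; // right edge of everything seen so far
  for (const [start, end] of spans) {
    const gap = start - covered;
    if (Number.isFinite(gap) && gap > 0 && (best === null || gap > best.gap)) {
      best = { gap, position: covered + gap / 2 };
    }
    covered = Math.max(covered, end);
  }
  return best;
}

// Recursive XY-cut: split along the axis with the widest gap, then recurse.
function xyCut(boxes: Box[], minGap = 2): Box[] {
  if (boxes.length <= 1) return boxes.slice();
  const xCut = widestGap(boxes, "x");
  const yCut = widestGap(boxes, "y");
  const xGap = xCut ? xCut.gap : 0;
  const yGap = yCut ? yCut.gap : 0;
  if (xCut && xGap >= minGap && xGap > yGap) {
    // Vertical cut: everything left of the band is read before the right side.
    const left = boxes.filter((b) => b.x1 <= xCut.position);
    const right = boxes.filter((b) => b.x0 >= xCut.position);
    return [...xyCut(left, minGap), ...xyCut(right, minGap)];
  }
  if (yCut && yGap >= minGap) {
    // Horizontal cut: everything above the band is read before what is below.
    const top = boxes.filter((b) => b.y1 <= yCut.position);
    const bottom = boxes.filter((b) => b.y0 >= yCut.position);
    return [...xyCut(top, minGap), ...xyCut(bottom, minGap)];
  }
  // No usable gap: fall back to top-to-bottom, then left-to-right.
  return boxes.slice().sort((a, b) => a.y0 - b.y0 || a.x0 - b.x0);
}

// A title above two columns of two lines each, given in scrambled order.
const page: Box[] = [
  { id: "R2", x0: 120, y0: 45, x1: 200, y1: 55 },
  { id: "L1", x0: 0, y0: 30, x1: 80, y1: 40 },
  { id: "title", x0: 0, y0: 0, x1: 200, y1: 10 },
  { id: "R1", x0: 120, y0: 30, x1: 200, y1: 40 },
  { id: "L2", x0: 0, y0: 45, x1: 80, y1: 55 },
];
console.log(xyCut(page).map((b) => b.id).join(" > "));
// prints "title > L1 > L2 > R1 > R2"
```

Choosing the widest gap across both axes, rather than always cutting horizontally first, keeps a two-column body from being read line by line across the columns. A production parser would likely add further heuristics on top of this.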

Quick Start & Requirements

Installation is managed via npm: npm install kordoc. PDF support requires an additional install: npm install pdfjs-dist. Basic usage involves importing and calling functions like parse or compare within a TypeScript/JavaScript environment, as demonstrated in the provided examples. The project also offers a CLI tool (npx kordoc) for direct file conversion and batch processing. Node.js is the primary runtime environment.

Highlighted Details

  • Cross-Format Document Comparison: Enables diffing between HWP and HWPX documents at the IR level.
  • Korean-Specific Parsing: Features like cluster-based table detection, Korean special table pattern recognition (key-value pairs), and word-break recovery are tailored for Korean document nuances.
  • Form Field Extraction: Automatically identifies and extracts label-value pairs from government forms within parsed documents.
  • Markdown to HWPX Conversion: Provides reverse conversion capabilities, generating HWPX files from Markdown input.
  • Pluggable OCR: Supports integration with external OCR services for processing image-based PDFs.
  • Watch Mode & MCP Server: Includes a watch mode for automated file conversion and a Model Context Protocol (MCP) server for integration with MCP-compatible tools.
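
Comparing documents at the IR level means that once both files are reduced to the same block list, their original container format no longer matters. The sketch below illustrates that idea with a longest-common-subsequence diff over blocks. It is an assumption-laden illustration, not kordoc's `compare` implementation: the `Block` shape, the `DiffOp` type, and the `diffBlocks` function are hypothetical names, and the real IRBlock carries more fields (bounding boxes, styles, page numbers).

```typescript
// Hypothetical, simplified block shape (kordoc's real IRBlock has more fields).
interface Block {
  type: string; // e.g. "heading", "paragraph", "table"
  text: string;
}

type DiffOp = { op: "same" | "added" | "removed"; block: Block };

// LCS-based diff over two block lists, ignoring which format they came from.
function diffBlocks(before: Block[], after: Block[]): DiffOp[] {
  // Two blocks match when their type and trimmed text are identical.
  const key = (b: Block): string => `${b.type}\u0000${b.text.trim()}`;
  const a = before.map(key);
  const b = after.map(key);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start, emitting one operation per block.
  const ops: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: "same", block: after[j] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: "removed", block: before[i] });
      i++;
    } else {
      ops.push({ op: "added", block: after[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ op: "removed", block: before[i++] });
  while (j < b.length) ops.push({ op: "added", block: after[j++] });
  return ops;
}

// Imagine `fromHwp` was parsed from a .hwp file and `fromHwpx` from a .hwpx file.
const fromHwp: Block[] = [
  { type: "heading", text: "Application Form" },
  { type: "paragraph", text: "Submit by March 1." },
  { type: "paragraph", text: "Attach two photos." },
];
const fromHwpx: Block[] = [
  { type: "heading", text: "Application Form" },
  { type: "paragraph", text: "Submit by April 1." },
  { type: "paragraph", text: "Attach two photos." },
];
console.log(diffBlocks(fromHwp, fromHwpx).map((d) => d.op).join(", "));
// prints "same, removed, added, same"
```

A block-level diff like this reports a changed paragraph as a removal followed by an addition. Pairing such neighbours and diffing their text would be a natural refinement.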

Maintenance & Community

The README does not provide specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

Kordoc is released under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

While robust for its target formats, the project's advanced PDF parsing and specific Korean document features may not be directly applicable or as effective for non-Korean documents. OCR functionality is pluggable and requires users to provide their own OCR service implementation. The README does not detail specific performance benchmarks or known limitations beyond the scope of its supported formats.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 8
  • Star History: 689 stars in the last 14 days