kordoc by chrisryugj

Korean document parsing and conversion toolkit

Created 3 months ago

1,382 stars

Top 28.5% on SourcePulse

Project Summary

Kordoc makes content in Korean document formats, primarily HWP, HWPX, and PDF, accessible to programs. It gives developers and power users tools to parse, compare, and generate these documents, working around the proprietary HWP format common in South Korean institutions. The main benefit is data extraction, analysis, and integration from a document ecosystem that has historically been hard to access.

How It Works

Kordoc uses a separate parsing path for each format. For PDFs, it builds on pdfjs-dist and adds line-based and cluster-based table detection (analyzing text alignment and graphics commands), XY-Cut reading order, and Korean word-break recovery to handle character-level rendering and cell artifacts. HWP and HWPX files are parsed from their underlying containers: OLE2/CFB for legacy HWP and ZIP/XML for HWPX. The output is normalized into an intermediate representation (IRBlock) that carries structured data such as bounding boxes, styles, and page numbers, so callers can work with more than plain extracted text.
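
The container distinction described above can be illustrated with a signature check on a file's leading bytes. This is a minimal sketch, not kordoc's own code: `ContainerKind` and `sniffContainer` are hypothetical names introduced here, and the byte values are the standard signatures for CFB, ZIP, and PDF. A match only identifies the container; other formats also use CFB or ZIP, so a real parser would still need to inspect the contents.

```typescript
// Hypothetical helper, not part of kordoc: classify a file by its leading bytes.
type ContainerKind = "cfb" | "zip" | "pdf" | "unknown";

const SIGNATURES: Array<{ kind: ContainerKind; bytes: number[] }> = [
  // OLE2 / Compound File Binary header (the container used by legacy HWP)
  { kind: "cfb", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] },
  // ZIP local file header "PK\x03\x04" (HWPX is a ZIP archive of XML parts)
  { kind: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] },
  // "%PDF" (this sketch only checks offset 0)
  { kind: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] },
];

function sniffContainer(data: Uint8Array): ContainerKind {
  for (const { kind, bytes } of SIGNATURES) {
    // A signature matches when every byte agrees with the start of the file.
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return kind;
    }
  }
  return "unknown";
}
```

A caller could use the result to pick a parsing path before reading the rest of the file, and fall back to an error for "unknown".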

Quick Start & Requirements

Installation is managed via npm: npm install kordoc. PDF support requires an additional install: npm install pdfjs-dist. Basic usage involves importing and calling functions like parse or compare within a TypeScript/JavaScript environment, as demonstrated in the provided examples. The project also offers a CLI tool (npx kordoc) for direct file conversion and batch processing. Node.js is the primary runtime environment.

Highlighted Details

Cross-Format Document Comparison: Enables diffing between HWP and HWPX documents at the IR level.
Korean-Specific Parsing: Features like cluster-based table detection, Korean special table pattern recognition (key-value pairs), and word-break recovery are tailored for Korean document nuances.
Form Field Extraction: Automatically identifies and extracts label-value pairs from government forms within parsed documents.
Markdown to HWPX Conversion: Provides reverse conversion capabilities, generating HWPX files from Markdown input.
Pluggable OCR: Supports integration with external OCR services for processing image-based PDFs.
Watch Mode & MCP Server: Includes utilities for automated file conversion and a Model Context Protocol (MCP) server for integration with MCP-compatible tools.
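
To make the idea of comparing documents at the IR level concrete, here is a minimal sketch of a block-level diff. It assumes a simplified, hypothetical `Block` shape (only `type` and `text`) rather than kordoc's actual IRBlock, and it uses a plain longest-common-subsequence diff, which may differ from what the library does. Because the comparison runs on normalized blocks, the two inputs could come from different source formats.

```typescript
// Hypothetical, simplified block shape; the real IRBlock carries more fields.
interface Block {
  type: string;
  text: string;
}

type DiffOp = { op: "equal" | "removed" | "added"; block: Block };

// Two blocks match when their type and whitespace-normalized text agree.
const keyOf = (b: Block): string =>
  `${b.type}\u0000${b.text.replace(/\s+/g, " ").trim()}`;

function diffBlocks(before: Block[], after: Block[]): DiffOp[] {
  const a = before.map(keyOf);
  const b = after.map(keyOf);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both sequences, emitting matches and the cheaper skip otherwise.
  const ops: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: "equal", block: after[j] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: "removed", block: before[i] });
      i++;
    } else {
      ops.push({ op: "added", block: after[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ op: "removed", block: before[i++] });
  while (j < b.length) ops.push({ op: "added", block: after[j++] });
  return ops;
}
```

Matching on a normalized key rather than raw text keeps whitespace differences between formats from showing up as changes; a fuller implementation might also compare table cells or styles.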

Maintenance & Community

The README does not provide specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

Kordoc is released under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

While robust for its target formats, the project's advanced PDF parsing and specific Korean document features may not be directly applicable or as effective for non-Korean documents. OCR functionality is pluggable and requires users to provide their own OCR service implementation. The README does not detail specific performance benchmarks or known limitations beyond the scope of its supported formats.
