chrisryugj
Korean document parsing and conversion toolkit
Kordoc addresses the challenge of programmatically accessing content within Korean document formats, primarily HWP, HWPX, and PDF. It provides developers and power users with tools to parse, compare, and generate these documents, overcoming the difficulties posed by the proprietary HWP format common in South Korean institutions. The primary benefit is enabling efficient data extraction, analysis, and integration from a historically inaccessible document ecosystem.
How It Works
Kordoc parses each format through a different path. For PDFs, it uses pdfjs-dist and adds line-based and cluster-based table detection (analyzing text alignment and graphics commands), XY-Cut reading order, and Korean word-break recovery to cope with character-level rendering and cell artifacts. HWP and HWPX files are parsed from their underlying containers: OLE2/CFB for legacy HWP and ZIP/XML for HWPX. The output is normalized into an Intermediate Representation (IRBlock) format that carries structured data such as bounding boxes, styles, and page numbers, so callers get more than plain text extraction.
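Because the three formats sit in different containers, a parser can often choose a path by looking at the first bytes of the file. The sketch below is not taken from Kordoc; it is a minimal illustration that relies only on the well-known signatures for OLE2/CFB, ZIP, and PDF files. The names detectContainer and ContainerKind are hypothetical.

```typescript
// Illustrative only: these helpers are NOT part of kordoc's API.
// Well-known leading signatures ("magic numbers") for each container.
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]; // OLE2/CFB
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46]; // "%PDF"

type ContainerKind = "cfb" | "zip" | "pdf" | "unknown";

// True when `bytes` begins with every byte of `magic`, in order.
function hasPrefix(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false;
  return magic.every((value, index) => bytes[index] === value);
}

// Classify a file by its container signature.
// "cfb" is only a candidate for legacy HWP and "zip" only a candidate
// for HWPX: other formats share these containers, so a real parser
// would still need to inspect the contents before committing.
function detectContainer(bytes: Uint8Array): ContainerKind {
  if (hasPrefix(bytes, CFB_MAGIC)) return "cfb";
  if (hasPrefix(bytes, ZIP_MAGIC)) return "zip";
  if (hasPrefix(bytes, PDF_MAGIC)) return "pdf";
  return "unknown";
}
```

Sniffing the signature rather than trusting the file extension guards against misnamed files, which is a common reason to do this check first.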
Quick Start & Requirements
Installation is managed via npm: npm install kordoc. PDF support requires an additional install: npm install pdfjs-dist. Basic usage involves importing and calling functions like parse or compare within a TypeScript/JavaScript environment, as demonstrated in the provided examples. The project also offers a CLI tool (npx kordoc) for direct file conversion and batch processing. Node.js is the primary runtime environment.
Maintenance & Community
The README does not provide specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.
Licensing & Compatibility
Kordoc is released under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.
Limitations & Caveats
The advanced PDF parsing and Korean-specific features are tuned for Korean documents and may be less effective, or not apply at all, on non-Korean documents. OCR is pluggable: users must supply their own OCR service implementation. The README gives no performance benchmarks and lists no known limitations beyond the scope of its supported formats.
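To make the pluggable OCR point concrete, here is a rough sketch of what supplying your own OCR backend could look like. Every name in it (OcrProvider, PageInput, extractText) is hypothetical and chosen for illustration; the README summary does not describe Kordoc's actual OCR hook, so treat this as one possible design, not the library's interface.

```typescript
// Hypothetical shapes for illustration; NOT kordoc's real API.

// A user-supplied OCR backend: takes a page image, returns its text.
interface OcrProvider {
  recognize(image: Uint8Array, pageNumber: number): Promise<string>;
}

// One page as a parser might see it: any embedded text layer plus
// a rasterized image that can be handed to OCR as a fallback.
interface PageInput {
  pageNumber: number;
  embeddedText: string;
  image: Uint8Array;
}

// Prefer the embedded text layer; call the OCR provider only for
// pages with no usable text. Without a provider, such pages yield "".
async function extractText(
  pages: PageInput[],
  ocr?: OcrProvider
): Promise<string[]> {
  const results: string[] = [];
  for (const page of pages) {
    const text = page.embeddedText.trim();
    if (text.length > 0 || !ocr) {
      results.push(text);
      continue;
    }
    results.push(await ocr.recognize(page.image, page.pageNumber));
  }
  return results;
}
```

Keeping the provider behind a small interface like this lets callers swap in a local engine or a hosted service without touching the parsing code.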