notoriouslab: Structured Markdown generation from diverse documents
This tool converts diverse document formats (PDF, DOCX, XLSX, PPTX, DXF, TXT) into clean, structured Markdown, with a strong emphasis on Traditional Chinese language support, accurate table preservation, and user privacy. It is designed for users who need to process sensitive financial documents or integrate document conversion into AI agent workflows, and it can run fully offline.
How It Works
The project picks a conversion path per format. PDFs are routed by content type (native text, broken formatting, scanned images), and high-quality extraction goes through the optional opendataloader-pdf, which generates Markdown tables directly. For other formats such as DOCX and XLSX, it uses libraries like python-docx and pandas to preserve tabular data as Markdown pipe tables. Users can choose pure extraction (--ai none) or use cloud (Gemini, Groq) or local (Ollama) AI models for additional structuring and content understanding. Ad cleaning and atomic writes help keep the output clean and reliable.
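As a rough illustration of two ideas mentioned above, pipe-table output and atomic writes, here is a minimal sketch using only the Python standard library. It is not the project's code: the function names escape_cell, to_pipe_table, and atomic_write are hypothetical, and the real tool is described as relying on python-docx and pandas rather than hand-built rows.

```python
import os
import tempfile
from typing import Sequence


def escape_cell(value: object) -> str:
    """Make a value safe for a pipe table: escape pipes, flatten newlines."""
    text = "" if value is None else str(value)
    return text.replace("|", "\\|").replace("\n", " ").strip()


def to_pipe_table(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Render a header and rows as a Markdown pipe table (hypothetical helper)."""
    lines = [
        "| " + " | ".join(escape_cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad or trim so every row has as many cells as the header.
        cells = list(row)[: len(header)] + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(escape_cell(c) for c in cells) + " |")
    return "\n".join(lines) + "\n"


def atomic_write(path: str, content: str) -> None:
    """Write to a temp file in the target directory, then rename it into place.

    os.replace swaps the file in a single step on the same filesystem, so a
    reader never sees a half-written output file (hypothetical helper).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For example, atomic_write("report.md", to_pipe_table(["項目", "金額"], [["現金", 100]])) would leave either the previous report.md or the complete new one on disk, never a partial file.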
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt.
Prerequisites include Java 11+ (for opendataloader-pdf), poppler (for PDF visual mode), and specific AI backend SDKs.
Copy config.example.json to config.json and .env.example to .env to set AI models, API keys, and other parameters.
Run python cleaner.py --input <file_or_directory>
Highlighted Details
The --ai none option provides pure, offline extraction without any AI.
Optional opendataloader-pdf for superior table extraction directly into Markdown pipe tables.
Ad cleaning by truncating content (ad_truncation_patterns) or stripping intermediate paragraphs (ad_strip_patterns), with safety mechanisms to prevent accidental data loss.
Maintenance & Community
This project is part of the notoriouslab open-source toolset, suggesting a broader ecosystem. Contribution guidelines are provided in CONTRIBUTING.md, encouraging users to add ad regex patterns or prompt templates. Specific community links (e.g., Discord, Slack) or details on core maintainers are not explicitly detailed in the provided README.
Licensing & Compatibility
The project is released under the MIT License, which permits commercial use and modification, making it highly compatible with closed-source applications and integration projects.
Limitations & Caveats
High-quality PDF extraction (opendataloader-pdf) and PDF visual mode require external system dependencies (Java 11+ and poppler, respectively). Older DOC and PPT formats are only supported for text extraction on macOS via textutil. Local AI models such as Ollama may be slow on systems with limited resources (e.g., 8 GB RAM); cloud AI or --ai none is recommended in such cases. AI output must follow a specific JSON format, and custom prompts must adhere to this structure.
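To make the macOS-only caveat concrete, the sketch below shows one way a textutil fallback for a legacy .doc file could be gated on the platform. This is an assumption about how such a check might look, not the project's implementation; the helper names textutil_command, can_use_textutil, and extract_legacy_text are hypothetical. It relies only on textutil's documented -convert txt and -stdout options.

```python
import shutil
import subprocess
import sys
from typing import List, Optional


def textutil_command(path: str) -> List[str]:
    """Build the textutil invocation that prints a document as plain text."""
    return ["textutil", "-convert", "txt", "-stdout", path]


def can_use_textutil(platform: str = sys.platform) -> bool:
    """textutil ships with macOS, so require darwin and the binary on PATH."""
    return platform == "darwin" and shutil.which("textutil") is not None


def extract_legacy_text(path: str) -> Optional[str]:
    """Return plain text for a legacy file, or None when textutil is unavailable."""
    if not can_use_textutil():
        return None  # Caller can skip the file or report it as unsupported.
    result = subprocess.run(
        textutil_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Returning None instead of raising lets a batch run continue past unsupported files on Linux or Windows.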