doc-cleaner by notoriouslab

Structured Markdown generation from diverse documents

Created 4 months ago

295 stars

Top 89.5% on SourcePulse

Project Summary

doc-cleaner converts documents in diverse formats (PDF, DOCX, XLSX, PPTX, DXF, TXT) into clean, structured Markdown, with an emphasis on Traditional Chinese language support, accurate table preservation, and user privacy. It is designed for users who need to process sensitive financial documents or integrate document conversion into AI agent workflows, and it can run fully offline.

How It Works

The converter routes each PDF according to its content type (native text, broken formatting, or scanned images). When the optional opendataloader-pdf is installed, it is preferred for high-quality extraction because it generates Markdown tables directly. For other formats such as DOCX and XLSX, the tool uses libraries such as python-docx and pandas to preserve tabular data as Markdown pipe tables. Users can choose pure extraction (--ai none) or use cloud (Gemini, Groq) or local (Ollama) AI models for additional structuring and content understanding. Ad cleaning strips boilerplate from the result, and atomic writes prevent half-written output files.
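The table-preservation and atomic-write steps described above can be illustrated with a minimal sketch. It is not taken from the doc-cleaner source: the function names `to_pipe_table` and `atomic_write` are hypothetical, and the sketch uses only the standard library rather than python-docx or pandas.

```python
import os
import tempfile


def to_pipe_table(header, rows):
    """Render a header and rows as a Markdown pipe table (hypothetical helper)."""
    def cell(value):
        # Escape pipes and flatten newlines so a cell cannot break its row.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad short rows and trim long ones so every line has the same width.
        padded = (list(row) + [""] * len(header))[:len(header)]
        lines.append("| " + " | ".join(cell(v) for v in padded) + " |")
    return "\n".join(lines) + "\n"


def atomic_write(path, text):
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temporary file in the same directory as the target keeps the final rename on one filesystem, so a reader sees either the old file or the complete new one.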

Quick Start & Requirements

Primary install: Clone the repository and install core dependencies using pip install -r requirements.txt.
Non-default prerequisites: Python 3.9+ is required. Optional dependencies include Java 11+ (for opendataloader-pdf), poppler (for PDF visual mode), and specific AI backend SDKs.
Configuration: Copy config.example.json to config.json and .env.example to .env to set AI models, API keys, and other parameters.
Run command: python cleaner.py --input <file_or_directory>
Links: Documentation and examples are provided in the repository itself.

Highlighted Details

Privacy-First AI: Supports local AI inference via Ollama, ensuring sensitive documents never leave the user's computer. A --ai none option provides pure, offline extraction without any AI.
Intelligent PDF Processing: Differentiates between native text, broken formatting, and scanned PDFs, with optional opendataloader-pdf for superior table extraction directly into Markdown pipe tables.
Robust Table Handling: Explicitly designed to preserve tables from DOCX and XLSX formats as Markdown pipe tables, and guides AI models to maintain existing table structures.
Customizable Ad Cleaning: Features configurable regex patterns for removing unwanted boilerplate text from document tails (ad_truncation_patterns) or stripping intermediate paragraphs (ad_strip_patterns), with safety mechanisms to prevent accidental data loss.
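As an illustration of the two ad-cleaning modes, here is a minimal sketch. The parameter names mirror the ad_truncation_patterns and ad_strip_patterns config keys, but the function `clean_ads`, the `max_cut_ratio` guard, and its default value are assumptions standing in for the unspecified safety mechanisms, not the project's actual implementation.

```python
import re


def clean_ads(text, ad_truncation_patterns, ad_strip_patterns, max_cut_ratio=0.3):
    """Strip matching paragraphs, then truncate a matching tail (hypothetical sketch)."""
    # Strip mode: drop any paragraph that matches a strip pattern.
    paragraphs = text.split("\n\n")
    kept = [
        p for p in paragraphs
        if not any(re.search(pattern, p) for pattern in ad_strip_patterns)
    ]
    cleaned = "\n\n".join(kept)

    # Truncation mode: find the earliest match of any truncation pattern.
    cut_at = None
    for pattern in ad_truncation_patterns:
        match = re.search(pattern, cleaned)
        if match and (cut_at is None or match.start() < cut_at):
            cut_at = match.start()

    if cut_at is not None and cleaned:
        # Assumed safety guard: skip the cut if it would remove too much text.
        removed_ratio = (len(cleaned) - cut_at) / len(cleaned)
        if removed_ratio <= max_cut_ratio:
            cleaned = cleaned[:cut_at].rstrip()
    return cleaned
```

The ratio guard is one simple way to keep an overly broad pattern from deleting most of a document; a pattern that matches near the top of the file is ignored rather than applied.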

Maintenance & Community

This project is part of the notoriouslab open-source toolset, suggesting a broader ecosystem. Contribution guidelines are provided in CONTRIBUTING.md, which encourages users to add ad regex patterns or prompt templates. The README does not list community channels (e.g., Discord, Slack) or name core maintainers.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and modification, making it highly compatible with closed-source applications and integration projects.

Limitations & Caveats

High-quality PDF extraction (opendataloader-pdf) and PDF visual mode require external system dependencies (Java 11+ and poppler, respectively). Older DOC and PPT formats are supported for text extraction only, and only on macOS via textutil. Local AI models run through Ollama may be slow on systems with limited resources (e.g., 8GB RAM); in such cases, cloud AI or --ai none is recommended. AI output must follow a specific JSON format, and custom prompts must adhere to this structure.
