doc-cleaner  by notoriouslab

Structured Markdown generation from diverse documents

Created 2 months ago
254 stars

Top 99.0% on SourcePulse

Project Summary

doc-cleaner converts documents in several formats (PDF, DOCX, XLSX, PPTX, DXF, TXT) into clean, structured Markdown, with an emphasis on Traditional Chinese language support, accurate table preservation, and user privacy. It targets users who process sensitive financial documents or who integrate document conversion into AI agent workflows, and it offers a privacy-first solution that can run offline.

How It Works

PDFs are routed by content type: native text, broken formatting, or scanned images. When the optional opendataloader-pdf dependency is available, it is preferred for extraction because it generates Markdown tables directly. For other formats such as DOCX and XLSX, the tool uses python-docx and pandas to preserve tabular data as Markdown pipe tables. Users can choose pure extraction (--ai none) or use cloud (Gemini, Groq) or local (Ollama) AI models for additional structuring and content understanding. Ad cleaning and atomic writes help keep the output clean and reliable.
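The table-preservation and atomic-write ideas above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `rows_to_pipe_table` and `atomic_write_text` are hypothetical, and the sketch works on plain Python lists rather than on python-docx or pandas objects.

```python
import os
import tempfile


def rows_to_pipe_table(rows):
    """Render a list of rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape literal pipes and flatten newlines so a cell cannot break the table.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


def atomic_write_text(path, text):
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Writing the temporary file next to the target keeps the final rename on one filesystem, so a reader sees either the old file or the complete new one, never a partial write.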

Quick Start & Requirements

  • Primary install: Clone the repository and install core dependencies using pip install -r requirements.txt.
  • Non-default prerequisites: Python 3.9+ is required. Optional dependencies include Java 11+ (for opendataloader-pdf), poppler (for PDF visual mode), and specific AI backend SDKs.
  • Configuration: Copy config.example.json to config.json and .env.example to .env to set AI models, API keys, and other parameters.
  • Run command: python cleaner.py --input <file_or_directory>
  • Links: Documentation and examples are available in the repository.
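The configuration and run steps above could be scripted roughly as follows. This is a sketch, not part of the project: the helper names `bootstrap_config` and `build_command` are hypothetical, and it assumes that `--input` and `--ai` can be combined on one command line and that the command runs from the repository root.

```python
import shutil
import sys
from pathlib import Path


def bootstrap_config(repo_dir):
    """Copy the example config files into place without overwriting existing ones."""
    repo = Path(repo_dir)
    created = []
    for example, target in (("config.example.json", "config.json"),
                            (".env.example", ".env")):
        src, dst = repo / example, repo / target
        if src.exists() and not dst.exists():
            shutil.copyfile(src, dst)
            created.append(target)
    return created


def build_command(input_path, ai=None):
    """Build the cleaner.py invocation; `ai` maps to the --ai flag when given."""
    command = [sys.executable, "cleaner.py", "--input", str(input_path)]
    if ai is not None:
        command += ["--ai", ai]
    return command
```

For example, `subprocess.run(build_command("statements/", ai="none"), check=True)` would run a pure-extraction pass over a hypothetical `statements/` directory.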

Highlighted Details

  • Privacy-First AI: Supports local AI inference via Ollama, ensuring sensitive documents never leave the user's computer. A --ai none option provides pure, offline extraction without any AI.
  • Intelligent PDF Processing: Differentiates between native text, broken formatting, and scanned PDFs, with optional opendataloader-pdf for superior table extraction directly into Markdown pipe tables.
  • Robust Table Handling: Explicitly designed to preserve tables from DOCX and XLSX formats as Markdown pipe tables, and guides AI models to maintain existing table structures.
  • Customizable Ad Cleaning: Features configurable regex patterns for removing unwanted boilerplate text from document tails (ad_truncation_patterns) or stripping intermediate paragraphs (ad_strip_patterns), with safety mechanisms to prevent accidental data loss.
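The two ad-cleaning modes described above can be illustrated with a short sketch. Only the config key names `ad_truncation_patterns` and `ad_strip_patterns` come from the project; the function name, the blank-line paragraph splitting, and the `max_cut_ratio` safety threshold are assumptions made for illustration.

```python
import re


def clean_ads(text, truncation_patterns=(), strip_patterns=(), max_cut_ratio=0.5):
    """Remove tail boilerplate and ad paragraphs from extracted text.

    max_cut_ratio is a hypothetical safety limit: a truncation that would
    discard more than this fraction of the text is skipped.
    """
    # Tail truncation: cut from the earliest pattern match to the end.
    cut_at = len(text)
    for pattern in truncation_patterns:
        match = re.search(pattern, text)
        if match:
            cut_at = min(cut_at, match.start())
    if text and (len(text) - cut_at) / len(text) <= max_cut_ratio:
        text = text[:cut_at].rstrip()

    # Paragraph stripping: drop any paragraph that matches a strip pattern.
    paragraphs = re.split(r"\n\s*\n", text)
    kept = [p for p in paragraphs
            if not any(re.search(pattern, p) for pattern in strip_patterns)]
    return "\n\n".join(kept)
```

The ratio check is one possible safeguard against an over-eager pattern that matches near the top of a document and would otherwise delete most of its content.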

Maintenance & Community

This project is part of the notoriouslab open-source toolset, suggesting a broader ecosystem. Contribution guidelines are provided in CONTRIBUTING.md, encouraging users to add ad regex patterns or prompt templates. Specific community links (e.g., Discord, Slack) or details on core maintainers are not explicitly detailed in the provided README.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and modification, making it highly compatible with closed-source applications and integration projects.

Limitations & Caveats

High-quality PDF extraction (opendataloader-pdf) and PDF visual mode require external system dependencies (Java 11+ and poppler, respectively). Older DOC and PPT formats are supported for text extraction only, and only on macOS via textutil. Local AI models run through Ollama may be slow on systems with limited resources (e.g., 8GB RAM); cloud AI or --ai none is recommended in such cases. AI output must follow a specific JSON format, and custom prompts must adhere to this structure.
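Because AI output must arrive as JSON in a fixed shape, a defensive parser can help when testing custom prompts. The sketch below is an illustration under stated assumptions: the summary above does not list the schema, so the required keys are supplied by the caller, and `parse_model_reply` is a hypothetical helper rather than a project function.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def parse_model_reply(reply, required_keys):
    """Extract a JSON object from a model reply and check that required keys exist."""
    # Unwrap a fenced block if present; otherwise treat the whole reply as JSON.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    payload = fenced.group(1) if fenced else reply
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    return data
```

For instance, `parse_model_reply(reply, ["title", "markdown"])` would reject a reply lacking either key; both key names are placeholders, not the project's actual schema.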

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
