kreuzberg by Goldziher

Document intelligence framework for Python

Created 10 months ago
2,550 stars

Top 18.2% on SourcePulse

Project Summary

Kreuzberg is a Python framework for document intelligence that extracts text, metadata, and structured data from many document formats. It targets developers and researchers who need a single, high-performance library for document processing with an extensible API.

How It Works

Kreuzberg unifies document processing by building on established open-source libraries: Pandoc for format conversion, PDFium for PDF rendering, and Tesseract for OCR. Relying on these tools is intended to give broad format support and accurate extraction. A plugin architecture allows custom extractors, and both synchronous and asynchronous APIs are provided, so the library fits blocking scripts as well as async applications.
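
The combination of a plugin registry with paired synchronous and asynchronous entry points can be sketched roughly as follows. This is a minimal illustration of the general pattern, not Kreuzberg's actual API: the names `register_extractor`, `extract_bytes_sync`, and `extract_bytes` are hypothetical, and only the Python standard library is used.

```python
import asyncio
from typing import Callable, Dict

# Hypothetical registry mapping a file suffix to an extractor function.
# None of these names come from Kreuzberg; they only illustrate the pattern.
_EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}


def register_extractor(suffix: str) -> Callable[[Callable[[bytes], str]], Callable[[bytes], str]]:
    """Return a decorator that registers an extractor for a file suffix."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        _EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register_extractor(".txt")
def extract_plain_text(data: bytes) -> str:
    # A trivial "plugin": decode bytes as UTF-8, replacing invalid sequences.
    return data.decode("utf-8", errors="replace")


def extract_bytes_sync(data: bytes, suffix: str) -> str:
    """Synchronous entry point: look up the extractor and call it directly."""
    extractor = _EXTRACTORS.get(suffix.lower())
    if extractor is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return extractor(data)


async def extract_bytes(data: bytes, suffix: str) -> str:
    """Asynchronous entry point: run the blocking extractor in a worker thread."""
    return await asyncio.to_thread(extract_bytes_sync, data, suffix)
```

In this sketch the async function simply delegates to the sync one in a worker thread, which keeps a single code path per format. A real framework may instead implement each path natively.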

Quick Start & Requirements

Highlighted Details

  • Supports 18 document types including PDF, Office documents, images, and HTML.
  • Offers OCR integration with Tesseract, EasyOCR, and PaddleOCR, plus table extraction via GMFT.
  • Claims the highest throughput among Python document processing frameworks (30+ docs/second) with a low memory footprint (~360MB runtime).
  • Provides a plugin architecture for custom extractors and includes a REST API via Docker.

Maintenance & Community

The project is maintained by Goldziher. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README does not detail specific limitations, unsupported features, or known issues. The performance benchmarks are presented without explicit methodology details, though a link to "detailed analysis" is provided.
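
For readers who want to check a docs/second figure on their own corpus, a throughput measurement can be sketched as below. This is an assumption about how such a number might be computed (documents processed divided by wall-clock time), not the project's benchmark methodology. `measure_throughput` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable


def measure_throughput(process: Callable[[bytes], object], documents: Iterable[bytes]) -> float:
    """Return documents processed per second of wall-clock time."""
    docs = list(documents)
    if not docs:
        raise ValueError("need at least one document")
    start = time.perf_counter()
    for doc in docs:
        process(doc)  # the extraction call under test goes here
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast runs or coarse timers.
    return len(docs) / elapsed if elapsed > 0 else float("inf")
```

A single pass like this ignores warm-up, caching, and document mix, so results are only comparable across frameworks when those conditions are held equal.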

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 22
  • Issues (30d): 10
  • Star history: 77 stars in the last 30 days
