kreuzberg  by Goldziher

Document intelligence framework for Python

Created 8 months ago
2,430 stars

Top 18.9% on SourcePulse

Project Summary

Kreuzberg is a Python framework for document intelligence that extracts text, metadata, and structured data from a wide range of document formats. It targets developers and researchers who need a single, high-performance tool for document processing, exposed through an extensible API.

How It Works

Kreuzberg unifies document processing by building on established open-source libraries: Pandoc for format conversion, PDFium for PDF rendering, and Tesseract for OCR. Relying on these mature tools is intended to give it broad format support and accurate extraction. It has a plugin architecture for custom extractors and provides both synchronous and asynchronous APIs, so it can be used from blocking scripts and from async applications.
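To make the plugin and sync/async ideas above concrete, here is a minimal, self-contained sketch of the general pattern. It does not use Kreuzberg's actual API: the names `ExtractionResult`, `register_extractor`, `extract_sync`, and `extract_async` are hypothetical and chosen only for illustration. Consult the project's README for the real entry points.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict


@dataclass
class ExtractionResult:
    """Hypothetical result container: extracted text plus metadata."""
    content: str
    mime_type: str
    metadata: dict = field(default_factory=dict)


# Hypothetical plugin registry: maps a file suffix to an extractor function.
_EXTRACTORS: Dict[str, Callable[[bytes], ExtractionResult]] = {}


def register_extractor(suffix: str):
    """Decorator that registers an extractor for one file suffix."""
    def decorator(func: Callable[[bytes], ExtractionResult]):
        _EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register_extractor(".txt")
def extract_plain_text(data: bytes) -> ExtractionResult:
    """Example plugin: decode plain text and record its length."""
    text = data.decode("utf-8", errors="replace")
    return ExtractionResult(
        content=text,
        mime_type="text/plain",
        metadata={"chars": len(text)},
    )


def extract_sync(path: Path) -> ExtractionResult:
    """Synchronous entry point: pick an extractor by file suffix."""
    extractor = _EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"no extractor registered for {path.suffix!r}")
    return extractor(path.read_bytes())


async def extract_async(path: Path) -> ExtractionResult:
    """Asynchronous entry point: run the blocking work in a worker thread."""
    return await asyncio.to_thread(extract_sync, path)
```

In this sketch the async function simply wraps the sync one with `asyncio.to_thread` (Python 3.9+), which is one common way to offer both styles from a single implementation. Whether Kreuzberg does it this way is not stated in the summary.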

Quick Start & Requirements

Highlighted Details

  • Supports 18 document types including PDF, Office documents, images, and HTML.
  • Offers OCR integration with Tesseract, EasyOCR, and PaddleOCR, plus table extraction via GMFT.
  • Claims the highest throughput among Python document processing frameworks (30+ docs/second) with a low memory footprint (~360MB runtime).
  • Provides a plugin architecture for custom extractors and includes a REST API via Docker.
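
The throughput figure above is given in documents per second; at 30 docs/second, the average budget per document is roughly 33 ms. As a rough sketch of how such a number could be checked on your own corpus, the helper below times an arbitrary processing callable. The function name `measure_throughput` and its parameters are hypothetical, and this is not the methodology behind the project's own benchmarks.

```python
import time
from typing import Callable, Iterable


def measure_throughput(
    process: Callable[[bytes], object],
    documents: Iterable[bytes],
) -> float:
    """Return documents processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for doc in documents:
        process(doc)  # the extraction call under test
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against an empty corpus or a zero-length interval.
    if count == 0 or elapsed <= 0:
        return 0.0
    return count / elapsed
```

Pass in any callable that accepts raw document bytes, for example a thin wrapper around whichever extraction function you are evaluating. Results depend heavily on document mix, hardware, and whether OCR is involved.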

Maintenance & Community

The project is maintained by Goldziher. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README does not detail specific limitations, unsupported features, or known issues. The performance benchmarks are presented without an explicit methodology, though the README links to a "detailed analysis".

Health Check

  • Last commit: 3 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 27
  • Issues (30d): 8

Star History

90 stars in the last 30 days
