kreuzberg by kreuzberg-dev

Document intelligence framework for Python

Created 1 year ago

6,670 stars

Top 7.6% on SourcePulse

2 Experts Love This Project

Wes McKinney

Author of Pandas

Luis Capelo

Cofounder of Lightning AI

Project Summary

Kreuzberg is a Python framework for document intelligence: it extracts text, metadata, and structured data from a wide range of document formats. It targets developers and researchers who want a single, high-performance library for document processing with an extensible API.

How It Works

Kreuzberg unifies document processing by building on established open-source libraries: Pandoc for format conversion, PDFium for PDF rendering, and Tesseract for OCR. Relying on these mature tools gives it broad format support without reimplementing each parser. It has a plugin architecture for custom extractors and provides both synchronous and asynchronous APIs, so it can be called from blocking scripts as well as from async applications.
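The plugin and sync/async ideas described above can be illustrated with a minimal, standard-library-only sketch. This is not Kreuzberg's actual API: every name here (ExtractionResult, register_extractor, run_extractor_sync, run_extractor) is hypothetical and chosen only to show the general pattern of a registry of extractors keyed by MIME type, with an async wrapper around a synchronous core.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ExtractionResult:
    """Hypothetical result container; not Kreuzberg's real type."""
    content: str
    mime_type: str
    metadata: dict = field(default_factory=dict)


Extractor = Callable[[bytes], ExtractionResult]

# Hypothetical plugin registry: MIME type -> extractor function.
_REGISTRY: Dict[str, Extractor] = {}


def register_extractor(mime_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers a custom extractor for one MIME type."""
    def decorator(func: Extractor) -> Extractor:
        _REGISTRY[mime_type] = func
        return func
    return decorator


@register_extractor("text/plain")
def extract_plain_text(data: bytes) -> ExtractionResult:
    # A trivial "plugin": decode UTF-8 and record the character count.
    text = data.decode("utf-8")
    return ExtractionResult(
        content=text,
        mime_type="text/plain",
        metadata={"characters": len(text)},
    )


def run_extractor_sync(data: bytes, mime_type: str) -> ExtractionResult:
    """Synchronous entry point: look up the plugin and call it."""
    extractor = _REGISTRY.get(mime_type)
    if extractor is None:
        raise ValueError(f"no extractor registered for {mime_type!r}")
    return extractor(data)


async def run_extractor(data: bytes, mime_type: str) -> ExtractionResult:
    """Async entry point: run the blocking extractor in a worker thread."""
    return await asyncio.to_thread(run_extractor_sync, data, mime_type)
```

Running the blocking call through asyncio.to_thread (Python 3.9+) keeps the event loop responsive while an extractor works. Whether Kreuzberg itself is structured this way is not stated in the summary.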

Quick Start & Requirements

Install: pip install kreuzberg, or pip install "kreuzberg[all]" for full features (the quotes stop shells such as zsh from interpreting the square brackets).
Prerequisites: No specific hardware or OS requirements are listed beyond standard Python environments. Docker is available for deployment.
Links: Installation Guide, CLI Documentation, API Reference, Docker Guide.

Highlighted Details

Supports 18 document types including PDF, Office documents, images, and HTML.
Offers OCR integration with Tesseract, EasyOCR, and PaddleOCR, plus table extraction via GMFT.
Claims the highest throughput among Python document processing frameworks (30+ documents per second) with a low memory footprint (about 360 MB at runtime).
Provides a plugin architecture for custom extractors and includes a REST API via Docker.
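To check a throughput figure like the one quoted in the list above on your own documents, a small timing harness is enough. The sketch below is a generic measurement helper, not the project's benchmark suite: docs_per_second and its process argument are hypothetical names, and process stands in for whatever extraction call you want to time.

```python
import time
from typing import Callable, Sequence


def docs_per_second(
    process: Callable[[bytes], object],
    documents: Sequence[bytes],
    warmup: int = 1,
) -> float:
    """Time `process` over `documents` and return documents per second.

    `process` is a placeholder for any extraction callable; the first
    `warmup` documents are processed once untimed so that one-off costs
    (imports, caches, model loading) do not distort the measurement.
    """
    for doc in documents[:warmup]:
        process(doc)

    start = time.perf_counter()
    for doc in documents:
        process(doc)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very fast callables.
    return len(documents) / elapsed if elapsed > 0 else float("inf")
```

This measures wall-clock throughput only. A memory figure would need a process-level measurement such as resident set size, because much of the work in a pipeline built on PDFium or Tesseract happens in native code that Python-level allocation tracking does not see.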

Maintenance & Community

The project is maintained by Goldziher. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README does not detail specific limitations, unsupported features, or known issues. The performance benchmarks are presented without explicit methodology details, though a link to "detailed analysis" is provided.

Health Check

Last Commit

20 hours ago

Responsiveness

Inactive

Star History

914 stars in the last 30 days