kreuzberg  by Goldziher

Document intelligence framework for Python

created 6 months ago
2,257 stars

Top 20.0% on SourcePulse

Project Summary

Kreuzberg is a Python framework for document intelligence that extracts text, metadata, and structured data from a wide range of document formats. It targets developers and researchers who need a single, high-performance tool for document processing, exposed through an extensible API.

How It Works

Kreuzberg puts established open-source libraries behind one interface: Pandoc for format conversion, PDFium for PDF rendering, and Tesseract for OCR. Relying on these mature tools is what gives it broad format support and accurate extraction. A plugin architecture allows custom extractors, and both synchronous and asynchronous APIs are provided, so it fits blocking scripts as well as async applications.
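
To make the plugin idea and the paired synchronous and asynchronous entry points concrete, here is a minimal sketch of the general pattern. It does not use Kreuzberg's actual API: the names `register_extractor`, `extract_sync`, and `extract_async` are hypothetical and chosen only for illustration, and the registry keyed by MIME type is an assumption about how such a design could look.

```python
import asyncio
from typing import Callable, Dict

# Hypothetical registry mapping a MIME type to an extractor function.
# This illustrates the pattern only; it is not Kreuzberg's real interface.
_EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}


def register_extractor(mime_type: str):
    """Return a decorator that registers an extractor for one MIME type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        _EXTRACTORS[mime_type] = func
        return func
    return decorator


@register_extractor("text/plain")
def extract_plain_text(data: bytes) -> str:
    """A trivial custom extractor: decode UTF-8 bytes to text."""
    return data.decode("utf-8")


def extract_sync(data: bytes, mime_type: str) -> str:
    """Synchronous entry point: look up the extractor and call it directly."""
    try:
        extractor = _EXTRACTORS[mime_type]
    except KeyError:
        raise ValueError(f"no extractor registered for {mime_type!r}") from None
    return extractor(data)


async def extract_async(data: bytes, mime_type: str) -> str:
    """Asynchronous entry point: run the blocking extractor in a worker thread."""
    return await asyncio.to_thread(extract_sync, data, mime_type)
```

With this sketch, `extract_sync(b"hello", "text/plain")` returns `"hello"`, and `asyncio.run(extract_async(b"hello", "text/plain"))` gives the same result without blocking the event loop. Sharing one registry between both entry points keeps custom extractors usable from either calling style.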

Quick Start & Requirements

Highlighted Details

  • Supports 18 document types including PDF, Office documents, images, and HTML.
  • Offers OCR integration with Tesseract, EasyOCR, and PaddleOCR, plus table extraction via GMFT.
  • Claims the highest throughput among Python document processing frameworks (30+ docs/second) with a low memory footprint (~360MB runtime).
  • Provides a plugin architecture for custom extractors and includes a REST API via Docker.
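
A throughput figure such as "30+ docs/second" depends heavily on the documents and hardware used. The sketch below shows one simple way to measure documents per second for any processing function on your own corpus. It is a generic harness under stated assumptions, not the project's benchmark methodology: `measure_throughput` and the `process` callable are hypothetical names, and the measurement is single-threaded wall-clock time.

```python
import time
from typing import Callable, Iterable


def measure_throughput(
    process: Callable[[bytes], object],
    documents: Iterable[bytes],
) -> float:
    """Return documents processed per second of wall-clock time.

    `process` is any callable that handles one document; swap in the
    extraction call you actually want to benchmark.
    """
    docs = list(documents)  # materialize first so loading is not timed
    if not docs:
        raise ValueError("need at least one document to measure throughput")

    start = time.perf_counter()
    for doc in docs:
        process(doc)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very fast runs with coarse timers.
    return len(docs) / elapsed if elapsed > 0 else float("inf")
```

For a meaningful comparison, run it several times on a representative document set and report the median rather than a single run.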

Maintenance & Community

The project is maintained by Goldziher. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README does not detail specific limitations, unsupported features, or known issues. The performance benchmarks are presented without explicit methodology details, though a link to "detailed analysis" is provided.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 11
  • Issues (30d): 5

Star History

304 stars in the last 30 days
