ocrbase by ocrbase-hq

PDF to structured data extraction API

Created 1 month ago

965 stars

Top 38.1% on SourcePulse

Project Summary

OCRBase provides a self-hostable API that converts PDF documents into structured data (Markdown or JSON) by combining OCR with LLM-based parsing. It targets developers and power users who need to process large volumes of documents, and it pairs queue-based scaling and real-time job updates with a type-safe TypeScript SDK.

How It Works

The project uses PaddleOCR-VL-0.9B to extract text from PDFs, then applies LLM-powered parsing to structure that text according to user-defined schemas. Jobs run through a queue-based processing system, and job progress is reported in real time over WebSockets. Both are exposed through a type-safe TypeScript SDK.
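The flow described above (queue a job, run OCR, structure the text, report progress at each stage) can be sketched roughly as follows. This is an illustrative, in-memory model only: `JobQueue`, `ProgressEvent`, `fakeOcr`, and `fakeStructure` are hypothetical names, not part of the ocrbase SDK, and the OCR and LLM steps are replaced by trivial stand-ins. A real deployment would presumably run these steps asynchronously and push the events over a WebSocket rather than into a callback.

```typescript
// Hypothetical sketch: none of these names come from the ocrbase SDK.
type Stage = "queued" | "ocr" | "structuring" | "done";

interface ProgressEvent {
  jobId: string;
  stage: Stage;
}

interface Job {
  id: string;
  pdfName: string;
}

interface JobResult {
  jobId: string;
  markdown: string;
  data: Record<string, string>;
}

// Stand-in for the OCR step (PaddleOCR-VL-0.9B in the real project).
function fakeOcr(job: Job): string {
  return `# ${job.pdfName}\n\nTotal: 42.00`;
}

// Stand-in for the LLM step: picks "key: value" lines whose key is a requested field.
function fakeStructure(markdown: string, fields: string[]): Record<string, string> {
  const data: Record<string, string> = {};
  for (const line of markdown.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const field = fields.find((f) => f.toLowerCase() === key);
    if (field) data[field] = line.slice(idx + 1).trim();
  }
  return data;
}

class JobQueue {
  private pending: Job[] = [];
  private fields: string[];
  private onProgress: (event: ProgressEvent) => void;

  constructor(fields: string[], onProgress: (event: ProgressEvent) => void) {
    this.fields = fields;
    this.onProgress = onProgress;
  }

  enqueue(job: Job): void {
    this.pending.push(job);
    this.onProgress({ jobId: job.id, stage: "queued" });
  }

  // Processes pending jobs in FIFO order, emitting one event per stage.
  drain(): JobResult[] {
    const results: JobResult[] = [];
    let job: Job | undefined;
    while ((job = this.pending.shift()) !== undefined) {
      this.onProgress({ jobId: job.id, stage: "ocr" });
      const markdown = fakeOcr(job);
      this.onProgress({ jobId: job.id, stage: "structuring" });
      const data = fakeStructure(markdown, this.fields);
      this.onProgress({ jobId: job.id, stage: "done" });
      results.push({ jobId: job.id, markdown, data });
    }
    return results;
  }
}

// Usage: collect progress events in an array instead of sending them to a client.
const events: ProgressEvent[] = [];
const queue = new JobQueue(["total"], (e) => events.push(e));
queue.enqueue({ id: "job-1", pdfName: "invoice.pdf" });
const results = queue.drain();
```

Keeping progress reporting behind a callback means the same queue logic could feed a WebSocket, a log, or a test harness without changes.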

Quick Start & Requirements

Primary install: bun add ocrbase
Non-default prerequisites: Docker (for self-hosting), Bun runtime.
Links: the README references SDK documentation and a Self-Hosting Guide, but the links themselves are not included here.

Highlighted Details

Uses PaddleOCR-VL-0.9B for high-accuracy OCR.
Enables structured data extraction by defining schemas and receiving JSON output.
Designed for scale with queue-based processing capable of handling thousands of documents.
Offers a type-safe TypeScript SDK, including React hooks, for easy integration.
Provides real-time WebSocket notifications for job progress.
Fully self-hostable for deployment on private infrastructure.
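To make the schema-driven extraction idea concrete, here is a minimal sketch of checking LLM-produced JSON against a user-defined schema. The `Schema` format and the `validateExtraction` helper are assumptions made for illustration; the actual SDK's schema definitions may look quite different.

```typescript
// Hypothetical schema format; the real SDK's schema API may differ.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

type ValidationResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; errors: string[] };

// Parses raw model output and checks that every schema field has the expected primitive type.
function validateExtraction(rawJson: string, schema: Schema): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }
  const record = parsed as Record<string, unknown>;
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    const actual = typeof record[field];
    if (actual !== expected) {
      errors.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  return errors.length === 0 ? { ok: true, value: record } : { ok: false, errors };
}

// Usage with an invented invoice schema.
const invoiceSchema: Schema = { vendor: "string", total: "number", paid: "boolean" };
const good = validateExtraction('{"vendor":"Acme","total":42,"paid":false}', invoiceSchema);
const bad = validateExtraction('{"vendor":"Acme","total":"42"}', invoiceSchema);
```

Here `good` passes, while `bad` is rejected because `total` is a string and `paid` is missing. Validating output this way is one plausible guard against an LLM returning fields of the wrong shape.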

Maintenance & Community

No specific details regarding contributors, sponsorships, or community channels (e.g., Discord, Slack) were present in the provided README snippet.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

Self-hosting requires familiarity with Docker and the Bun runtime. The primary SDK is TypeScript-focused, which may present a learning curve or integration challenge for teams not using that ecosystem.

