ocrbase  by majcheradam

PDF to structured data extraction API

Created 1 week ago

632 stars

Top 52.5% on SourcePulse

Project Summary

OCRBase provides a self-hostable API that converts PDF documents into structured data (Markdown or JSON) by combining OCR with LLM-based parsing. It targets developers and power users who need to process large volumes of documents, with queue-based scaling, real-time progress updates, and a type-safe TypeScript SDK for integration.

How It Works

The project uses PaddleOCR-VL-0.9B to extract text from PDFs, then applies LLM-powered parsing to structure that text according to user-defined schemas. For scale, jobs pass through a queue-based processing system, and real-time job progress is delivered over WebSockets. All of this is accessible through a type-safe TypeScript SDK.
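The flow described above (queue, OCR, schema parsing, progress events) can be pictured with a small in-memory sketch. Everything here is hypothetical and for illustration only: `JobQueue`, `onProgress`, `enqueue`, `drain`, and the status names are not taken from the ocrbase SDK, the callback list merely stands in for a WebSocket subscription, and the `ocr` and `parse` arguments are placeholders for the real OCR and LLM stages.

```typescript
// Hypothetical sketch: none of these names come from the ocrbase SDK.
type JobStatus = "queued" | "ocr" | "parsing" | "done";

interface ProgressEvent {
  jobId: number;
  status: JobStatus;
}

type Listener = (event: ProgressEvent) => void;

class JobQueue {
  private pending: { id: number; file: string }[] = [];
  private listeners: Listener[] = [];
  private nextId = 1;

  // Stand-in for a WebSocket subscription: each callback receives every status change.
  onProgress(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Add a document to the back of the queue and report it as queued.
  enqueue(file: string): number {
    const id = this.nextId++;
    this.pending.push({ id, file });
    this.emit(id, "queued");
    return id;
  }

  // Process jobs in FIFO order; `ocr` and `parse` are placeholders for the real stages.
  drain(
    ocr: (file: string) => string,
    parse: (text: string) => unknown
  ): Record<number, unknown> {
    const results: Record<number, unknown> = {};
    let job = this.pending.shift();
    while (job !== undefined) {
      this.emit(job.id, "ocr");
      const text = ocr(job.file);
      this.emit(job.id, "parsing");
      results[job.id] = parse(text);
      this.emit(job.id, "done");
      job = this.pending.shift();
    }
    return results;
  }

  private emit(jobId: number, status: JobStatus): void {
    this.listeners.forEach((listener) => listener({ jobId, status }));
  }
}
```

A caller would register a listener with `onProgress` before calling `enqueue`, so that the initial `queued` event is not missed. A production system would replace the in-memory array with a persistent queue and run the stages asynchronously.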

Quick Start & Requirements

  • Primary install: bun add ocrbase
  • Non-default prerequisites: Docker (for self-hosting), Bun runtime.
  • Links: SDK documentation and a Self-Hosting Guide are referenced, but the links themselves were not included in the provided snippet.

Highlighted Details

  • Features PaddleOCR-VL-0.9B for OCR, presented as offering best-in-class accuracy.
  • Enables structured data extraction by defining schemas and receiving JSON output.
  • Designed for scale with queue-based processing capable of handling thousands of documents.
  • Offers a type-safe TypeScript SDK, including React hooks, for easy integration.
  • Provides real-time WebSocket notifications for job progress.
  • Fully self-hostable for deployment on private infrastructure.
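To make the schema-driven extraction idea concrete, here is a minimal sketch of checking a JSON result against a user-defined schema. The `Schema` shape, the `validateAgainstSchema` helper, and the invoice field names in the usage note are assumptions made for illustration; ocrbase's actual schema format and SDK calls are not shown in the snippet and may differ.

```typescript
// Hypothetical schema shape for illustration; ocrbase's real schema format may differ.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Returns a list of problems; an empty list means the JSON matches the schema.
function validateAgainstSchema(schema: Schema, raw: string): string[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["output is not a JSON object"];
  }
  const record = data as Record<string, unknown>;
  const problems: string[] = [];
  // Check every field the schema declares; extra fields in the output are ignored.
  for (const field of Object.keys(schema)) {
    const expected = schema[field];
    if (!(field in record)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expected) {
      problems.push(`field ${field}: expected ${expected}, got ${typeof record[field]}`);
    }
  }
  return problems;
}
```

For example, with a schema of `{ invoiceNumber: "string", total: "number" }`, an output of `{"invoiceNumber":"INV-1","total":42.5}` yields no problems, while `{"invoiceNumber":7}` reports a type mismatch and a missing field. Validating model output this way is a common safeguard because LLM-generated JSON is not guaranteed to match the requested shape.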

Maintenance & Community

No specific details regarding contributors, sponsorships, or community channels (e.g., Discord, Slack) were present in the provided README snippet.

Licensing & Compatibility

The project is released under the MIT License, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Self-hosting requires familiarity with Docker and the Bun runtime. The primary SDK is TypeScript-focused, which may present a learning curve or integration challenge for teams not using that ecosystem.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 8

Star History

  • 683 stars in the last 7 days
