surya by datalab-to

Document intelligence toolkit for OCR, layout, and table recognition

Created 2 years ago

21,131 stars

Top 2.5% on SourcePulse

17 Experts Love This Project

Amin Ahmad

Cofounder of Vectara

Luis Capelo

Cofounder of Lightning AI

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Dan Guido

Cofounder of Trail of Bits

and 13 more!

Project Summary

Surya is a 650M-parameter Vision-Language Model (VLM) for document intelligence, covering OCR, layout analysis, reading order, and table recognition in 90+ languages. Aimed at researchers and power users, it combines high accuracy with fast throughput and ranks among the top performers under 3B parameters on olmOCR-bench.

How It Works

Surya uses a single VLM for layout analysis, OCR, and table recognition, plus a separate text-line detection model. Results are returned as structured output such as HTML. Inference runs on an external backend: vLLM for NVIDIA GPUs or llama.cpp for CPUs and Apple Silicon.

Quick Start & Requirements

Installation: pip install surya-ocr
Prerequisites:
- Inference Backend: vllm (NVIDIA GPU) or llama.cpp (CPU/Apple Silicon).
- NVIDIA GPU: Docker and NVIDIA Container Toolkit.
- CPU/Apple Silicon: llama-server binary from llama.cpp (e.g., brew install llama.cpp on macOS).
Links:
- llama.cpp releases: https://github.com/ggml-org/llama.cpp/releases
Interactive demo: run surya_gui after pip install streamlit pdftext.
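
The prerequisites above split by hardware, so a small pre-flight check can suggest which backend a machine is likely able to run. The following is a minimal sketch, not part of Surya: `pick_backend` is a hypothetical helper, and it assumes that an NVIDIA driver install puts `nvidia-smi` on PATH. It does not verify that the NVIDIA Container Toolkit is configured.

```python
import shutil


def pick_backend(which=shutil.which):
    """Guess which inference backend this machine can run.

    Returns "vllm", "llama.cpp", or None. `which` is injectable so the
    decision logic can be exercised without the tools being installed.
    """
    # Assumption: an NVIDIA driver install exposes `nvidia-smi` on PATH.
    has_gpu = which("nvidia-smi") is not None
    has_docker = which("docker") is not None
    has_llama_server = which("llama-server") is not None

    # The vLLM route needs both an NVIDIA GPU and Docker.
    if has_gpu and has_docker:
        return "vllm"
    # Otherwise fall back to llama.cpp if its server binary is present.
    if has_llama_server:
        return "llama.cpp"
    return None


if __name__ == "__main__":
    print(pick_backend())
```

A result of None means neither set of prerequisites was found on PATH, so one of them needs to be installed first.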

Highlighted Details

Scores 83.3% on olmOCR-bench, placing it among the top performers under 3B parameters.
Delivers 5 pages/second throughput on an RTX 5090.
Scores 87.2% on an internal 91-language benchmark.
Provides detailed layout analysis and reading order.
Supports robust table recognition (rows, columns, cells).
Outputs HTML, with math equations in <math> tags.
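
Because equations arrive inside <math> tags, they can be pulled out of the HTML with the Python standard library alone. The sketch below is illustrative: `extract_math` and `SAMPLE_HTML` are hypothetical, and the sample assumes the tag content is plain text, which may not match Surya's actual output. If the content contains child tags, only their text is kept.

```python
from html.parser import HTMLParser


class MathExtractor(HTMLParser):
    """Collect the text content of every top-level <math> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth inside <math>
        self._buf = []       # text pieces of the current equation
        self.equations = []  # finished equations, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            if self._depth == 0:
                self._buf = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "math" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.equations.append("".join(self._buf).strip())

    def handle_data(self, data):
        # Keep text only while inside a <math> element.
        if self._depth > 0:
            self._buf.append(data)


def extract_math(html):
    parser = MathExtractor()
    parser.feed(html)
    parser.close()
    return parser.equations


# Hypothetical sample, not real Surya output.
SAMPLE_HTML = "<p>Energy is <math>E = mc^2</math>, and <math>a^2 + b^2 = c^2</math>.</p>"

if __name__ == "__main__":
    print(extract_math(SAMPLE_HTML))  # ['E = mc^2', 'a^2 + b^2 = c^2']
```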

Maintenance & Community

Developed by Vikas Paruchuri and the Datalab Team. No specific community channels or roadmap links are provided. Contact hi@datalab.to for fine-tuning assistance.

Licensing & Compatibility

The code is licensed under Apache 2.0. The model weights use a modified AI Pubs Open Rail-M license, which is free for research, personal use, and startups with under $5M in revenue. Broader commercial use requires a separate license via the project's pricing page.

Limitations & Caveats

Surya is specialized for document intelligence and is not optimized for natural scene photos. Core analysis tasks require a running inference backend (vLLM or llama.cpp). Results can be sensitive to image resolution and quality, so some inputs may need preprocessing or threshold adjustments.
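
Since the core tasks fail without a running backend, it can help to probe the server before submitting documents. This is a minimal sketch under stated assumptions: `backend_is_up` is a hypothetical helper, the backend is assumed to be reachable over HTTP on the local machine, and it is assumed to answer GET /health with status 200. The base URL and port depend on how the backend was launched and are not specified in this summary.

```python
import http.client
import urllib.request

# Assumption: the backend is local, so system proxy settings are bypassed.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def backend_is_up(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and HTTP error
        # statuses (URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with wherever the backend listens.
    print(backend_is_up("http://127.0.0.1:8080"))
```

Returning a plain boolean keeps the check easy to call from a startup script; a caller that needs the failure reason would log the exception instead of swallowing it.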

Health Check

Last Commit: 4 days ago

Responsiveness: Inactive

Star History

214 stars in the last 30 days