surya by datalab-to

Document intelligence toolkit for OCR, layout, and table recognition

Created 2 years ago
20,776 stars

Top 2.6% on SourcePulse

Project Summary

Summary

Surya is a 650M-parameter Vision-Language Model (VLM) for comprehensive document intelligence, covering OCR, layout analysis, reading order, and table recognition in 90+ languages. It targets researchers and power users who need high accuracy and speed, and it ranks among the strongest models in its size class, outperforming many larger ones.

How It Works

Surya uses a single unified VLM for layout analysis, OCR, and table recognition, complemented by a separate text-line detection model. Handling these tasks in one model keeps processing efficient and yields structured output such as HTML. Deployment is flexible: inference runs on an external backend, such as vLLM on NVIDIA GPUs or llama.cpp on CPUs and Apple Silicon.
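
To make the division of labor concrete, here is a toy sketch of how a line detector and one VLM could be chained to produce HTML in reading order. It is not Surya's internal code or API: `Block`, `run_pipeline`, and the three injected callables (`detect_lines`, `analyze_layout`, `recognize`) are hypothetical names, and the exact wiring between the stages is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Block:
    """A layout block as a hypothetical pipeline might represent it."""
    bbox: BBox
    label: str       # illustrative labels, e.g. "Text", "Table", "Equation"
    order: int       # reading-order index from layout analysis
    html: str = ""   # filled in by the recognition step


def run_pipeline(
    image: object,
    detect_lines: Callable[[object], List[BBox]],
    analyze_layout: Callable[[object, List[BBox]], List[Block]],
    recognize: Callable[[object, Block], str],
) -> str:
    """Toy flow: a line detector, then one VLM for layout and recognition."""
    lines = detect_lines(image)               # separate text-line detection model
    blocks = analyze_layout(image, lines)     # VLM call: layout + reading order
    for block in blocks:
        block.html = recognize(image, block)  # VLM call: OCR or table -> HTML
    blocks.sort(key=lambda b: b.order)        # emit blocks in reading order
    return "\n".join(b.html for b in blocks)
```

In a real deployment the two VLM steps would be requests to the vLLM or llama.cpp server; keeping them as injected callables lets the flow be exercised with stubs.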

Quick Start & Requirements

  • Installation: pip install surya-ocr
  • Prerequisites:
    • Inference Backend: vllm (NVIDIA GPU) or llama.cpp (CPU/Apple Silicon).
    • NVIDIA GPU: Docker and the NVIDIA Container Toolkit.
    • CPU/Apple Silicon: llama-server binary from llama.cpp (e.g., brew install llama.cpp on macOS).
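
Before installing, it can help to check which backend a machine is ready for. The sketch below is a heuristic only: `pick_backend` is a hypothetical helper, and using `nvidia-smi` on PATH as a stand-in for a working NVIDIA driver is an assumption (it does not verify that the NVIDIA Container Toolkit is configured). It relies only on the standard library's `shutil.which`.

```python
import shutil
from typing import Callable, Optional


def pick_backend(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Guess which inference backend this machine can host (heuristic).

    Checks PATH for the binaries named in the prerequisites; `which` is
    injectable so the logic can be tested without touching the real PATH.
    """
    # Assumption: docker + nvidia-smi on PATH approximates the GPU prerequisites.
    if which("docker") is not None and which("nvidia-smi") is not None:
        return "vllm"        # NVIDIA GPU path: vLLM via Docker
    if which("llama-server") is not None:
        return "llama.cpp"   # CPU / Apple Silicon path
    return None              # neither prerequisite found on PATH


if __name__ == "__main__":
    print(pick_backend() or "No backend found: install Docker + NVIDIA Container Toolkit, or llama.cpp")
```

The GPU path is checked first on the assumption that vLLM on an NVIDIA card is the faster option when both are available.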

Highlighted Details

  • Achieves 83.3% on olmOCR-bench, placing it among the top performers under 3B parameters.
  • Delivers 5 pages/second throughput on an RTX 5090.
  • Scores 87.2% on an internal 91-language benchmark.
  • Provides detailed layout analysis and reading order.
  • Supports robust table recognition (rows, columns, cells).
  • Outputs HTML, with math equations in <math> tags.
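
Because results come back as HTML, downstream code can pull equations and table cells out with the standard library alone. This is a minimal sketch assuming equations appear as text inside `<math>` elements and tables use ordinary `<table>`/`<tr>`/`<td>`/`<th>` markup; `HTMLOutputScanner` is a hypothetical name, not part of surya.

```python
from html.parser import HTMLParser
from typing import List, Optional


class HTMLOutputScanner(HTMLParser):
    """Collect <math> contents and table rows/cells from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.equations: List[str] = []       # text inside each <math>...</math>
        self.rows: List[List[str]] = []      # one list of cell strings per <tr>
        self._math_buf: Optional[List[str]] = None
        self._cell_buf: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            self._math_buf = []
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell_buf = []

    def handle_endtag(self, tag):
        if tag == "math" and self._math_buf is not None:
            self.equations.append("".join(self._math_buf).strip())
            self._math_buf = None
        elif tag in ("td", "th") and self._cell_buf is not None:
            if self.rows:  # ignore stray cells outside any <tr>
                self.rows[-1].append("".join(self._cell_buf).strip())
            self._cell_buf = None

    def handle_data(self, data):
        # Text can belong to an equation and a table cell at the same time.
        if self._math_buf is not None:
            self._math_buf.append(data)
        if self._cell_buf is not None:
            self._cell_buf.append(data)


if __name__ == "__main__":
    sample = (
        "<p>Energy: <math>E = mc^2</math></p>"
        "<table><tr><th>key</th><th>value</th></tr>"
        "<tr><td>a</td><td>1</td></tr></table>"
    )
    scanner = HTMLOutputScanner()
    scanner.feed(sample)
    scanner.close()
    print(scanner.equations)  # ['E = mc^2']
    print(scanner.rows)       # [['key', 'value'], ['a', '1']]
```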

Maintenance & Community

Developed by Vikas Paruchuri and the Datalab Team. No specific community channels or roadmap links are provided. Contact hi@datalab.to for fine-tuning assistance.

Licensing & Compatibility

Code is licensed under Apache 2.0. Model weights use a modified AI Pubs Open Rail-M license, which is free for research, personal use, and startups with under $5M in revenue. Broader commercial use requires a separate license, available via Datalab's pricing page.

Limitations & Caveats

Specialized for document intelligence; not optimized for natural scene photos. Core analysis tasks require a running inference backend (vLLM or llama.cpp). Accuracy can be sensitive to image resolution and quality, so some inputs may need preprocessing or threshold adjustments.
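
One common preprocessing step is resizing scans so small text is not lost and very large pages stay manageable. The sketch below computes an aspect-preserving target size; `rescale_dims` is a hypothetical helper, and the `min_side=1024` / `max_side=4096` thresholds are illustrative placeholders, not values documented by Surya.

```python
from typing import Tuple


def rescale_dims(width: int, height: int,
                 min_side: int = 1024, max_side: int = 4096) -> Tuple[int, int]:
    """Return (width, height) with short side >= min_side and long side <= max_side.

    Aspect ratio is preserved. The thresholds are placeholders to tune
    per workload; they are not Surya defaults.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    short_side, long_side = min(width, height), max(width, height)
    scale = 1.0
    if short_side < min_side:
        scale = min_side / short_side   # upscale small or low-resolution scans
    if long_side * scale > max_side:
        scale = max_side / long_side    # cap the long side; this wins over upscaling
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 512x768 scan would be upscaled to 1024x1536. The returned size can be passed to Pillow's `Image.resize` with a high-quality filter such as `Image.Resampling.LANCZOS` before sending the page for analysis.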

Health Check
  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 6
  • Star history: 1,079 stars in the last 30 days
