FireRed-OCR by FireRedTeam

LVLM specialization for pixel-precise structural document parsing

Created 4 months ago

287 stars

Top 91.3% on SourcePulse

Project Summary

Summary

FireRed-OCR tackles "structural hallucination" in general Large Vision-Language Models (LVLMs) when processing complex documents. It specializes these models into high-performance, pixel-precise structural document parsing experts. This framework targets engineers, researchers, and power users needing SOTA accuracy and structural integrity in document analysis, offering a significant benefit over models prone to errors like disordered rows or invented formulas.

How It Works

This project shifts from "impressionist" text generation to "structural engineering" by transforming general VLMs into structural experts via a three-stage progressive training strategy: Multi-task Pre-alignment (spatial grounding), Specialized SFT (logical consistency), and Format-Constrained GRPO (RL for self-correction). This approach enforces strict syntactic validity, eliminating common errors like unclosed tables or invalid LaTeX. It also employs a novel "Geometry + Semantics" Data Factory for synthesizing balanced datasets to handle diverse layouts.

Quick Start & Requirements

Installation requires pip install transformers and pip install qwen-vl-utils, followed by cloning the repository from GitHub. Inference involves loading the model and processor from HuggingFace, recommending torch.bfloat16 and flash_attention_2 for acceleration. The framework is based on the Qwen3-VL architecture. Links: HuggingFace, ModelScope, Demo, Technical Report.

Highlighted Details

Achieves SOTA performance with a 92.94% overall score on OmniDocBench v1.5, outperforming DeepSeek-OCR 2 and Gemini-3.0 Pro.
Ensures structural integrity via Format-Constrained GRPO, eliminating common errors in tables and formulas.
Features a novel "Geometry + Semantics" Data Factory for synthesizing balanced datasets, handling long-tail layouts.
Demonstrates superior in-the-wild robustness on complex, non-standard layouts (FireRedBench).

Maintenance & Community

Developed by "Super Intelligence Team, Xiaohongshu Inc." with a technical report on arXiv. No specific community channels (e.g., Discord, Slack) or active contributor details are provided.

Licensing & Compatibility

Licensed under Apache 2.0, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Prohibits use for illegal, defamatory, pornographic, harmful content, or privacy violations; users are solely responsible for misuse. Some benchmarked models are marked as restricted (🔒), indicating potential accessibility limitations.

FireRed-OCR by FireRedTeam

Explore Similar Projects

dots.mocr by rednote-hilab

AWESOME-OCR-LLM by Yuliang-Liu

SmartResume by alibaba

VisRAG by OpenBMB

Logics-Parsing by alibaba

HunyuanOCR by Tencent-Hunyuan

MegaParse by QuivrHQ

MonkeyOCR by Yuliang-Liu

GLM-OCR by zai-org

Dolphin by bytedance

dots.ocr by rednote-hilab

Unlimited-OCR by baidu