FireRed-OCR  by FireRedTeam

LVLM specialization for pixel-precise structural document parsing

Created 1 month ago
258 stars

Top 98.0% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

FireRed-OCR tackles "structural hallucination" in general Large Vision-Language Models (LVLMs) when processing complex documents. It specializes these models into high-performance, pixel-precise structural document parsing experts. This framework targets engineers, researchers, and power users needing SOTA accuracy and structural integrity in document analysis, offering a significant benefit over models prone to errors like disordered rows or invented formulas.

How It Works

This project shifts from "impressionist" text generation to "structural engineering" by transforming general VLMs into structural experts via a three-stage progressive training strategy: Multi-task Pre-alignment (spatial grounding), Specialized SFT (logical consistency), and Format-Constrained GRPO (RL for self-correction). This approach enforces strict syntactic validity, eliminating common errors like unclosed tables or invalid LaTeX. It also employs a novel "Geometry + Semantics" Data Factory for synthesizing balanced datasets to handle diverse layouts.

Quick Start & Requirements

Installation requires pip install transformers and pip install qwen-vl-utils, followed by cloning the repository from GitHub. Inference involves loading the model and processor from HuggingFace, recommending torch.bfloat16 and flash_attention_2 for acceleration. The framework is based on the Qwen3-VL architecture. Links: HuggingFace, ModelScope, Demo, Technical Report.

Highlighted Details

  • Achieves SOTA performance with a 92.94% overall score on OmniDocBench v1.5, outperforming DeepSeek-OCR 2 and Gemini-3.0 Pro.
  • Ensures structural integrity via Format-Constrained GRPO, eliminating common errors in tables and formulas.
  • Features a novel "Geometry + Semantics" Data Factory for synthesizing balanced datasets, handling long-tail layouts.
  • Demonstrates superior in-the-wild robustness on complex, non-standard layouts (FireRedBench).

Maintenance & Community

Developed by "Super Intelligence Team, Xiaohongshu Inc." with a technical report on arXiv. No specific community channels (e.g., Discord, Slack) or active contributor details are provided.

Licensing & Compatibility

Licensed under Apache 2.0, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Prohibits use for illegal, defamatory, pornographic, harmful content, or privacy violations; users are solely responsible for misuse. Some benchmarked models are marked as restricted (🔒), indicating potential accessibility limitations.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
3
Star History
40 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Paul Copplestone Paul Copplestone(Cofounder of Supabase), and
4 more.

MegaParse by QuivrHQ

0.1%
7k
File parser optimized for LLM ingestion
Created 1 year ago
Updated 1 year ago
Feedback? Help us improve.