Discover and explore top open-source AI tools and projects—updated daily.
FireRedTeamLVLM specialization for pixel-precise structural document parsing
Top 98.0% on SourcePulse
Summary
FireRed-OCR tackles "structural hallucination" in general Large Vision-Language Models (LVLMs) when processing complex documents. It specializes these models into high-performance, pixel-precise structural document parsing experts. This framework targets engineers, researchers, and power users needing SOTA accuracy and structural integrity in document analysis, offering a significant benefit over models prone to errors like disordered rows or invented formulas.
How It Works
This project shifts from "impressionist" text generation to "structural engineering" by transforming general VLMs into structural experts via a three-stage progressive training strategy: Multi-task Pre-alignment (spatial grounding), Specialized SFT (logical consistency), and Format-Constrained GRPO (RL for self-correction). This approach enforces strict syntactic validity, eliminating common errors like unclosed tables or invalid LaTeX. It also employs a novel "Geometry + Semantics" Data Factory for synthesizing balanced datasets to handle diverse layouts.
Quick Start & Requirements
Installation requires pip install transformers and pip install qwen-vl-utils, followed by cloning the repository from GitHub. Inference involves loading the model and processor from HuggingFace, recommending torch.bfloat16 and flash_attention_2 for acceleration. The framework is based on the Qwen3-VL architecture.
Links: HuggingFace, ModelScope, Demo, Technical Report.
Highlighted Details
Maintenance & Community
Developed by "Super Intelligence Team, Xiaohongshu Inc." with a technical report on arXiv. No specific community channels (e.g., Discord, Slack) or active contributor details are provided.
Licensing & Compatibility
Licensed under Apache 2.0, which is permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
Prohibits use for illegal, defamatory, pornographic, harmful content, or privacy violations; users are solely responsible for misuse. Some benchmarked models are marked as restricted (🔒), indicating potential accessibility limitations.
1 month ago
Inactive
QuivrHQ
rednote-hilab