Qianfan-VL by baidubce

Vision-language models for enterprise document understanding and reasoning

Created 6 months ago
356 stars

Top 78.5% on SourcePulse

Summary

Qianfan-VL is a family of domain-enhanced vision-language models designed for enterprise-grade visual understanding. It offers optimized solutions for high-frequency industrial deployment scenarios, including document parsing, OCR, and complex visual reasoning, while maintaining general multimodal capabilities for cloud or edge deployments.

How It Works

The models are trained with a four-stage progressive strategy: cross-modal alignment, general knowledge injection, domain-enhanced knowledge injection, and post-training alignment. The staging is intended to balance broad multimodal understanding with specialized capabilities. High-precision data synthesis pipelines complement it, constructing multi-task data through programmatic generation and traditional CV models to improve generalization in long-tail scenarios. The Qianfan-OCR model adds an optional "Layout-as-Thought" mechanism that reasons over document structure before generating output. Training runs on large-scale Kunlun chip clusters with a 3D parallel strategy and communication-computation fusion.
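
The stage ordering described above can be written down as a small data structure. This is only an illustrative sketch: the stage names come from the description, while `TrainingStage`, `next_stage`, and `schedule` are hypothetical names and do not reflect the project's actual training code.

```python
from enum import IntEnum
from typing import List, Optional


class TrainingStage(IntEnum):
    """Hypothetical labels for the four stages named in the summary."""
    CROSS_MODAL_ALIGNMENT = 1
    GENERAL_KNOWLEDGE_INJECTION = 2
    DOMAIN_ENHANCED_KNOWLEDGE_INJECTION = 3
    POST_TRAINING_ALIGNMENT = 4


def next_stage(stage: TrainingStage) -> Optional[TrainingStage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is TrainingStage.POST_TRAINING_ALIGNMENT:
        return None
    return TrainingStage(stage + 1)


def schedule() -> List[str]:
    """List the stage names in the order they are run."""
    return [stage.name for stage in sorted(TrainingStage)]


if __name__ == "__main__":
    for name in schedule():
        print(name)
```

The point of the ordering is that each stage builds on the previous one, so domain-specific data is introduced only after general alignment and knowledge injection.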

Quick Start & Requirements

Highlighted Details

  • Qianfan-OCR (4B): A recent release (March 2026) offering end-to-end document understanding, unifying layout analysis, table/formula/chart extraction, and KIE with "Layout-as-Thought."
  • Performance: Ranks #1 end-to-end on OmniDocBench v1.5 (93.12) and #1 overall on OCRBench (880), with a reported throughput of 1.024 pages/sec on a single A100.
  • Model Variants: Offers 3B, 8B, and 70B parameter models for diverse deployment needs, from edge to cloud.
  • Capabilities: Supports 192 languages, Chain-of-Thought reasoning (8B/70B), and complex visual reasoning tasks.
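
For rough capacity planning, the 1.024 pages/sec single-A100 figure quoted above can be turned into batch-time estimates. The sketch below assumes that rate is sustained and that throughput scales linearly with GPU count; the linear scaling is an idealization, not something the source claims. The function names are hypothetical.

```python
PAGES_PER_SEC_PER_A100 = 1.024  # throughput figure quoted in the summary


def pages_per_hour(num_gpus: int = 1) -> float:
    """Pages processed per hour, assuming ideal linear scaling across GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return PAGES_PER_SEC_PER_A100 * 3600 * num_gpus


def estimate_hours(total_pages: int, num_gpus: int = 1) -> float:
    """Hours needed to process `total_pages` at the assumed sustained rate."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return total_pages / pages_per_hour(num_gpus)


if __name__ == "__main__":
    print(f"{pages_per_hour():.1f} pages/hour on one A100")
    print(f"{estimate_hours(1_000_000, num_gpus=8):.1f} hours for 1M pages on 8 GPUs")
```

Real throughput depends on page complexity, batching, and the serving stack, so these numbers are best treated as an upper-bound starting point.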

Maintenance & Community

The project shows active development with recent news and releases in late 2025 and early 2026. Community interaction is primarily through GitHub Issues.

Licensing & Compatibility

Licensed under the permissive MIT License, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

Specific limitations are not explicitly detailed. The model family offers different sizes, implying trade-offs between capability, resource requirements, and deployment scenarios. The "Layout-as-Thought" feature is optional for Qianfan-OCR.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 176 stars in the last 30 days
