Qianfan-VL by baidubce

Vision-language models for enterprise document understanding and reasoning

Created 9 months ago

417 stars

Top 69.9% on SourcePulse

Project Summary

Summary

Qianfan-VL is a family of domain-enhanced vision-language models for enterprise visual understanding. It targets common industrial workloads such as document parsing, OCR, and complex visual reasoning, while retaining general multimodal capabilities, and is intended for both cloud and edge deployment.

How It Works

The models employ a novel four-stage progressive training strategy, moving from cross-modal alignment to general knowledge injection, then domain-enhanced knowledge injection, and finally post-training alignment. This approach aims to balance broad multimodal understanding with specialized capabilities. It is complemented by high-precision data synthesis pipelines that construct multi-task data using programmatic generation and traditional CV models, improving generalization in long-tail scenarios. The Qianfan-OCR model introduces an optional "Layout-as-Thought" mechanism for structured document reasoning before output generation. Training leverages large-scale Kunlun chip clusters with a 3D parallel strategy and communication-computation fusion.
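The staged curriculum described above can be pictured as a simple sequential loop in which each stage starts from the checkpoint produced by the previous one. The following is only an illustrative sketch: the stage names come from the description above, while `run_progressive`, `fake_train_stage`, and the dictionary-shaped checkpoint are hypothetical stand-ins, not the project's training code.

```python
from typing import Callable, Optional, Sequence

# Stage order as described in the summary above.
STAGES = (
    "cross_modal_alignment",
    "general_knowledge_injection",
    "domain_enhanced_knowledge_injection",
    "post_training_alignment",
)

# Hypothetical signature: (stage name, previous checkpoint) -> new checkpoint.
TrainFn = Callable[[str, Optional[dict]], dict]


def run_progressive(train_stage: TrainFn,
                    stages: Sequence[str] = STAGES) -> Optional[dict]:
    """Run the stages in order, feeding each one the previous checkpoint."""
    checkpoint: Optional[dict] = None
    for name in stages:
        checkpoint = train_stage(name, checkpoint)
    return checkpoint


def fake_train_stage(name: str, checkpoint: Optional[dict]) -> dict:
    """Stand-in trainer that only records which stages have run so far."""
    history = list(checkpoint["history"]) if checkpoint else []
    return {"history": history + [name]}


if __name__ == "__main__":
    final = run_progressive(fake_train_stage)
    print(final["history"])
```

The point of the sketch is the ordering constraint: domain-enhanced training builds on a model that already has general knowledge, rather than running in parallel with it.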

Quick Start & Requirements

Installation: pip install transformers torch torchvision pillow.
Prerequisites: Requires PyTorch and Transformers libraries. Example code utilizes torch.bfloat16 and device_map="auto", strongly implying GPU acceleration is necessary for efficient inference. Specific CUDA version or Python version requirements are not detailed.
Resources: Exact hardware requirements are not stated. Training used Kunlun P800 clusters and the reported throughput was measured on an A100, so comparable performance likely needs a data-center-class accelerator.
Links:
- Cookbook: https://github.com/baidubce/qianfan-models-cookbook
- HuggingFace Collection: https://huggingface.co/collections/baidubce/qianfan-vl-65f121331313221222
- GitHub: https://github.com/baidubce/Qianfan-VL
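A minimal loading sketch based on the prerequisites above is shown here. Several details are assumptions rather than confirmed facts: the model ID `baidu/Qianfan-VL-8B` is a placeholder to be replaced with an ID from the HuggingFace collection, the use of `AutoModel` with `trust_remote_code=True` assumes the checkpoint ships custom modeling code, and `device_map="auto"` additionally requires the `accelerate` package, which the install line above does not list.

```python
# Assumption: placeholder ID; pick the real one from the HuggingFace collection.
MODEL_ID = "baidu/Qianfan-VL-8B"


def load_model(model_id: str = MODEL_ID):
    """Load model and tokenizer in bfloat16 with automatic device placement."""
    # Heavy imports stay inside the function so importing this file is cheap.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights, as in the example code
        device_map="auto",           # needs `pip install accelerate`
        trust_remote_code=True,      # assumption: checkpoint uses custom model code
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer


if __name__ == "__main__":
    model, tokenizer = load_model()
    print(type(model).__name__)
```

How an image and prompt are passed to the model depends on the checkpoint's own inference helpers, so consult the cookbook for the actual generation call.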

Highlighted Details

Qianfan-OCR (4B): A recent release (March 2026) offering end-to-end document understanding, unifying layout analysis, table/formula/chart extraction, and KIE with "Layout-as-Thought."
Performance: Achieves top rankings on benchmarks like OmniDocBench v1.5 (93.12, #1 end-to-end) and OCRBench (880, #1 overall). Demonstrates high throughput: 1.024 pages/sec on a single A100.
Model Variants: Offers 3B, 8B, and 70B parameter models for diverse deployment needs, from edge to cloud.
Capabilities: Supports 192 languages, Chain-of-Thought reasoning (8B/70B), and complex visual reasoning tasks.
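For capacity planning, the quoted throughput can be turned into rough page and GPU-hour estimates. This is a back-of-the-envelope sketch: it takes the 1.024 pages/sec single-A100 figure at face value and assumes perfectly linear scaling across GPUs, which real deployments will not achieve exactly. The function names are illustrative.

```python
PAGES_PER_SECOND = 1.024  # single-A100 throughput quoted above


def pages_processed(seconds: float, gpus: int = 1,
                    pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Pages handled in `seconds`, assuming ideal linear scaling over `gpus`."""
    return pages_per_second * seconds * gpus


def gpu_hours_needed(pages: float,
                     pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Total GPU-hours to process `pages` at the quoted per-GPU rate."""
    return pages / pages_per_second / 3600.0


if __name__ == "__main__":
    print(f"pages per GPU-hour: {pages_processed(3600):.1f}")
    print(f"GPU-hours for 1M pages: {gpu_hours_needed(1_000_000):.1f}")
```

At the quoted rate, one A100 handles roughly 3,686 pages per hour, so a one-million-page backlog works out to about 271 GPU-hours. Actual throughput will vary with page complexity, batch size, and serving stack.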

Maintenance & Community

The project shows active development with recent news and releases in late 2025 and early 2026. Community interaction is primarily through GitHub Issues.

Licensing & Compatibility

Licensed under the permissive MIT License, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

Specific limitations are not explicitly detailed. The model family offers different sizes, implying trade-offs between capability, resource requirements, and deployment scenarios. The "Layout-as-Thought" feature is optional for Qianfan-OCR.
