Discover and explore top open-source AI tools and projects—updated daily.
Advanced vision-language model for complex multimodal tasks
Top 97.9% on SourcePulse
dots.vlm1 is an open-source vision-language model designed for advanced multimodal understanding and reasoning. It targets researchers and engineers working with complex visual data, offering strong performance, particularly in OCR, document analysis, and chart comprehension, while maintaining competitive text-based capabilities. The model provides a powerful, accessible alternative to proprietary systems.
How It Works
dots.vlm1 integrates a custom-trained 1.2 billion-parameter NaViT vision encoder with the DeepSeek V3 large language model. The NaViT encoder is built from scratch, supporting dynamic resolution and incorporating pure visual supervision alongside text supervision to maximize perceptual capacity. Training utilizes a diverse corpus including synthetic data, structured image data for OCR enhancement, and rewritten web data, aiming for superior performance across various multimodal tasks.
Quick Start & Requirements
Environment setup is streamlined via Docker, with a recommended pre-built image (rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
) providing immediate support. Alternatively, a manual installation requires cloning a custom SGLang branch (https://github.com/rednote-hilab/sglang/pull/8778
for the PR). Multi-node deployment is supported, requiring careful configuration of IP addresses and launch parameters. Key parameters include --tp 16
for tensor parallelism, --nnodes
for cluster size, --context-length 65536
, and --quantization fp8
. The model exposes an OpenAI-compatible API for integration.
Highlighted Details
charxiv(dq)
(92.1) and DOCVQA
(96.52).MathVista
and 70.11 on MMMU_pro
.Maintenance & Community
The project is maintained by rednote-hilab, with acknowledgments to the DeepSeek team. Links to a blog post detailing the model and HuggingFace weights are provided. No specific community channels (e.g., Discord, Slack) or roadmap links are present in the README.
Licensing & Compatibility
The license type and compatibility for commercial use or closed-source linking are not specified in the provided README content.
Limitations & Caveats
The project relies on a custom, unmerged branch of the SGLang library, potentially introducing dependency management complexities. Multi-node deployment requires careful setup and significant GPU resources. FP8 quantization may lead to minor trade-offs in precision for certain tasks.
2 weeks ago
Inactive