dots.vlm1 by rednote-hilab

Advanced vision-language model for complex multimodal tasks

Created 2 months ago
259 stars

Top 97.9% on SourcePulse

Project Summary

dots.vlm1 is an open-source vision-language model designed for advanced multimodal understanding and reasoning. It targets researchers and engineers working with complex visual data, offering strong performance, particularly in OCR, document analysis, and chart comprehension, while maintaining competitive text-based capabilities. The model provides a powerful, accessible alternative to proprietary systems.

How It Works

dots.vlm1 pairs a custom-trained 1.2-billion-parameter NaViT vision encoder with the DeepSeek V3 large language model. The NaViT encoder is trained from scratch, supports dynamic resolution, and combines pure visual supervision with text supervision to maximize perceptual capacity. Training draws on a diverse corpus that includes synthetic data, structured image data for OCR enhancement, and rewritten web data, with the goal of strong performance across a broad range of multimodal tasks.
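The layout described above, a standalone vision encoder whose outputs are projected into a decoder-style LLM, can be illustrated with a toy PyTorch sketch. Everything below (module sizes, class names, the two-layer stand-ins for NaViT and DeepSeek V3) is hypothetical and only mirrors the general pipeline, not the actual dots.vlm1 implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    """Toy encoder -> projector -> LLM pipeline; sizes are illustrative only."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for the NaViT encoder (dynamic resolution -> variable patch count).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projector mapping visual tokens into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the DeepSeek V3 language model.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, num_patches, vision_dim); with dynamic
        # resolution, num_patches varies from image to image.
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.text_embed(text_ids)
        # Concatenate visual and text tokens and decode them jointly.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(sequence))

# Toy usage: one "image" of 64 patches plus a 16-token prompt.
model = VisionLanguageSketch()
logits = model(torch.randn(1, 64, 256), torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # torch.Size([1, 80, 1000])
```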

Quick Start & Requirements

Environment setup is streamlined via Docker, with a recommended pre-built image (rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126) that works out of the box. Alternatively, manual installation requires building against a custom SGLang branch (see the pull request at https://github.com/rednote-hilab/sglang/pull/8778). Multi-node deployment is supported but requires careful configuration of node IP addresses and launch parameters. Key parameters include --tp 16 for tensor parallelism, --nnodes for the cluster size, --context-length 65536, and --quantization fp8. The model exposes an OpenAI-compatible API for integration.
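Because the server exposes an OpenAI-compatible API, a standard OpenAI client can be pointed at it once the server is up. The sketch below is an assumption-laden example: the port (SGLang's usual default is 30000), model name, weights path, and image URL are placeholders, and the launch command in the comment only restates the flags listed above; consult the README for the exact command.

```python
# Illustrative client for the OpenAI-compatible endpoint served by SGLang.
# The server is assumed to have been launched roughly like this
# (paths and addresses are placeholders, flags are from the README):
#   python -m sglang.launch_server --model-path <path-to-dots.vlm1-weights> \
#       --tp 16 --nnodes 2 --context-length 65536 --quantization fp8
from openai import OpenAI

# The API key is unused by a local server but the client requires a value.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dots.vlm1",  # placeholder; use the model name the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Summarize the key trend shown in this chart."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```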

Highlighted Details

  • Achieves near state-of-the-art performance on a range of benchmarks, notably excelling in OCR and document understanding tasks such as CharXiv (DQ) (92.1) and DocVQA (96.52).
  • Demonstrates strong reasoning capabilities, scoring 85.0 on MathVista and 70.11 on MMMU-Pro.
  • The NaViT vision encoder is trained from scratch, offering enhanced perceptual capacity through dynamic resolution and pure visual supervision.
  • Supports multi-node deployment with tensor parallelism (--tp 16) for scalable inference.

Maintenance & Community

The project is maintained by rednote-hilab, with acknowledgments to the DeepSeek team. The README links to a blog post detailing the model and to the model weights on Hugging Face. No community channels (e.g., Discord, Slack) or roadmap links are listed in the README.

Licensing & Compatibility

The license type and compatibility for commercial use or closed-source linking are not specified in the provided README content.

Limitations & Caveats

The project relies on a custom, unmerged branch of the SGLang library, which can complicate dependency management. Multi-node deployment requires careful setup and significant GPU resources. FP8 quantization may introduce minor precision trade-offs on some tasks.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

16 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Top 0.1% on SourcePulse · 4k stars
Open-source framework for training large multimodal models
Created 3 years ago · Updated 1 year ago