dots.vlm1 by rednote-hilab

Advanced vision-language model for complex multimodal tasks

Created 2 months ago
259 stars

Top 97.9% on SourcePulse

Project Summary

dots.vlm1 is an open-source vision-language model designed for advanced multimodal understanding and reasoning. It targets researchers and engineers working with complex visual data, offering strong performance, particularly in OCR, document analysis, and chart comprehension, while maintaining competitive text-based capabilities. The model provides a powerful, accessible alternative to proprietary systems.

How It Works

dots.vlm1 pairs a custom-trained 1.2-billion-parameter NaViT vision encoder with the DeepSeek V3 large language model. The NaViT encoder is trained from scratch, supports dynamic resolution, and combines pure visual supervision with text supervision to maximize perceptual capacity. Training draws on a diverse corpus that includes synthetic data, structured image data for OCR enhancement, and rewritten web data, with the goal of strong performance across a broad range of multimodal tasks.
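The layout described above, a standalone vision encoder whose outputs are projected into a decoder-style LLM, can be illustrated with a toy PyTorch sketch. Everything below (module sizes, class names, the two-layer stand-ins for NaViT and DeepSeek V3) is hypothetical and only mirrors the general pipeline, not the actual dots.vlm1 implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    """Toy encoder -> projector -> LLM pipeline; sizes are illustrative only."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for the NaViT encoder (dynamic resolution -> variable patch count).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projector mapping visual tokens into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the DeepSeek V3 language model.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, num_patches, vision_dim); with dynamic
        # resolution, num_patches varies from image to image.
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.text_embed(text_ids)
        # Concatenate visual and text tokens and decode them jointly.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(sequence))

# Toy usage: one "image" of 64 patches plus a 16-token prompt.
model = VisionLanguageSketch()
logits = model(torch.randn(1, 64, 256), torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # torch.Size([1, 80, 1000])
```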

Quick Start & Requirements

Environment setup is streamlined via Docker, with a recommended pre-built image (rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126) that works out of the box. Alternatively, manual installation requires building against a custom SGLang branch (see the pull request at https://github.com/rednote-hilab/sglang/pull/8778). Multi-node deployment is supported but requires careful configuration of node IP addresses and launch parameters. Key parameters include --tp 16 for tensor parallelism, --nnodes for the cluster size, --context-length 65536, and --quantization fp8. The model exposes an OpenAI-compatible API for integration.
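Because the server exposes an OpenAI-compatible API, a standard OpenAI client can be pointed at it once the server is up. The sketch below is an assumption-laden example: the port (SGLang's usual default is 30000), model name, weights path, and image URL are placeholders, and the launch command in the comment only restates the flags listed above; consult the README for the exact command.

```python
# Illustrative client for the OpenAI-compatible endpoint served by SGLang.
# The server is assumed to have been launched roughly like this
# (paths and addresses are placeholders, flags are from the README):
#   python -m sglang.launch_server --model-path <path-to-dots.vlm1-weights> \
#       --tp 16 --nnodes 2 --context-length 65536 --quantization fp8
from openai import OpenAI

# The API key is unused by a local server but the client requires a value.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dots.vlm1",  # placeholder; use the model name the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Summarize the key trend shown in this chart."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```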

Highlighted Details

  • Achieves near state-of-the-art performance on a range of benchmarks, notably excelling in OCR and document understanding tasks such as CharXiv (DQ) (92.1) and DocVQA (96.52).
  • Demonstrates strong reasoning capabilities, scoring 85.0 on MathVista and 70.11 on MMMU-Pro.
  • The NaViT vision encoder is trained from scratch, offering enhanced perceptual capacity through dynamic resolution and pure visual supervision.
  • Supports multi-node deployment with tensor parallelism (--tp 16) for scalable inference.

Maintenance & Community

The project is maintained by rednote-hilab, with acknowledgments to the DeepSeek team. The README links to a blog post detailing the model and to the model weights on Hugging Face. No community channels (e.g., Discord, Slack) or roadmap links are listed in the README.

Licensing & Compatibility

The license type and compatibility for commercial use or closed-source linking are not specified in the provided README content.

Limitations & Caveats

The project relies on a custom, unmerged branch of the SGLang library, which can complicate dependency management. Multi-node deployment requires careful setup and significant GPU resources. FP8 quantization may introduce minor precision trade-offs on some tasks.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

16 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Top 0.1% on SourcePulse · 4k stars
Open-source framework for training large multimodal models
Created 3 years ago · Updated 1 year ago