MOSS-VL by OpenMOSS

Advanced multimodal model for deep visual and video understanding

Created 3 months ago

273 stars

Top 94.3% on SourcePulse

Project Summary

MOSS-VL is the core multimodal model series from OpenMOSS, engineered for advanced visual understanding, particularly complex video comprehension. It targets researchers and power users seeking robust generalization and intricate vision-language correlations, offering a systematic scaling strategy across data, parameters, and context for long-form video reasoning.

How It Works

MOSS-VL employs a cross-attention architecture that decouples visual encoding from cognitive reasoning, significantly reducing latency for dynamic video streams. It natively supports interleaved modalities within a unified pipeline. Key innovations include: Absolute Timestamps injected with frames for precise temporal grounding, enabling variable FPS handling and fine-grained action localization. Cross-attention RoPE (XRoPE) maps text and video into a unified 3D (Time, Height, Width) coordinate space, optimizing cross-modal alignment and precise spatio-temporal localization within the video volume.

Quick Start & Requirements

Installation: Requires Python 3.12. Install via pip install -r requirements.txt after setting up a conda environment.
Prerequisites: torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2". GPU recommended for device_map="auto".
Demos & Docs: HuggingFace Demo available. Inference and fine-tuning documentation are provided in respective subdirectories.

Highlighted Details

Performance: Achieves state-of-the-art results, consistently ranking top-tier against baselines like Qwen2.5-VL and Qwen3-VL.
Video Intelligence: Leads in Video Understanding (65.8 score), excelling in temporal consistency and action recognition across benchmarks like VSI-bench.
Multimodal Capabilities: Delivers outstanding general image-text understanding, fine-grained recognition, spatial reasoning, and robust logical inference.
Document Analysis: Provides dependable OCR and document understanding capabilities (83.9 score).

Maintenance & Community

Developed by the OpenMOSS Team, with acknowledgements to NVIDIA, Qwen Team, and SGLang Team for infrastructure and tooling support. Upcoming roadmap items include full training code, a real-time video model, and RL post-training. No direct community links (Discord/Slack) are provided.

Licensing & Compatibility

The specific open-source license is not detailed in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The project is actively developing; full training code is yet to be released. Reinforcement Learning from Human Feedback (RLHF) training is ongoing. Real-time video understanding is a future development goal.

Health Check

Last Commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

15 stars in the last 30 days