OpenMOSS: Advanced multimodal model for deep visual and video understanding
Top 99.3% on SourcePulse
MOSS-VL is the core multimodal model series from OpenMOSS, engineered for advanced visual understanding, particularly complex video comprehension. It targets researchers and power users seeking robust generalization and intricate vision-language correlations, offering a systematic scaling strategy across data, parameters, and context for long-form video reasoning.
How It Works
MOSS-VL employs a cross-attention architecture that decouples visual encoding from cognitive reasoning, which reduces latency on dynamic video streams. It natively supports interleaved modalities within a unified pipeline. Key innovations include:
Absolute timestamps: a timestamp is injected with each frame for precise temporal grounding, which enables variable-FPS handling and fine-grained action localization.
Cross-attention RoPE (XRoPE): text and video are mapped into a unified 3D (Time, Height, Width) coordinate space, improving cross-modal alignment and spatio-temporal localization within the video volume.
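To make the two ideas above concrete, here is a minimal sketch of how absolute timestamps and a (Time, Height, Width) coordinate for each visual patch could be computed. It is an illustration only, not the MOSS-VL implementation: the function names `frame_timestamps` and `patch_coordinates` are hypothetical, and the sketch does not cover how XRoPE positions text tokens or applies the rotary embedding itself, since the summary does not describe that.

```python
from typing import List, Tuple


def frame_timestamps(num_frames: int, fps: float, start: float = 0.0) -> List[float]:
    """Absolute timestamp in seconds for each frame sampled at `fps`, beginning at `start`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [start + i / fps for i in range(num_frames)]


def patch_coordinates(
    timestamps: List[float], grid_h: int, grid_w: int
) -> List[Tuple[float, int, int]]:
    """One (time, height, width) coordinate per patch, frame by frame in row-major order."""
    return [
        (t, h, w)
        for t in timestamps
        for h in range(grid_h)
        for w in range(grid_w)
    ]


# Variable FPS: two frames sampled at 2 fps, then two frames at 1 fps.
# Because the timestamps are absolute, the change in sampling rate needs no special handling.
timestamps = frame_timestamps(2, fps=2.0) + frame_timestamps(2, fps=1.0, start=1.0)
coords = patch_coordinates(timestamps, grid_h=2, grid_w=2)
```

Because the time axis carries seconds rather than a frame index, two clips sampled at different rates land in the same coordinate space, which is the property the timestamp design relies on.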
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt after setting up a conda environment.
Load the model with torch_dtype=torch.bfloat16 and attn_implementation="flash_attention_2".
A GPU is recommended when using device_map="auto".
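The settings above can be passed to a Hugging Face Transformers loader roughly as follows. This is a sketch under stated assumptions: `MODEL_ID` is a placeholder rather than a real checkpoint name, and the use of `AutoModelForCausalLM`, `AutoProcessor`, and `trust_remote_code=True` is assumed, not confirmed by the summary; the project README has the actual loading code.

```python
from typing import Any, Dict

# Placeholder, not a real checkpoint name: substitute the model ID or local path
# given in the MOSS-VL README.
MODEL_ID = "path/to/moss-vl-checkpoint"


def load_kwargs(dtype: Any) -> Dict[str, Any]:
    """Collect the loading options named in the Quick Start section."""
    return {
        "torch_dtype": dtype,
        "attn_implementation": "flash_attention_2",
        "device_map": "auto",
        # Assumption: the checkpoint ships custom modeling code.
        "trust_remote_code": True,
    }


def load_model(model_id: str = MODEL_ID):
    """Load processor and model; the Auto classes used here are an assumption."""
    # Imported lazily so the helper above can be inspected without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs(torch.bfloat16))
    return processor, model
```

Note that `attn_implementation="flash_attention_2"` requires the separate flash-attn package and a supported GPU, and `device_map="auto"` relies on the accelerate package being installed.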
Maintenance & Community
Developed by the OpenMOSS Team, with acknowledgements to NVIDIA, Qwen Team, and SGLang Team for infrastructure and tooling support. Upcoming roadmap items include full training code, a real-time video model, and RL post-training. No direct community links (Discord/Slack) are provided.
Licensing & Compatibility
The specific open-source license is not detailed in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.
Limitations & Caveats
The project is under active development. Full training code has not yet been released, RL post-training is still in progress, and real-time video understanding remains a future goal.