Vision-language model for complex reasoning
MiMo-VL-7B is a compact yet powerful vision-language model (VLM) designed for complex reasoning tasks. It targets researchers and developers seeking state-of-the-art open-source VLMs, and attributes its performance to the multi-stage pre-training and reinforcement-learning post-training described below.
How It Works
MiMo-VL-7B combines a native-resolution ViT encoder for fine-grained visual detail, an MLP projector for efficient cross-modal alignment, and the MiMo-7B language model. Pre-training proceeds in four stages: projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). A subsequent post-training phase applies Mixed On-policy Reinforcement Learning (MORL), which integrates diverse reward signals covering perception, grounding, reasoning, and human preferences, with the goal of improving performance across modalities and tasks.
Quick Start & Requirements
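The README itself does not spell out a quick-start recipe here, so the snippet below is only a minimal inference sketch under stated assumptions: that the checkpoint is published on the Hugging Face Hub under a repo id like the placeholder shown, that it loads through the generic image-text-to-text interface in a recent version of transformers, and that its chat template accepts interleaved image and text content. Verify the exact repo id, model class, and requirements against the official model card.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder repo id -- check the official model card for the exact checkpoint name.
model_id = "XiaomiMiMo/MiMo-VL-7B-RL"

# Assumes a recent transformers release that provides the image-text-to-text auto class.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# One image plus a reasoning-style question, formatted through the model's chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show? Explain step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Local example image path; replace with your own input.
image = Image.open("chart.png")
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=512)
answer = processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```

For long reasoning traces or batched serving, a dedicated inference engine such as vLLM is the more common deployment path; the transformers sketch above is meant only to illustrate the image-plus-text flow through the ViT encoder, projector, and language model described under How It Works.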
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that achieving stable simultaneous improvements across diverse data domains during MORL training remains challenging due to potential interference.