Vision-language model for complex reasoning
MiMo-VL-7B is a compact yet powerful vision-language model (VLM) designed for complex reasoning tasks. It targets researchers and developers seeking state-of-the-art open-source VLMs, and attributes its performance to the multi-stage pre-training and reinforcement-learning post-training described below.
How It Works
MiMo-VL-7B combines a native-resolution ViT encoder for fine-grained visual detail, an MLP projector for efficient cross-modal alignment, and the MiMo-7B language model. Pre-training proceeds in four stages: projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). A subsequent post-training phase applies Mixed On-policy Reinforcement Learning (MORL), which integrates diverse reward signals covering perception, grounding, reasoning, and human preferences, with the goal of improving performance across modalities and tasks.
Quick Start & Requirements
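The README itself does not spell out a quick-start recipe here, so the snippet below is only a minimal inference sketch under stated assumptions: that the checkpoint is published on the Hugging Face Hub under a repo id like the placeholder shown, that it loads through the generic image-text-to-text interface in a recent version of transformers, and that its chat template accepts interleaved image and text content. Verify the exact repo id, model class, and requirements against the official model card.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder repo id -- check the official model card for the exact checkpoint name.
model_id = "XiaomiMiMo/MiMo-VL-7B-RL"

# Assumes a recent transformers release that provides the image-text-to-text auto class.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# One image plus a reasoning-style question, formatted through the model's chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show? Explain step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Local example image path; replace with your own input.
image = Image.open("chart.png")
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=512)
answer = processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```

For long reasoning traces or batched serving, a dedicated inference engine such as vLLM is the more common deployment path; the transformers sketch above is meant only to illustrate the image-plus-text flow through the ViT encoder, projector, and language model described under How It Works.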
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that achieving stable simultaneous improvements across diverse data domains during MORL training remains challenging due to potential interference.