MiMo-VL by XiaomiMiMo

Vision-language model for complex reasoning

Project Summary

MiMo-VL-7B is a compact yet powerful vision-language model (VLM) designed for complex reasoning tasks. It targets researchers and developers seeking state-of-the-art open-source VLMs, offering enhanced performance through a novel training methodology.

How It Works

MiMo-VL-7B combines a native-resolution ViT encoder for fine-grained visual detail, an MLP projector for efficient cross-modal alignment, and the MiMo-7B language model. Training follows a four-stage pre-training process: projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). A subsequent post-training phase introduces Mixed On-policy Reinforcement Learning (MORL), which integrates diverse reward signals for perception, grounding, reasoning, and human preferences (sketched below). This approach aims to improve performance across multiple modalities and tasks.
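
Since MORL is the distinctive piece of this recipe, the following minimal Python sketch (not the authors' code) illustrates the mixed-reward idea: each on-policy rollout is scored by several signal families, and a weighted sum produces the scalar reward that drives the policy update. All names, weights, and scoring rules are hypothetical stand-ins for the perception, grounding, reasoning, and preference signals named above.

```python
# Illustrative sketch of MORL-style mixed rewards; everything here is a
# hypothetical stand-in, not the released training code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewardSource:
    name: str
    weight: float
    score: Callable[[str, dict], float]  # (response, task metadata) -> scalar

def mixed_reward(response: str, task: dict, sources: list[RewardSource]) -> float:
    """Weighted sum of heterogeneous reward signals for one on-policy rollout."""
    return sum(s.weight * s.score(response, task) for s in sources)

# Toy stand-ins for the four signal families named in the README.
sources = [
    RewardSource("perception", 0.25, lambda r, t: float(t["answer"] in r)),
    RewardSource("grounding",  0.25, lambda r, t: t.get("box_iou", 0.0)),
    RewardSource("reasoning",  0.25, lambda r, t: float(r.strip().endswith(t["answer"]))),
    RewardSource("preference", 0.25, lambda r, t: t.get("rm_score", 0.0)),
]

rollout = "The chart peaks in March, so the answer is 42"
task = {"answer": "42", "box_iou": 0.8, "rm_score": 0.6}
print(mixed_reward(rollout, task, sources))  # 0.25 * (1 + 0.8 + 1 + 0.6) = 0.85
```

In the actual training loop these scores would come from verifiers and a learned reward model rather than lambdas, and balancing the weights across signal families is exactly where the interference noted under Limitations & Caveats arises.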

Quick Start & Requirements

  • Models are available on HuggingFace and ModelScope.
  • Compatible with the Qwen2_5_VLForConditionalGeneration architecture for deployment; see the inference sketch after this list.
  • No specific hardware requirements are mentioned beyond typical VLM inference needs.
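
Given the stated architecture compatibility, inference should follow the standard Qwen2.5-VL usage pattern in HuggingFace transformers. The sketch below is a minimal example under that assumption; the checkpoint ID and the image URL are placeholders (check the HuggingFace/ModelScope pages for the actual model names), and it relies on the qwen-vl-utils helper package from the Qwen2.5-VL ecosystem.

```python
# Inference sketch following the Qwen2.5-VL usage pattern; the model ID below
# is an assumption -- verify it on HuggingFace/ModelScope before running.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/screenshot.png"},
        {"type": "text", "text": "Describe the UI and locate the submit button."},
    ],
}]

# Render the chat template, extract vision inputs, and tokenize everything.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = generated[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Because the architecture matches Qwen2.5-VL, the same pattern should carry over to serving stacks that already support that model family.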

Highlighted Details

  • Achieves state-of-the-art open-source results in general visual-language understanding and multi-modal reasoning tasks.
  • Demonstrates exceptional GUI understanding and grounding capabilities, comparable to specialized models.
  • Achieves the highest Elo rating among evaluated open-source VLMs (7B to 72B parameters) based on GPT-4o judgments.
  • Incorporates high-quality, broad-coverage reasoning data directly into later pre-training stages for performance gains.

Maintenance & Community

  • Developed by the LLM-Core-Team Xiaomi.
  • Contact available via email (mimo@xiaomi.com) or GitHub issues.

Licensing & Compatibility

  • No license information is explicitly stated in the provided README.
  • Compatibility with Qwen2_5_VLForConditionalGeneration suggests the model can slot into tooling that already supports the Qwen2.5-VL family.

Limitations & Caveats

The README notes that achieving stable, simultaneous improvements across diverse data domains during MORL training remains challenging, as the different reward signals can interfere with one another.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5