MM-Eureka-V0 by FanqingM

Multimodal training code for geometry problem solving

created 5 months ago
313 stars

Top 87.4% on sourcepulse

Project Summary

This repository, MM-Eureka-V0 (also known as R1-Multimodal-Journey), addresses challenges in multimodal reasoning, particularly for complex tasks like geometry problems. It targets researchers and engineers working with vision-language models (VLMs), aiming to improve their reasoning capabilities and training efficiency. The project offers a faster training process and explores reinforcement learning techniques for VLMs.

How It Works

MM-Eureka-V0 enhances training speed by integrating vLLM, achieving a 5-6x speedup over previous implementations. It explores reinforcement learning (RL) strategies in the style of DeepSeek-R1 to improve performance on challenging geometry problems, training on a subset of the geo170k dataset. The project notes that "aha moments" can emerge early in training, even with smaller models.
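The speedup comes from routing rollout generation through vLLM's batched inference instead of standard Hugging Face generation. Below is a minimal sketch of that rollout step; the model name, prompt template, and sampling settings are illustrative assumptions, not the repository's exact configuration.

```python
# Minimal sketch of a vLLM rollout step for R1-style RL; prompt template and
# sampling settings are assumptions, not the repo's exact config.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", trust_remote_code=True)
# n > 1 yields several candidate reasoning chains per problem, as GRPO-style RL expects.
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=1024, n=8)

inputs = [{
    "prompt": "<|image_pad|> Solve the geometry problem and state the final answer.",
    "multi_modal_data": {"image": Image.open("example_geometry.png")},
}]
outputs = llm.generate(inputs, sampling)

for request in outputs:
    for candidate in request.outputs:
        print(candidate.text)  # each candidate is later scored by a rule-based reward
```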

Quick Start & Requirements

  • Install: pip install vllm==0.7.2 trl==0.15.0.dev0
  • Prerequisites: Transformers 4.49.0.dev0 (for Qwen2.5-VL), Python, CUDA (implied by vLLM).
  • Data Preparation: Modify paths in local_scripts/gen_dataset.py and run python local_scripts/gen_dataset.py. Images are stored as file paths rather than PIL objects for vLLM compatibility (see the sketch after this list).
  • Training: Modify paths in local_scripts/train_qwen2_5_3b.sh and run sh local_scripts/train_qwen2_5_3b.sh.
  • Evaluation: python eval/evaluate_mathvista.py --checkpoint ${CHECKPOINT} --datasets MathVista_testmini
  • Links: Latest progress: https://github.com/ModalMinds/MM-EUREKA, Environment setup: https://github.com/FanqingM/R1-Multimodal-Journey
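To illustrate the path-based storage mentioned in the data-preparation step, here is a hypothetical record layout; the field names are assumptions, so check local_scripts/gen_dataset.py for the real schema.

```python
# Hypothetical sketch of a path-based training record; field names are assumptions.
import json

record = {
    "image": "/path/to/geo170k/images/0001.png",   # stored as a path, not a PIL object
    "problem": "In triangle ABC, ... find angle BAC.",
    "solution": "60",                               # reference answer used for reward checking
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```

Storing paths keeps the dataset lightweight; images are only opened at rollout time, when they are handed to vLLM as multimodal inputs.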

Highlighted Details

  • Achieves 5-6x faster training using vLLM.
  • Demonstrates "aha moments" in reasoning early in training.
  • Reinforcement learning shows higher data efficiency than SFT for answer correctness (a sketch of a rule-based correctness reward follows this list).
  • VLMs struggle to replicate the response-length growth patterns seen in text-only LLMs during RL training.
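The data-efficiency claim rests on a verifiable, rule-based reward rather than learned preferences. A minimal sketch of such a correctness reward follows; the answer-tag format and exact-match logic are assumptions, not the repository's precise reward function.

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0.

    Assumes answers are wrapped in <answer>...</answer> tags; the tag format
    and string matching are illustrative, not the repo's exact implementation.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0
```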

Maintenance & Community

Core contributors include Lingxiao Du, Xiangyan Liu, and Fanqing Meng. Project leaders are Wenqi Shao and Qiaosheng Zhang. Interns are being sought at Shanghai AI Lab.

Licensing & Compatibility

The README does not explicitly state the license. It mentions building upon Open-R1-Multimodal, vLLM, and trl, and gratitude towards DeepSeek-R1 and Qwen2.5-VL, suggesting potential compatibility with their licenses.

Limitations & Caveats

VLMs appear to struggle with response-length growth and require high-quality multimodal reasoning data, which is scarce. The project notes that overly simple datasets can lead to overfitting. By default, vLLM generation runs on cuda:7, which can block training on systems with fewer GPUs (see the sketch below).
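If fewer than eight GPUs are available, the vLLM device can usually be redirected. The sketch below assumes the training script goes through trl's GRPOConfig (the vllm_device field exists in trl 0.15); how this particular fork wires the setting into its shell scripts may differ.

```python
# Hedged sketch: overriding the dedicated vLLM generation GPU via trl's GRPOConfig.
# Assumes the fork exposes trl 0.15's config; adjust local_scripts/train_qwen2_5_3b.sh accordingly.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/qwen2_5_vl_3b_grpo",
    use_vllm=True,          # route rollout generation through vLLM
    vllm_device="cuda:1",   # override the default dedicated GPU (cuda:7 in the repo's scripts)
    vllm_gpu_memory_utilization=0.7,
)
```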

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days
