MM-Eureka-V0 by FanqingM

Multimodal training code for geometry problem solving

created 5 months ago
313 stars

Top 87.4% on sourcepulse

Project Summary

This repository, MM-Eureka-V0 (also known as R1-Multimodal-Journey), addresses challenges in multimodal reasoning, particularly for complex tasks like geometry problems. It targets researchers and engineers working with vision-language models (VLMs), aiming to improve their reasoning capabilities and training efficiency. The project offers a faster training process and explores reinforcement learning techniques for VLMs.

How It Works

MM-Eureka-V0 enhances training speed by integrating vLLM, achieving a 5-6x speedup over previous implementations. It explores reinforcement learning (RL) strategies in the style of DeepSeek-R1 to improve performance on challenging geometry problems, training on a subset of the geo170k dataset. The project notes that "aha moments" can emerge early in training, even with smaller models.
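The speedup comes from routing rollout generation through vLLM's batched inference instead of standard Hugging Face generation. Below is a minimal sketch of that rollout step; the model name, prompt template, and sampling settings are illustrative assumptions, not the repository's exact configuration.

```python
# Minimal sketch of a vLLM rollout step for R1-style RL; prompt template and
# sampling settings are assumptions, not the repo's exact config.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", trust_remote_code=True)
# n > 1 yields several candidate reasoning chains per problem, as GRPO-style RL expects.
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=1024, n=8)

inputs = [{
    "prompt": "<|image_pad|> Solve the geometry problem and state the final answer.",
    "multi_modal_data": {"image": Image.open("example_geometry.png")},
}]
outputs = llm.generate(inputs, sampling)

for request in outputs:
    for candidate in request.outputs:
        print(candidate.text)  # each candidate is later scored by a rule-based reward
```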

Quick Start & Requirements

  • Install: pip install vllm==0.7.2 trl==0.15.0.dev0
  • Prerequisites: Transformers 4.49.0.dev0 (for Qwen2.5-VL), Python, CUDA (implied by vLLM).
  • Data Preparation: Modify paths in local_scripts/gen_dataset.py and run python local_scripts/gen_dataset.py. Images are stored as file paths rather than PIL objects for vLLM compatibility (see the sketch after this list).
  • Training: Modify paths in local_scripts/train_qwen2_5_3b.sh and run sh local_scripts/train_qwen2_5_3b.sh.
  • Evaluation: python eval/evaluate_mathvista.py --checkpoint ${CHECKPOINT} --datasets MathVista_testmini
  • Links: Latest progress: https://github.com/ModalMinds/MM-EUREKA, Environment setup: https://github.com/FanqingM/R1-Multimodal-Journey
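To illustrate the path-based storage mentioned in the data-preparation step, here is a hypothetical record layout; the field names are assumptions, so check local_scripts/gen_dataset.py for the real schema.

```python
# Hypothetical sketch of a path-based training record; field names are assumptions.
import json

record = {
    "image": "/path/to/geo170k/images/0001.png",   # stored as a path, not a PIL object
    "problem": "In triangle ABC, ... find angle BAC.",
    "solution": "60",                               # reference answer used for reward checking
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```

Storing paths keeps the dataset lightweight; images are only opened at rollout time, when they are handed to vLLM as multimodal inputs.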

Highlighted Details

  • Achieves 5-6x faster training using vLLM.
  • Demonstrates "aha moments" in reasoning early in training.
  • Reinforcement learning shows higher data efficiency than SFT for answer correctness (a sketch of a rule-based correctness reward follows this list).
  • VLMs struggle to replicate the response-length growth patterns seen in text-only LLMs during RL training.
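The data-efficiency claim rests on a verifiable, rule-based reward rather than learned preferences. A minimal sketch of such a correctness reward follows; the answer-tag format and exact-match logic are assumptions, not the repository's precise reward function.

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0.

    Assumes answers are wrapped in <answer>...</answer> tags; the tag format
    and string matching are illustrative, not the repo's exact implementation.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0
```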

Maintenance & Community

Core contributors include Lingxiao Du, Xiangyan Liu, and Fanqing Meng. Project leaders are Wenqi Shao and Qiaosheng Zhang. Interns are being sought at Shanghai AI Lab.

Licensing & Compatibility

The README does not explicitly state the license. It mentions building upon Open-R1-Multimodal, vLLM, and trl, and gratitude towards DeepSeek-R1 and Qwen2.5-VL, suggesting potential compatibility with their licenses.

Limitations & Caveats

VLMs appear to struggle with response-length growth and require high-quality multimodal reasoning data, which is scarce. The project notes that overly simple datasets can lead to overfitting. By default, vLLM generation runs on cuda:7, which can block training on systems with fewer GPUs (see the sketch below).
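If fewer than eight GPUs are available, the vLLM device can usually be redirected. The sketch below assumes the training script goes through trl's GRPOConfig (the vllm_device field exists in trl 0.15); how this particular fork wires the setting into its shell scripts may differ.

```python
# Hedged sketch: overriding the dedicated vLLM generation GPU via trl's GRPOConfig.
# Assumes the fork exposes trl 0.15's config; adjust local_scripts/train_qwen2_5_3b.sh accordingly.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/qwen2_5_vl_3b_grpo",
    use_vllm=True,          # route rollout generation through vLLM
    vllm_device="cuda:1",   # override the default dedicated GPU (cuda:7 in the repo's scripts)
    vllm_gpu_memory_utilization=0.7,
)
```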

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days
