open-r1-multimodal by EvolvingLMMs-Lab

Multimodal training fork for open-r1

Created 6 months ago · 1,349 stars · Top 30.4% on sourcepulse

Project Summary

This repository is a fork of open-r1 that enables multimodal model training, specifically reinforcement learning (RL) with verifiable rewards for multimodal reasoning tasks. It targets researchers and developers interested in advancing multimodal AI capabilities, offering a training framework and initial datasets for training and evaluating models such as Qwen2-VL with GRPO.

How It Works

The project integrates multimodal capabilities into the open-r1 framework, leveraging the GRPO algorithm. It supports various Vision-Language Models (VLMs) available in the Hugging Face transformers library, including Qwen2-VL and Aria-MoE. The core innovation lies in its approach to multimodal RL training, exemplified by the creation of an 8k multimodal RL training dataset focused on math reasoning, generated with GPT-4o and including verifiable answers and reasoning paths.
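The "verifiable answers" approach described above can be sketched as a pair of reward functions: one scoring answer correctness against the ground truth, one scoring adherence to the R1-style output format. This is an illustrative sketch, not the repository's actual implementation; the function names and answer-extraction patterns are assumptions.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the model's final answer matches the verifiable
    ground truth, else 0.0. Looks for an answer inside \\boxed{...}
    or after a trailing 'Answer:' marker (assumed conventions)."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion follows the R1-style
    <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion, re.DOTALL) else 0.0
```

GRPO then normalizes these rewards within each group of sampled completions to form the policy-gradient advantage, which is why cheaply checkable (verifiable) answers matter: the reward needs no learned judge.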

Quick Start & Requirements
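A typical setup for an open-r1-style repository looks like the sketch below. The script path, flags, and dataset name are assumptions, not confirmed by this page; consult the repository README for the actual invocation.

```shell
# Hypothetical quick start -- script names and flags may differ from the repo.
git clone https://github.com/EvolvingLMMs-Lab/open-r1-multimodal.git
cd open-r1-multimodal
pip install -e .

# GRPO training on the multimodal math dataset (assumed paths/arguments)
accelerate launch src/open_r1/grpo.py \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \
    --dataset_name lmms-lab/multimodal-open-r1-8k-verified \
    --output_dir checkpoints/qwen2-vl-2b-grpo
```

Note the caveat under Limitations: one epoch for Qwen2-VL-2B takes roughly 10 hours on 8 H100 GPUs, so budget accordingly.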

Highlighted Details

  • Implements multimodal R1 based on huggingface/open-r1 and deepseek-ai/DeepSeek-R1.
  • Integrates Qwen2-VL series, Aria-MoE, and other VLMs.
  • Open-sourced 8k multimodal RL training examples for math reasoning, generated by GPT-4o.
  • Open-sourced GRPO-trained models: lmms-lab/Qwen2-VL-2B-GRPO-8k and lmms-lab/Qwen2-VL-7B-GRPO-8k.
  • Customizes verification logic for multiple-choice math problems.
  • Demonstrates improved performance in reasoning-based chain-of-thought (CoT) settings compared to base models.
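The customized multiple-choice verification mentioned above can be illustrated with a small checker that tolerates common answer formats ("(B)", "B.", "The answer is B"). This is a hypothetical helper for illustration; the repository's actual verification logic may differ.

```python
import re

def verify_multiple_choice(completion: str, correct_choice: str) -> bool:
    """Return True if the completion's final answer names the correct
    option letter (A-E). Only text after the last 'Answer:' marker is
    considered, if present; otherwise the whole completion is scanned,
    and the last standalone option letter wins."""
    answer_part = completion.rsplit("Answer:", 1)[-1]
    letters = re.findall(r"\b([A-E])\b", answer_part.upper())
    return bool(letters) and letters[-1] == correct_choice.upper()
```

Taking the last matched letter makes the check robust to chain-of-thought traces that mention several options before committing to one.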

Maintenance & Community

  • Community feedback is welcomed to improve understanding of multimodal reasoning models.
  • Plans to upstream changes to open-r1 via pull request for better community support.
  • Discussions on dataset curation and scaling efficiency are ongoing.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying open-r1 and transformers libraries have their own licenses (typically Apache 2.0 or MIT). The datasets and trained models are hosted on Hugging Face, implying their respective licenses apply.

Limitations & Caveats

  • The current framework is not efficient for large-scale training; one epoch for Qwen2-VL-2B takes about 10 hours on 8 H100 GPUs.
  • Initial models may quickly optimize for reward format over accuracy.
  • Evaluation frameworks for visual reasoning tasks are limited in processing step-by-step reasoning traces.
  • Expanding RL datasets beyond math scenarios with verifiable answers requires further exploration.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 119 stars in the last 90 days
