RL framework for multimodal reasoning in 3B LMMs
This repository provides LMM-R1, a framework for enhancing the reasoning abilities of small (3B-parameter) Large Multimodal Models (LMMs). It addresses the weakness of small models on complex reasoning tasks and the scarcity of high-quality multimodal reasoning data through a two-stage, rule-based Reinforcement Learning (RL) approach. The target audience is researchers and developers working with LMMs who need to improve reasoning capabilities, particularly in multimodal contexts.
How It Works
LMM-R1 uses a two-stage RL framework: Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). FRE leverages text-only data to build a strong reasoning foundation, and MGT then extends these capabilities to multimodal inputs. This staged approach works around the scarcity of multimodal reasoning data and is intended to improve performance on diverse reasoning tasks more robustly and scalably than training directly on multimodal data.
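To make the staging concrete, the sketch below shows how the two stages might be chained as sequential RL runs, with the FRE checkpoint used to initialize MGT. The entry point, model name, dataset paths, and flags are illustrative assumptions in the style of an OpenRLHF-style launcher, not the repository's actual CLI; consult the example scripts in the repo for real usage.

# Hypothetical two-stage launch (names and flags are placeholders).
# Stage 1: Foundational Reasoning Enhancement (FRE) on text-only reasoning data
python -m openrlhf.cli.train_ppo_ray \
  --pretrain Qwen/Qwen2.5-VL-3B-Instruct \
  --prompt_data ./data/text_reasoning_prompts.jsonl \
  --save_path ./ckpt/fre_stage

# Stage 2: Multimodal Generalization Training (MGT), initialized from the FRE checkpoint
python -m openrlhf.cli.train_ppo_ray \
  --pretrain ./ckpt/fre_stage \
  --prompt_data ./data/multimodal_reasoning_prompts.jsonl \
  --save_path ./ckpt/mgt_stage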
Quick Start & Requirements
git clone https://github.com/TideDra/lmm-r1.git
cd lmm-r1
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
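After installation, a quick sanity check is to confirm the core packages import cleanly. The module names below assume the editable install exposes the standard OpenRLHF package and that flash-attn and vLLM built successfully; adjust them if the fork installs under different names.

# Hypothetical post-install check (module names are assumptions)
python -c "import openrlhf, flash_attn, vllm; print('environment OK')"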
Highlighted Details
Maintenance & Community
The codebase has been merged into OpenRLHF-M, the official multimodal RL infrastructure from OpenRLHF. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository is released under the Apache 2.0 license, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The project is presented as a reproduction of DeepSeek-R1 and has been merged into OpenRLHF-M. While it supports various LMMs, the primary focus is on enhancing reasoning for smaller 3B models.