RL-tuned MoE vision-language model for reasoning tasks
Top 85.2% on sourcepulse
This project introduces Efficient-R1-VLLM, a novel approach to fine-tuning Mixture-of-Experts (MoE) vision-language models (VLLMs) for enhanced multimodal reasoning. Targeting researchers and developers working with VLLMs, it offers improved reasoning capabilities and training efficiency by applying reinforcement learning.
How It Works
Efficient-R1-VLLM applies Group Relative Policy Optimization (GRPO), a PPO-style reinforcement learning algorithm, to fine-tune the DeepSeek2-VL MoE model. The core innovation lies in modifying the training pipeline to require the model to generate an image caption before producing its reasoning output. This strategy, validated by performance improvements on the Qwen-7B-Instruct model, aims to better integrate visual information into the model's reasoning process. Training efficiency is further boosted by leveraging SGLang for accelerated trajectory sampling.
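A caption-before-reasoning constraint like the one described above is typically enforced with a format reward scored per sampled completion during GRPO training. The sketch below is a minimal illustration, not the project's actual implementation: the `<caption>` and `<think>` tag names are assumptions chosen for the example.

```python
import re

# Hypothetical output schema: the tag names are illustrative assumptions,
# not necessarily the delimiters Efficient-R1-VLLM uses.
CAPTION_RE = re.compile(r"<caption>(.+?)</caption>", re.DOTALL)
REASONING_RE = re.compile(r"<think>(.+?)</think>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Score 1.0 only if a non-empty caption block precedes the reasoning block."""
    caption = CAPTION_RE.search(completion)
    reasoning = REASONING_RE.search(completion)
    if caption is None or reasoning is None:
        return 0.0
    # Reward only the caption-first ordering the training pipeline enforces.
    return 1.0 if caption.start() < reasoning.start() else 0.0
```

A per-completion reward function of this shape can be combined with a task-accuracy reward and passed to a GRPO trainer such as trl's `GRPOTrainer`, which accepts custom reward functions.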
Quick Start & Requirements
pip install -r requirements.txt
pip install "sglang[all]"
Maintenance & Community
The project acknowledges contributions from Bai Bizhe, Professor Wenqi Shao, and Qiaosheng Zhang. It builds upon and integrates open-source contributions from vLLM, Open-R1, and trl, and extends gratitude to DeepSeek-R1 and Qwen2.5-VL.
Licensing & Compatibility
The README does not explicitly state the license type or any compatibility notes for commercial use.
Limitations & Caveats
The project states that quick start code will be available soon, suggesting it is still under active development and not yet fully released for easy adoption.