VLM reinforcement learning framework
This repository provides V-Triune, a unified Reinforcement Learning (RL) system for advancing Vision-Language Models (VLMs). It enables VLMs to jointly master visual reasoning and perception tasks within a single training pipeline, offering significant performance gains on benchmarks like MEGA-Bench Core. The system is designed for researchers and engineers working on multimodal AI and VLM development.
How It Works
V-Triune unifies diverse VLM tasks through three core components: sample-level data formatting, verifier-level reward computation, and source-level metric monitoring. It introduces a novel Dynamic IoU reward mechanism for improved stability and performance on perception tasks. This unified RL approach allows a single framework to handle both reasoning (e.g., math, puzzles) and perception (e.g., detection, grounding) tasks simultaneously.
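To make the Dynamic IoU idea concrete, here is a minimal sketch of a verifier-style reward for detection outputs. The threshold schedule values and the binary pass/fail rule are illustrative assumptions, not the project's exact configuration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step,
                       schedule=((0, 0.5), (1000, 0.75), (2000, 0.95))):
    """Binary reward: 1.0 if IoU clears the current threshold, else 0.0.

    The threshold tightens as training progresses (schedule is a
    hypothetical example), easing early exploration while demanding
    precise localization later -- the intuition behind Dynamic IoU.
    """
    threshold = max(t for s, t in schedule if step >= s)
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0
```

Early in training a loose overlap already earns reward; the same prediction may score zero once the threshold has been raised.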
Quick Start & Requirements
Install the package from source with pip install -e . (compiling native dependencies may require ninja). Docker is also supported. Model weights can be fetched with huggingface-cli, and the reward server is launched separately via scripts/reward_server.sh.
Maintenance & Community
The project is actively developed by MiniMax AI. Updates and releases are announced via the repository. Further community engagement channels are not explicitly listed.
Licensing & Compatibility
The repository and associated models are publicly available, encouraging research and development. Specific licensing details for commercial use or redistribution are not detailed in the README.
Limitations & Caveats
The setup requires a distributed Ray cluster and a separate reward server, adding complexity to deployment. The project relies on specific versions of PyTorch and FlashAttention, which may require careful environment management. Training configuration involves numerous environment variables that must be correctly set.
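Since training spans a Ray cluster plus a standalone reward server, a minimal deployment sketch might look like the following. The Ray commands are standard CLI usage; the reward-server script path comes from the repository, while its flags and the project's training environment variables are project-specific and not shown.

```shell
# On the head node: start the Ray head process (default GCS port 6379).
ray start --head --port=6379

# On each worker node: join the cluster (HEAD_IP is the head node's address).
ray start --address=HEAD_IP:6379

# Launch the separate reward server (script name from the repo;
# required environment variables are documented by the project).
bash scripts/reward_server.sh

# Verify all nodes are connected before launching training.
ray status
```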