One-RL-to-See-Them-All by MiniMax-AI

VLM reinforcement learning framework

created 2 months ago
305 stars

Top 88.8% on sourcepulse

Project Summary

This repository provides V-Triune, a unified Reinforcement Learning (RL) system for training Vision-Language Models (VLMs). It lets a VLM learn visual reasoning and perception tasks jointly within a single training pipeline, with reported gains of up to +14.1% on MEGA-Bench Core. The system is aimed at researchers and engineers working on multimodal AI and VLM development.

How It Works

V-Triune unifies diverse VLM tasks through three core components: sample-level data formatting, verifier-level reward computation, and source-level metric monitoring. It introduces a novel Dynamic IoU reward mechanism for improved stability and performance on perception tasks. This unified RL approach allows a single framework to handle both reasoning (e.g., math, puzzles) and perception (e.g., detection, grounding) tasks simultaneously.
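
The three levels described above can be pictured with a small sketch. This is only an illustration of the idea, not the repository's code: the Sample fields, the VERIFIERS registry, and the score_batch helper are hypothetical names chosen for this example. Each sample names the verifier that should score it, a verifier turns a prediction into a reward, and rewards are aggregated per data source so each source can be monitored separately.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    # Sample-level formatting: each sample carries its own routing info.
    # All field names here are hypothetical.
    source: str    # dataset the sample came from
    verifier: str  # name of the verifier that should score it
    answer: str    # ground-truth answer


def exact_match(pred: str, sample: Sample) -> float:
    """Toy verifier: reward 1.0 for an exact string match, else 0.0."""
    return 1.0 if pred.strip() == sample.answer.strip() else 0.0


# Verifier-level reward computation: a registry keyed by verifier name.
VERIFIERS: Dict[str, Callable[[str, Sample], float]] = {
    "exact_match": exact_match,
}


def score_batch(
    samples: List[Sample], preds: List[str]
) -> Tuple[List[float], Dict[str, float]]:
    """Return per-sample rewards and the mean reward per data source."""
    rewards: List[float] = []
    per_source: Dict[str, List[float]] = defaultdict(list)
    for sample, pred in zip(samples, preds):
        reward = VERIFIERS[sample.verifier](pred, sample)
        rewards.append(reward)
        # Source-level metric monitoring: track rewards by origin dataset.
        per_source[sample.source].append(reward)
    means = {src: sum(vals) / len(vals) for src, vals in per_source.items()}
    return rewards, means
```

Keeping the verifier name on the sample means a mixed batch of reasoning and perception samples can be scored in one pass, with the per-source means showing which dataset a regression comes from.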

Quick Start & Requirements

  • Installation: Clone the repository and install it in editable mode by running pip install -e . from the repository root.
  • Prerequisites: Python 3.12, PyTorch 2.6.0 with CUDA 12.4, FlashAttention 2.7.3, and ninja. Docker is also supported.
  • Data: Download the Orsta-Data-47k dataset using huggingface-cli.
  • Distributed Training: Requires a Ray cluster setup.
  • Reward Server: A separate reward server must be launched using scripts/reward_server.sh.
  • Configuration: Environment variables for Ray cluster, reward server Job ID, data paths, model loading, and training parameters are required.
  • Resources: Training requires significant GPU resources (e.g., 8 GPUs per node recommended) and substantial disk space for the dataset.
  • Docs: Getting Started Guide
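
Because the prerequisites pin exact versions, a quick check before launching training may save time. The sketch below is not part of the repository. It assumes the packages are installed under the distribution names torch and flash-attn, and it treats ninja as present if either a ninja executable or a ninja Python package is found. The helper names are made up for this example.

```python
import shutil
import sys
from importlib import metadata
from typing import Dict, Optional

# Versions taken from the prerequisites list above.
# The distribution names are assumptions.
EXPECTED = {"torch": "2.6.0", "flash-attn": "2.7.3"}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def matches(installed: Optional[str], expected: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu124'."""
    return installed is not None and installed.split("+")[0] == expected


def check_environment() -> Dict[str, bool]:
    """Report which prerequisites look satisfied in this interpreter."""
    report = {"python": sys.version_info[:2] == (3, 12)}
    for dist, expected in EXPECTED.items():
        report[dist] = matches(installed_version(dist), expected)
    report["ninja"] = (
        shutil.which("ninja") is not None
        or installed_version("ninja") is not None
    )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING or wrong version'}")
```

The check does not verify the CUDA version, the Ray cluster, or the reward server. Those still need to be confirmed by following the Getting Started Guide.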

Highlighted Details

  • Achieves up to +14.1% on MEGA-Bench Core across 8 diverse tasks (4 reasoning, 4 perception).
  • Supports Orsta models ranging from 7B to 32B parameters.
  • Features a novel Dynamic IoU reward for adaptive, progressive feedback.
  • Provides public access to the V-Triune system and Orsta model weights.
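
To make the idea of adaptive, progressive feedback from an IoU reward concrete, here is a minimal sketch. It assumes the reward equals the IoU when it clears a threshold and is zero otherwise, and that the threshold tightens as training progresses. The schedule values, the function names, and the (x1, y1, x2, y2) box format are illustrative assumptions, not settings confirmed by this summary.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2), an assumed box format

# Illustrative schedule: (fraction of training completed, IoU threshold).
# These numbers are assumptions for the sketch.
SCHEDULE: Tuple[Tuple[float, float], ...] = (
    (0.10, 0.85),
    (0.25, 0.95),
    (1.00, 0.99),
)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_at(progress: float) -> float:
    """Pick the IoU threshold for a training progress value in [0, 1]."""
    for end, threshold in SCHEDULE:
        if progress <= end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Reward the IoU only if it clears the current threshold."""
    score = iou(pred, target)
    return score if score >= threshold_at(progress) else 0.0
```

With this schedule, a prediction with IoU 0.9 earns a reward of 0.9 early in training but 0.0 later, so loose boxes are tolerated at first and then progressively penalized.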

Maintenance & Community

The project is developed by MiniMax AI, and updates and releases are announced via the repository. No other community channels are listed.

Licensing & Compatibility

The repository and associated models are publicly available for research and development. The README does not specify license terms for commercial use or redistribution.

Limitations & Caveats

The setup requires a distributed Ray cluster and a separate reward server, which adds deployment complexity. The project depends on specific versions of PyTorch and FlashAttention, so the environment must be managed carefully. Training is configured through numerous environment variables, all of which must be set correctly.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 308 stars in the last 90 days
