VAPO by xytian1008

Advancing multimodal reasoning in vision-language models

Created 9 months ago

387 stars

Top 73.6% on SourcePulse

Project Summary

This repository provides the official implementation of "More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models" (ICLR 2026). The paper examines the trade-off between reasoning depth and perceptual accuracy in VLMs and proposes Vision-Anchored Policy Optimization (VAPO), a method that grounds reasoning in visual perception. The authors report new state-of-the-art results and more effective use of extended reasoning. The project is aimed at researchers and engineers working on VLM reasoning.

How It Works

VAPO is a policy gradient algorithm designed as a multimodal replacement for GRPO. Its core innovation lies in embedding "visual anchors" throughout the reasoning process. At each anchor, the model's perceptual grounding is evaluated via primitive visual claims. This approach introduces a "perception reward" alongside standard outcome rewards, explicitly steering reasoning towards visually grounded trajectories. This methodology combats visual forgetting and mitigates the degradation of perceptual accuracy often associated with prolonged reasoning.
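The reward structure described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the additive combination, the `perception_weight` value, and the choice to score perception as the fraction of correctly judged visual claims are all assumptions made for illustration. The group-relative normalization follows the usual GRPO pattern of standardizing rewards within a group of sampled responses.

```python
from statistics import mean, pstdev


def perception_reward(claims_judged_correctly):
    """Fraction of primitive visual claims the model judged correctly.

    Assumption: claims from all anchors are pooled and averaged.
    """
    if not claims_judged_correctly:
        return 0.0
    return sum(claims_judged_correctly) / len(claims_judged_correctly)


def total_reward(outcome_correct, claims_judged_correctly, perception_weight=0.5):
    """Outcome reward plus a weighted perception reward.

    Assumption: the two terms are added; perception_weight is hypothetical.
    """
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + perception_weight * perception_reward(claims_judged_correctly)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt (made-up values).
group = [
    total_reward(True, [True, True, False, True]),
    total_reward(False, [False, False]),
    total_reward(True, []),
    total_reward(False, [True, True]),
]
advantages = group_relative_advantages(group)
```

Under these assumptions, a response that answers correctly and also keeps its visual claims right receives a larger advantage than one that answers correctly without perceptual grounding, which is the steering effect the description refers to.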

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/xytian1008/VAPO.git), navigate to the directory (cd VAPO), and install in editable mode (pip install -e .).
Prerequisites: Python ≥ 3.9, CUDA-compatible GPUs, PyTorch, vLLM ≥ 0.8.0, and Transformers ≥ 4.51.0.
Dataset: Training data is available on Hugging Face: xytian1008/VAPO-Thinker-train36k and xytian1008/VAPO-Thinker-val1k.
Pretrained Models: VAPO-Thinker checkpoints (7B and 3B) based on Qwen2.5-VL are released on Hugging Face.
Links: Project Page: https://xytian1008.github.io/VAPO/, arXiv: https://arxiv.org/abs/2509.25848, GitHub: https://github.com/xytian1008/VAPO.
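Before training, it can be handy to confirm that the environment meets the listed prerequisites. The following is a small sketch using only the Python standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical and not part of the repository, and the check covers only the version floors stated above, not GPU availability.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the prerequisites list; None means "any version".
MINIMUMS = {"torch": None, "vllm": "0.8.0", "transformers": "4.51.0"}


def version_tuple(text):
    """Turn '4.51.0' or '2.5.1+cu121' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed, required):
    """Compare numerically, padding with zeros so '0.8' equals '0.8.0'."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python >= 3.9 is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if minimum is not None and not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Versions are compared as integer tuples rather than strings so that, for example, 0.10.1 is correctly treated as newer than 0.8.0.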

Highlighted Details

Key Finding: Extended reasoning does not guarantee improved accuracy; early reasoning stages boost performance, but later stages saturate or degrade it.
Perceptual Degradation: Longer "thinking" correlates with more perception errors, in which visual details are misinterpreted.
Task Sensitivity: The negative impact of reasoning on perception is most pronounced in vision-heavy tasks.
Mitigation: Encouraging models to attend more frequently to visual input effectively raises the upper bound of reasoning performance.
Performance: VAPO achieves new state-of-the-art results, improving accuracy by 2.0 percentage points on math problems (49.1% → 51.1%) and 3.2 percentage points on general tasks (59.9% → 63.1%).
Visual Grounding: VAPO demonstrates a gentler decline in visual attention ratio, strengthening visual cue contributions and leading to steadily increasing accuracy.
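The summary does not define the visual attention ratio precisely. One plausible reading, sketched below purely as an assumption, is the share of a generated token's attention mass that falls on image tokens, tracked across generation steps. The function names and the toy numbers are illustrative and do not come from the paper or the repository.

```python
def visual_attention_ratio(attention_row, is_visual):
    """Share of one step's attention mass that lands on visual tokens.

    attention_row: non-negative attention weights over the context tokens.
    is_visual: booleans marking which context tokens are image tokens.
    """
    total = sum(attention_row)
    if total == 0:
        return 0.0
    visual = sum(w for w, v in zip(attention_row, is_visual) if v)
    return visual / total


def ratio_per_step(attention_rows, is_visual):
    """Ratio for each generation step, to see how it changes over reasoning."""
    return [visual_attention_ratio(row, is_visual) for row in attention_rows]


# Toy example: two image tokens followed by two text tokens.
is_visual = [True, True, False, False]
steps = [
    [0.4, 0.3, 0.2, 0.1],  # early step: most attention on the image
    [0.2, 0.2, 0.3, 0.3],  # later step
    [0.1, 0.1, 0.4, 0.4],  # late step: attention has drifted to text
]
ratios = ratio_per_step(steps, is_visual)
```

In this toy trace the ratio falls from step to step, which is the kind of decline the "visual forgetting" finding describes; a gentler decline would show up as a flatter sequence of ratios.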

Maintenance & Community

The project relies on established open-source frameworks like Easy-R1, Verl, and VLMEvalKit for training and evaluation. Computational resources were provided by Lambda GPU Cloud and Maincode. No specific community channels (e.g., Discord, Slack) or roadmap links are detailed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The research highlights a fundamental trade-off where excessive reasoning can impair perceptual accuracy, particularly in visually demanding tasks. While VAPO aims to mitigate this by enhancing visual grounding, the inherent challenge persists. The reported performance gains are based on specific benchmarks and model scales, and broader applicability may require further validation.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days