xytian1008/VAPO: Advancing multimodal reasoning in vision-language models
Top 79.5% on SourcePulse
Summary
This repository provides the official implementation for "More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models" (ICLR 2026). It addresses the trade-off between reasoning depth and perceptual accuracy in vision-language models (VLMs) and introduces Vision-Anchored Policy Optimization (VAPO), a method that aims to improve VLM reasoning by grounding it in visual perception. The authors report new state-of-the-art results and more effective use of the model's reasoning capabilities. The repository is aimed at researchers and engineers working with VLMs.
How It Works
VAPO is a policy gradient algorithm designed as a multimodal replacement for GRPO. Its core innovation lies in embedding "visual anchors" throughout the reasoning process. At each anchor, the model's perceptual grounding is evaluated via primitive visual claims. This approach introduces a "perception reward" alongside standard outcome rewards, explicitly steering reasoning towards visually grounded trajectories. This methodology combats visual forgetting and mitigates the degradation of perceptual accuracy often associated with prolonged reasoning.
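The reward shaping described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the averaging of claim judgments, and the `weight` parameter are all assumptions made for illustration. It scores each sampled response by its outcome reward plus a weighted perception reward, then normalizes within the group the way GRPO-style methods compute advantages.

```python
from statistics import mean, pstdev


def perception_reward(anchor_judgments):
    """Fraction of primitive visual claims judged correctly over all anchors.

    `anchor_judgments` is a list of anchors, each a list of booleans
    (True = the model judged that visual claim correctly).
    Hypothetical scoring rule, assumed for illustration only.
    """
    flat = [ok for anchor in anchor_judgments for ok in anchor]
    return sum(flat) / len(flat) if flat else 0.0


def combined_rewards(outcomes, anchors, weight=0.5):
    """Outcome reward plus a weighted perception reward per response.

    `weight` is an assumed hyperparameter, not a value from the paper.
    """
    return [o + weight * perception_reward(a) for o, a in zip(outcomes, anchors)]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two correct answers, two incorrect.
outcomes = [1.0, 1.0, 0.0, 0.0]
# Two anchors per response, two visual claims per anchor.
anchors = [
    [[True, True], [True, True]],      # correct and fully grounded
    [[True, False], [False, False]],   # correct but poorly grounded
    [[True, True], [True, False]],     # incorrect but mostly grounded
    [[False, False], [False, False]],  # incorrect and ungrounded
]

rewards = combined_rewards(outcomes, anchors)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the two responses with the same correct answer receive different advantages because one stays visually grounded and the other does not, which is the steering effect the perception reward is meant to provide.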
Quick Start & Requirements
Installation: clone the repository (git clone https://github.com/xytian1008/VAPO.git), navigate to the directory (cd VAPO), and install using pip install -e .
Datasets: xytian1008/VAPO-Thinker-train36k and xytian1008/VAPO-Thinker-val1k.
Highlighted Details
Maintenance & Community
The project relies on established open-source frameworks like Easy-R1, Verl, and VLMEvalKit for training and evaluation. Computational resources were provided by Lambda GPU Cloud and Maincode. No specific community channels (e.g., Discord, Slack) or roadmap links are detailed in the README.
Licensing & Compatibility
This project is licensed under the MIT License, which is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The research highlights a fundamental trade-off where excessive reasoning can impair perceptual accuracy, particularly in visually demanding tasks. While VAPO aims to mitigate this by enhancing visual grounding, the inherent challenge persists. The reported performance gains are based on specific benchmarks and model scales, and broader applicability may require further validation.