Pixel Reasoner (TIGER-AI-Lab): enables pixel-space reasoning for Vision-Language Models
Top 97.7% on SourcePulse
Pixel Reasoner addresses a key limitation of LLMs: they reason only in text, which hinders visually intensive tasks. It equips VLMs with pixel-space reasoning operations such as zoom-in and frame selection, improving reasoning fidelity on visual tasks. The framework gives researchers and developers more robust visual understanding and achieves state-of-the-art benchmark results.
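To make the idea of pixel-space operations concrete, here is a minimal sketch of the two operations named above. The function names and interfaces (`zoom_in`, `select_frames`, an image as a 2D grid of pixels) are illustrative assumptions, not the project's actual API:

```python
def zoom_in(image, bbox):
    """Crop a region of interest so the model can re-inspect it in detail.

    `image` is a 2D grid (list of rows) of pixels; `bbox` is
    (left, top, right, bottom). Both names are hypothetical stand-ins
    for the repo's real zoom-in operation.
    """
    left, top, right, bottom = bbox
    return [row[left:right] for row in image[top:bottom]]

def select_frames(frames, indices):
    """Keep only the video frames the model judges relevant."""
    return [frames[i] for i in indices]

# Example: crop a 3x3 patch out of an 8x6 grid of (x, y) coordinates.
img = [[(x, y) for x in range(8)] for y in range(6)]
crop = zoom_in(img, (2, 1, 5, 4))
# crop is 3 rows of 3 pixels; its top-left pixel came from (x=2, y=1)
```

The point of such operations is that the model emits them as actions during reasoning, receives the cropped or selected pixels back as new visual input, and continues the chain of thought from there.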
How It Works
The project uses two-phase training: instruction tuning first familiarizes VLMs with the visual operations using synthesized reasoning traces, followed by curiosity-driven reinforcement learning (RL) that balances pixel-space and textual reasoning. Adapted from VL-Rethinker (RL) and Open-R1 (SFT), this methodology lets VLMs proactively gather information from complex visual inputs, significantly improving performance across diverse visual reasoning benchmarks.
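A minimal sketch of how a curiosity bonus can balance pixel-space and textual reasoning: pay a small extra reward for using visual operations, but only while the policy's overall rate of pixel-space reasoning is below a target, so the model neither ignores the operations nor overuses them. The reward shaping, coefficients, and target rate below are illustrative assumptions, not the paper's exact formulation:

```python
def shaped_reward(correct, used_pixel_ops, pixel_op_rate,
                  target_rate=0.3, bonus=0.5):
    """Task reward plus an illustrative curiosity bonus.

    correct:        whether the final answer was right (task reward).
    used_pixel_ops: whether this trace invoked zoom-in / frame selection.
    pixel_op_rate:  the policy's recent rate of pixel-space traces;
                    the bonus pays out only while this is below target,
                    nudging exploration without abandoning text reasoning.
    """
    reward = 1.0 if correct else 0.0
    if used_pixel_ops and pixel_op_rate < target_rate:
        reward += bonus
    return reward

# Early in training (pixel ops still rare), the bonus applies:
print(shaped_reward(True, True, pixel_op_rate=0.1))  # 1.5
# Once pixel ops are routine, only task correctness is rewarded:
print(shaped_reward(True, True, pixel_op_rate=0.6))  # 1.0
```

The design choice is the conditional bonus: an unconditional bonus would reward zooming on every example, including ones solvable from text alone, whereas gating on the rolling rate keeps the incentive active only while the behavior is under-explored.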
Quick Start & Requirements
Clone the repository, enter the relevant subdirectory (instruction_tuning, curiosity_driven_rl), and follow the detailed setup guides.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
- ValueError if total prompt/multimodal tokens exceed the model context length (e.g., 10240), requiring adjustments to token limits or image handling.
- dtype mismatches between transformers and vLLM may necessitate reinstalling specific library versions.
- Set logp_bsz=1, due to model.generate() limitations with batch sizes > 1.
- Image resolution settings (MAX_PIXELS, MIN_PIXELS) and correct environment variable propagation are critical for reproducing results.

Last updated 2 months ago; the repository is currently marked Inactive.
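The context-length and resolution caveats above can be guarded against with a pre-flight check before launching a run. The environment variable names MAX_PIXELS and MIN_PIXELS come from the notes above; the specific values and the token-budget helper below are illustrative assumptions:

```python
import os

# Cap image resolution before launching training/inference so multimodal
# tokens stay within the model context. The exact values here are
# illustrative, not the project's recommended settings.
os.environ["MAX_PIXELS"] = str(1280 * 28 * 28)
os.environ["MIN_PIXELS"] = str(256 * 28 * 28)

def fits_context(prompt_tokens, image_tokens, context_len=10240):
    """Rough pre-flight check mirroring the ValueError condition:
    total prompt + multimodal tokens must fit in the context window."""
    return prompt_tokens + image_tokens <= context_len

# A 2k-token prompt with 8k image tokens fits a 10240 context;
# a 4k-token prompt with the same images does not.
print(fits_context(2000, 8000))  # True
print(fits_context(4000, 8000))  # False
```

Setting the variables in the launching process (rather than only in a sub-shell) matters because, per the notes above, results are sensitive to whether they actually propagate to the worker processes.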