Pixel-Reasoner by TIGER-AI-Lab

Enables pixel-space reasoning for Vision-Language Models

Created 7 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

Pixel Reasoner addresses a key limitation of vision-language models: reasoning confined to text, which hinders performance on visually demanding tasks. It equips VLMs with pixel-space reasoning through operations such as zoom-in and frame selection, improving reasoning fidelity on visual tasks. The framework benefits researchers and developers by enabling more robust visual understanding and achieving state-of-the-art benchmark results.
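
As a rough illustration of what a pixel-space operation looks like, the sketch below crops and upscales a region of interest in the spirit of a zoom-in. The function and argument names are illustrative assumptions, not the repo's actual tool interface.

```python
# Hedged sketch of a zoom-in style visual operation: crop a region of
# interest and upscale it so fine detail becomes easier to inspect.
# Names and defaults are illustrative, not the project's tool API.
from PIL import Image

def zoom_in(image: Image.Image,
            bbox: tuple[int, int, int, int],
            upscale: int = 2) -> Image.Image:
    """Crop the bounding box (left, top, right, bottom) and upscale it."""
    region = image.crop(bbox)
    return region.resize((region.width * upscale, region.height * upscale))

# Example: zoom into the top-left quadrant of a chart image.
# detail = zoom_in(Image.open("chart.png"), (0, 0, 400, 300))
```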

How It Works

The project uses two-phase training: instruction tuning first familiarizes VLMs with visual operations using synthesized reasoning traces, followed by curiosity-driven reinforcement learning (RL) that balances pixel-space and textual reasoning. The RL stage is adapted from VL-Rethinker and the SFT stage from Open-R1. This methodology lets VLMs proactively gather information from complex visual inputs, significantly improving performance across diverse visual reasoning benchmarks.
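
To make the curiosity-driven idea concrete, the sketch below shows one plausible reward-shaping term that pays a bonus when a rollout invokes visual operations while their batch-level usage rate is still low. The function name, threshold, and bonus value are illustrative assumptions, not the formulation implemented in curiosity_driven_rl.

```python
# Illustrative sketch of a curiosity-style reward shaping term.
# All names and constants are hypothetical; see curiosity_driven_rl
# for the project's actual reward definition.

def shaped_reward(is_correct: bool,
                  used_pixel_ops: bool,
                  batch_pixel_op_rate: float,
                  target_rate: float = 0.3,
                  bonus: float = 0.5) -> float:
    """Correctness reward plus a bonus that nudges the policy toward
    pixel-space operations while their batch-level usage rate is low."""
    reward = 1.0 if is_correct else 0.0
    # Curiosity bonus: paid only when this rollout actually used a visual
    # operation (zoom-in / frame selection) and the batch underuses them.
    if used_pixel_ops and batch_pixel_op_rate < target_rate:
        reward += bonus
    return reward
```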

Quick Start & Requirements

  • Installation: Navigate to the instruction_tuning and curiosity_driven_rl directories and follow their detailed setup guides.
  • Prerequisites: GPU/CUDA support is implied for RL training and vLLM inference. Dependencies include Hugging Face libraries; specific Python versions are not stated.
  • Models & Data: Pre-trained models and datasets are available on Hugging Face (see the loading sketch after this list).
  • Resources: Multinode training commands suggest substantial computational resources are necessary.
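
As a starting point, the snippet below sketches loading a released checkpoint with Hugging Face transformers. The repo ID is hypothetical and the Qwen2.5-VL base class is an assumption; consult the project's Hugging Face page and setup guides for the exact names and versions.

```python
# Hedged sketch: loading a released checkpoint with Hugging Face transformers.
# The repo ID and the Qwen2.5-VL base class are assumptions; check the
# project's Hugging Face page for the exact names.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "TIGER-Lab/PixelReasoner-RL-v1"  # hypothetical repo ID

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep dtype consistent with vLLM (see caveats)
    device_map="auto",           # assumes a CUDA-capable GPU and accelerate
)
```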

Highlighted Details

  • Its 7B model achieves state-of-the-art accuracy on visual reasoning benchmarks: 84% (V* bench), 74% (TallyQA-Complex), and 84% (InfographicsVQA).
  • Supports instruction tuning and RL training with multi-turn trajectories.
  • Handles mixed video and image data for RL training.
  • Leverages vLLM for efficient inference and evaluation (see the inference sketch after this list).
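
The sketch below illustrates evaluation-style inference through vLLM's offline API, assuming a Qwen2.5-VL-style checkpoint and a hypothetical repo ID; the repo's own evaluation scripts define the real prompts and visual-operation tools.

```python
# Hedged sketch of evaluation-style inference with vLLM's offline API.
# Repo ID is hypothetical; chat-template usage assumes a Qwen2.5-VL-style model.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_id = "TIGER-Lab/PixelReasoner-RL-v1"  # hypothetical repo ID

# Build a prompt containing the model's own image placeholder tokens.
processor = AutoProcessor.from_pretrained(model_id)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many bars in this chart exceed 50?"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id, dtype="bfloat16", limit_mm_per_prompt={"image": 1})
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("chart.png")}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```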

Licensing & Compatibility

  • License: No open-source license is stated in the README.
  • Compatibility: Integrates components adapted from Open-R1 and VL-Rethinker; setup instructions are provided for vLLM and Hugging Face transformers.

Limitations & Caveats

  • Context Length: A ValueError is raised when the combined prompt and multimodal tokens exceed the model's context length (e.g., 10240), requiring adjustments to token limits or image handling.
  • Dependency Versions: dtype mismatches between transformers and vLLM may require reinstalling specific library versions.
  • Logprobs Calculation: Requires logp_bsz=1 due to model.generate() limitations with batch sizes greater than 1.
  • Reproducibility: Precise pixel-token settings (MAX_PIXELS, MIN_PIXELS) and correct environment-variable propagation are critical for reproducing results (a sketch of setting these variables follows this list).
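
A minimal sketch of pinning the pixel-budget variables before launching a run is shown below; the values are illustrative defaults for Qwen-VL-style processors, not the settings required to reproduce the reported numbers.

```python
# Hedged sketch: set the pixel-budget environment variables before any
# training/evaluation process starts. Values are illustrative; use the
# repo's documented settings to reproduce reported results.
import os

os.environ["MIN_PIXELS"] = str(256 * 28 * 28)   # lower bound on image pixels
os.environ["MAX_PIXELS"] = str(1280 * 28 * 28)  # upper bound on image pixels

# These must be visible in every worker process (e.g., exported in the
# multinode launch script), or images will be resized inconsistently.
```
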
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 6 stars in the last 30 days
