Pixel-Reasoner by TIGER-AI-Lab

Enables pixel-space reasoning for Vision-Language Models

Created 7 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

Pixel Reasoner addresses a key limitation of vision-language models: reasoning confined to text, which hinders performance on visually demanding tasks. It equips VLMs with pixel-space reasoning through operations such as zoom-in and frame selection, improving reasoning fidelity on visual tasks. The framework benefits researchers and developers by enabling more robust visual understanding and achieving state-of-the-art benchmark results.
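
As a rough illustration of what a pixel-space operation looks like, the sketch below crops and upscales a region of interest in the spirit of a zoom-in. The function and argument names are illustrative assumptions, not the repo's actual tool interface.

```python
# Hedged sketch of a zoom-in style visual operation: crop a region of
# interest and upscale it so fine detail becomes easier to inspect.
# Names and defaults are illustrative, not the project's tool API.
from PIL import Image

def zoom_in(image: Image.Image,
            bbox: tuple[int, int, int, int],
            upscale: int = 2) -> Image.Image:
    """Crop the bounding box (left, top, right, bottom) and upscale it."""
    region = image.crop(bbox)
    return region.resize((region.width * upscale, region.height * upscale))

# Example: zoom into the top-left quadrant of a chart image.
# detail = zoom_in(Image.open("chart.png"), (0, 0, 400, 300))
```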

How It Works

The project uses two-phase training: instruction tuning first familiarizes VLMs with visual operations using synthesized reasoning traces, followed by curiosity-driven reinforcement learning (RL) that balances pixel-space and textual reasoning. The RL stage is adapted from VL-Rethinker and the SFT stage from Open-R1. This methodology lets VLMs proactively gather information from complex visual inputs, significantly improving performance across diverse visual reasoning benchmarks.
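
To make the curiosity-driven idea concrete, the sketch below shows one plausible reward-shaping term that pays a bonus when a rollout invokes visual operations while their batch-level usage rate is still low. The function name, threshold, and bonus value are illustrative assumptions, not the formulation implemented in curiosity_driven_rl.

```python
# Illustrative sketch of a curiosity-style reward shaping term.
# All names and constants are hypothetical; see curiosity_driven_rl
# for the project's actual reward definition.

def shaped_reward(is_correct: bool,
                  used_pixel_ops: bool,
                  batch_pixel_op_rate: float,
                  target_rate: float = 0.3,
                  bonus: float = 0.5) -> float:
    """Correctness reward plus a bonus that nudges the policy toward
    pixel-space operations while their batch-level usage rate is low."""
    reward = 1.0 if is_correct else 0.0
    # Curiosity bonus: paid only when this rollout actually used a visual
    # operation (zoom-in / frame selection) and the batch underuses them.
    if used_pixel_ops and batch_pixel_op_rate < target_rate:
        reward += bonus
    return reward
```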

Quick Start & Requirements

  • Installation: Navigate to the instruction_tuning and curiosity_driven_rl directories and follow their detailed setup guides.
  • Prerequisites: GPU/CUDA support is implied for RL training and vLLM inference. Dependencies include Hugging Face libraries; specific Python versions are not stated.
  • Models & Data: Pre-trained models and datasets are available on Hugging Face (see the loading sketch after this list).
  • Resources: Multinode training commands suggest substantial computational resources are necessary.
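
As a starting point, the snippet below sketches loading a released checkpoint with Hugging Face transformers. The repo ID is hypothetical and the Qwen2.5-VL base class is an assumption; consult the project's Hugging Face page and setup guides for the exact names and versions.

```python
# Hedged sketch: loading a released checkpoint with Hugging Face transformers.
# The repo ID and the Qwen2.5-VL base class are assumptions; check the
# project's Hugging Face page for the exact names.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "TIGER-Lab/PixelReasoner-RL-v1"  # hypothetical repo ID

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep dtype consistent with vLLM (see caveats)
    device_map="auto",           # assumes a CUDA-capable GPU and accelerate
)
```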

Highlighted Details

  • Its 7B model achieves state-of-the-art accuracy on visual reasoning benchmarks: 84% (V* bench), 74% (TallyQA-Complex), and 84% (InfographicsVQA).
  • Supports instruction tuning and RL training with multi-turn trajectories.
  • Handles mixed video and image data for RL training.
  • Leverages vLLM for efficient inference and evaluation (see the inference sketch after this list).
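
The sketch below illustrates evaluation-style inference through vLLM's offline API, assuming a Qwen2.5-VL-style checkpoint and a hypothetical repo ID; the repo's own evaluation scripts define the real prompts and visual-operation tools.

```python
# Hedged sketch of evaluation-style inference with vLLM's offline API.
# Repo ID is hypothetical; chat-template usage assumes a Qwen2.5-VL-style model.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_id = "TIGER-Lab/PixelReasoner-RL-v1"  # hypothetical repo ID

# Build a prompt containing the model's own image placeholder tokens.
processor = AutoProcessor.from_pretrained(model_id)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many bars in this chart exceed 50?"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id, dtype="bfloat16", limit_mm_per_prompt={"image": 1})
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("chart.png")}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```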

Licensing & Compatibility

  • License: No open-source license is stated in the README.
  • Compatibility: Integrates components adapted from Open-R1 and VL-Rethinker; setup instructions are provided for vLLM and Hugging Face transformers.

Limitations & Caveats

  • Context Length: A ValueError is raised when the combined prompt and multimodal tokens exceed the model's context length (e.g., 10240), requiring adjustments to token limits or image handling.
  • Dependency Versions: dtype mismatches between transformers and vLLM may require reinstalling specific library versions.
  • Logprobs Calculation: Requires logp_bsz=1 due to model.generate() limitations with batch sizes greater than 1.
  • Reproducibility: Precise pixel-token settings (MAX_PIXELS, MIN_PIXELS) and correct environment-variable propagation are critical for reproducing results (a sketch of setting these variables follows this list).
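
A minimal sketch of pinning the pixel-budget variables before launching a run is shown below; the values are illustrative defaults for Qwen-VL-style processors, not the settings required to reproduce the reported numbers.

```python
# Hedged sketch: set the pixel-budget environment variables before any
# training/evaluation process starts. Values are illustrative; use the
# repo's documented settings to reproduce reported results.
import os

os.environ["MIN_PIXELS"] = str(256 * 28 * 28)   # lower bound on image pixels
os.environ["MAX_PIXELS"] = str(1280 * 28 * 28)  # upper bound on image pixels

# These must be visible in every worker process (e.g., exported in the
# multinode launch script), or images will be resized inconsistently.
```
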
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 6 stars in the last 30 days
