VisionReasoner by dvlab-research

Unified visual perception and reasoning framework

Created 5 months ago
269 stars

Top 95.5% on SourcePulse

View on GitHub
Project Summary

VisionReasoner is a unified framework designed to advance Vision-Language Models (VLMs) by enabling a single model to perform diverse visual perception tasks, including detection, segmentation, counting, and visual question answering (VQA). It targets researchers and developers looking to push VLM capabilities beyond traditional captioning and QA, offering a versatile assistant for a wide range of visual understanding challenges.

How It Works

VisionReasoner employs a unified architecture that integrates a reasoning module for object localization and a segmentation module for mask generation. A key component is the task router, which categorizes diverse vision tasks into four fundamental types: detection, segmentation, counting, and VQA. This approach, combined with carefully crafted rewards and a specific training strategy, allows the model to achieve strong multi-task performance within a single, cohesive framework, outperforming specialized baseline models across ten different visual perception tasks.
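The routing idea can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the project's API: the actual TaskRouter-1.5B is a trained model, and the class, prompt, and method names below (`route_task`, `router_model.generate`) are hypothetical.

```python
from enum import Enum

class TaskType(Enum):
    # The four fundamental task types described in the paper/README.
    DETECTION = "detection"
    SEGMENTATION = "segmentation"
    COUNTING = "counting"
    VQA = "vqa"

def route_task(query: str, router_model) -> TaskType:
    """Hypothetical wrapper: ask a small routing model which of the
    four fundamental task types a user query belongs to."""
    prompt = (
        "Classify the request into one of: detection, segmentation, "
        f"counting, vqa.\nRequest: {query}\nAnswer:"
    )
    label = router_model.generate(prompt).strip().lower()
    return TaskType(label)

# Downstream, the unified model dispatches accordingly: the reasoning
# module localizes objects for detection/counting, and the segmentation
# module produces masks when pixel-level output is requested.
```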

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n visionreasoner_test python=3.12), activate it, and install dependencies (pip3 install torch torchvision, pip install -r requirements.txt).
  • Prerequisites: Python 3.12, PyTorch, Hugging Face models (VisionReasoner-7B, TaskRouter-1.5B). GPU is recommended for efficient inference and training.
  • Inference: Download models using git lfs and run inference with python vision_reasoner/inference.py (a model-download sketch follows this list).
  • Docs/Demos: Links to Hugging Face models for VisionReasoner and TaskRouter are provided.
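As an alternative to git lfs, the pretrained weights can typically be fetched with the huggingface_hub client. This is a sketch only; the repository IDs and local directories below are placeholders, not paths confirmed by the README, and should be replaced with the Hugging Face links the project provides.

```python
from huggingface_hub import snapshot_download

# Placeholder repo IDs -- substitute the Hugging Face paths linked in the README.
snapshot_download(
    repo_id="<org>/VisionReasoner-7B",
    local_dir="pretrained_models/VisionReasoner-7B",
)
snapshot_download(
    repo_id="<org>/TaskRouter-1.5B",
    local_dir="pretrained_models/TaskRouter-1.5B",
)
```

With the weights in place, inference is run as stated in the README via python vision_reasoner/inference.py.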

Highlighted Details

  • Achieves superior performance across ten diverse visual perception tasks within a single unified framework.
  • Supports a hybrid mode that intelligently switches between direct detection (YOLO-World) and reasoning-based approaches for optimized response times (see the sketch after this list).
  • Incorporates image generation capabilities using models like gpt-image-1.
  • Provides evaluation scripts for segmentation, detection, and counting tasks on various datasets (COCO, RefCOCO, etc.).
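The hybrid mode can be thought of as a fast-path/slow-path dispatch. The sketch below is illustrative only; the function names, the `needs_reasoning` predicate, and the `.detect` interfaces are assumptions rather than the project's actual code.

```python
def hybrid_detect(image, query, yolo_world, vision_reasoner, needs_reasoning):
    """Illustrative dispatch: plain object names go straight to the fast
    YOLO-World detector; queries that require reasoning (relations,
    attributes, implicit references) fall back to the unified model."""
    if needs_reasoning(query):  # e.g. a lightweight classifier or heuristic
        return vision_reasoner.detect(image, query)
    return yolo_world.detect(image, classes=[query])
```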

Maintenance & Community

The project is associated with dvlab-research and builds upon prior work including Seg-Zero, EasyR1, and veRL. It utilizes models from Qwen2-VL, Qwen2.5-VL, SAM2, and YOLO-World. The primary citations are the VisionReasoner paper (arXiv:2505.12081) and Seg-Zero (arXiv:2503.06520).

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms of the project and its underlying models.

Limitations & Caveats

The README notes that bugs might arise from API version mismatches, requiring users to debug and customize based on their specific API keys and versions. The project is presented as a research advancement, and its readiness for production environments or extensive commercial use is not detailed.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 21 stars in the last 30 days
