VisionReasoner by dvlab-research

Unified visual perception and reasoning framework

Created 5 months ago
269 stars

Top 95.5% on SourcePulse

View on GitHub
Project Summary

VisionReasoner is a unified framework designed to advance Vision-Language Models (VLMs) by enabling a single model to perform diverse visual perception tasks, including detection, segmentation, counting, and visual question answering (VQA). It targets researchers and developers looking to push VLM capabilities beyond traditional captioning and QA, offering a versatile assistant for a wide range of visual understanding challenges.

How It Works

VisionReasoner employs a unified architecture that integrates a reasoning module for object localization and a segmentation module for mask generation. A key component is the task router, which categorizes diverse vision tasks into four fundamental types: detection, segmentation, counting, and VQA. This approach, combined with carefully crafted rewards and a specific training strategy, allows the model to achieve strong multi-task performance within a single, cohesive framework, outperforming specialized baseline models across ten different visual perception tasks.
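The routing idea can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the project's API: the actual TaskRouter-1.5B is a trained model, and the class, prompt, and method names below (`route_task`, `router_model.generate`) are hypothetical.

```python
from enum import Enum

class TaskType(Enum):
    # The four fundamental task types described in the paper/README.
    DETECTION = "detection"
    SEGMENTATION = "segmentation"
    COUNTING = "counting"
    VQA = "vqa"

def route_task(query: str, router_model) -> TaskType:
    """Hypothetical wrapper: ask a small routing model which of the
    four fundamental task types a user query belongs to."""
    prompt = (
        "Classify the request into one of: detection, segmentation, "
        f"counting, vqa.\nRequest: {query}\nAnswer:"
    )
    label = router_model.generate(prompt).strip().lower()
    return TaskType(label)

# Downstream, the unified model dispatches accordingly: the reasoning
# module localizes objects for detection/counting, and the segmentation
# module produces masks when pixel-level output is requested.
```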

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n visionreasoner_test python=3.12), activate it, and install dependencies (pip3 install torch torchvision, pip install -r requirements.txt).
  • Prerequisites: Python 3.12, PyTorch, Hugging Face models (VisionReasoner-7B, TaskRouter-1.5B). GPU is recommended for efficient inference and training.
  • Inference: Download models using git lfs and run inference with python vision_reasoner/inference.py (a model-download sketch follows this list).
  • Docs/Demos: Links to Hugging Face models for VisionReasoner and TaskRouter are provided.
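As an alternative to git lfs, the pretrained weights can typically be fetched with the huggingface_hub client. This is a sketch only; the repository IDs and local directories below are placeholders, not paths confirmed by the README, and should be replaced with the Hugging Face links the project provides.

```python
from huggingface_hub import snapshot_download

# Placeholder repo IDs -- substitute the Hugging Face paths linked in the README.
snapshot_download(
    repo_id="<org>/VisionReasoner-7B",
    local_dir="pretrained_models/VisionReasoner-7B",
)
snapshot_download(
    repo_id="<org>/TaskRouter-1.5B",
    local_dir="pretrained_models/TaskRouter-1.5B",
)
```

With the weights in place, inference is run as stated in the README via python vision_reasoner/inference.py.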

Highlighted Details

  • Achieves superior performance across ten diverse visual perception tasks within a single unified framework.
  • Supports a hybrid mode that intelligently switches between direct detection (YOLO-World) and reasoning-based approaches for optimized response times (see the sketch after this list).
  • Incorporates image generation capabilities using models like gpt-image-1.
  • Provides evaluation scripts for segmentation, detection, and counting tasks on various datasets (COCO, RefCOCO, etc.).
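The hybrid mode can be thought of as a fast-path/slow-path dispatch. The sketch below is illustrative only; the function names, the `needs_reasoning` predicate, and the `.detect` interfaces are assumptions rather than the project's actual code.

```python
def hybrid_detect(image, query, yolo_world, vision_reasoner, needs_reasoning):
    """Illustrative dispatch: plain object names go straight to the fast
    YOLO-World detector; queries that require reasoning (relations,
    attributes, implicit references) fall back to the unified model."""
    if needs_reasoning(query):  # e.g. a lightweight classifier or heuristic
        return vision_reasoner.detect(image, query)
    return yolo_world.detect(image, classes=[query])
```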

Maintenance & Community

The project is associated with dvlab-research and builds upon prior work including Seg-Zero, EasyR1, and veRL. It utilizes models from Qwen2-VL, Qwen2.5-VL, SAM2, and YOLO-World. The primary citations are the VisionReasoner paper (arXiv:2505.12081) and Seg-Zero (arXiv:2503.06520).

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms of the project and its underlying models.

Limitations & Caveats

The README notes that bugs might arise from API version mismatches, requiring users to debug and customize based on their specific API keys and versions. The project is presented as a research advancement, and its readiness for production environments or extensive commercial use is not detailed.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 21 stars in the last 30 days
