PaDT by Gorilla-Lab-SCUT

Research implementation for unified MLLM visual output

Created 5 months ago

251 stars

Project Summary

Patch-as-Decodable-Token (PaDT) targets the difficulty Multimodal Large Language Models (MLLMs) have in directly generating visual outputs and in reasoning semantically about visual tasks. It introduces Visual Reference Tokens (VRTs), which let an MLLM emit references to image patches alongside text, giving a single paradigm for tasks such as object detection, segmentation, and image captioning. The authors report state-of-the-art results on these tasks and more direct visual reasoning within the MLLM.

How It Works

PaDT's core idea is the Visual Reference Token (VRT): the MLLM treats image patches as decodable tokens and refers to them directly, instead of writing object locations as text coordinates, which carry little semantic meaning. Text tokens and patch references therefore share one token space, which gives native vision-language alignment. A lightweight, plug-and-play decoder then converts the predicted VRTs into precise low-level outputs such as bounding boxes or segmentation masks. The design keeps the LLM's semantic reasoning intact while adding spatial output capabilities.
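The shared token space described above can be pictured with a toy example. The sketch below is an assumption about one plausible way to realize it, not PaDT's actual code: the function names, the 2-D embeddings, and the dot-product scoring are all hypothetical. It appends the current image's patch embeddings to the text embedding table, so the next-token prediction can land on either a word or a patch.

```python
# Illustrative sketch only: names, shapes, and scoring are assumptions,
# not taken from the PaDT repository.


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_token_table(text_embeddings, patch_embeddings):
    """Append this image's patch embeddings to the text embedding table.

    Rows [0, len(text_embeddings)) are ordinary text tokens; every later
    row stands for one image patch, i.e. one visual reference token.
    """
    return list(text_embeddings) + list(patch_embeddings)


def predict_token(hidden_state, token_table, text_vocab_size):
    """Pick the highest-scoring row and report whether it is text or a patch."""
    scores = [dot(hidden_state, row) for row in token_table]
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < text_vocab_size:
        return ("text", best)
    return ("patch", best - text_vocab_size)


# Toy 2-D embeddings: three text tokens and the four patches of a 2x2 grid.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
patch_embeddings = [[0.9, 0.9], [-0.9, 0.9], [0.9, -0.9], [-0.9, -0.9]]
table = build_token_table(text_embeddings, patch_embeddings)

print(predict_token([1.0, 0.0], table, len(text_embeddings)))   # ('text', 0)
print(predict_token([1.0, -1.0], table, len(text_embeddings)))  # ('patch', 2)
```

Because the patch rows are rebuilt for each image, a predicted patch index always points at a region of the image currently being queried, which is what lets the output stay grounded without text coordinates.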

Quick Start & Requirements

To install, clone the repository, create and activate a Conda environment with Python 3.11, and run bash setup.sh. A GPU appears to be required: the demo code loads the model with torch.bfloat16 and device_map={"": 0}, which places the whole model on GPU 0. Key dependencies include PyTorch and the Transformers library.

Code Repository: https://github.com/Gorilla-Lab-SCUT/PaDT.git
Preprocessed Datasets: Available on Hugging Face (PaDT-MLLM/COCO, PaDT-MLLM/RefCOCO, PaDT-MLLM/ReferringImageCaptioning).
Tech Report: https://arxiv.org/abs/2510.01954
Demo Script: eval/test_demo.py
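Before running the demo script, it can help to confirm that the local environment matches the requirements stated above. The following is a minimal preflight sketch, not part of the repository: check_environment is a hypothetical helper, and the package list simply mirrors the dependencies named in this summary.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 11), packages=("torch", "transformers")):
    """Report whether the interpreter and key packages match the stated setup.

    Hypothetical helper for illustration; PaDT's setup.sh may install more.
    """
    missing = [name for name in packages if importlib.util.find_spec(name) is None]
    report = {
        "python_ok": sys.version_info[:2] == tuple(required_python),
        "missing_packages": missing,
        "cuda_available": False,
    }
    # Only query CUDA when torch is requested and importable.
    if "torch" in packages and "torch" not in missing:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```

If cuda_available comes back False, the demo's device_map={"": 0} setting will most likely fail, since it targets GPU 0.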

Highlighted Details

Unified Output Generation: Enables MLLMs to directly generate both textual and visual outputs within a single framework.
State-of-the-Art Performance: Reports SOTA results on four visual tasks: Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), Open Vocabulary Detection (OVD), and Referring Image Captioning (RIC).
Multi-Task Generalization: The PaDT_Pro model, trained jointly on multiple tasks, switches between tasks via prompts and outperforms single-task models.
Lightweight Decoding: Decouples dense prediction from the core LLM, preserving its reasoning power while adding precise spatial output functionality.
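To make the decoding step concrete, here is a deliberately simple stand-in. PaDT's decoder is a learned module; the sketch below swaps in plain geometry (the tightest rectangle around the referenced patches) purely to show how patch indices can map to pixel coordinates. The function names, the row-major patch ordering, and the 14-pixel patch size are assumptions.

```python
# Illustrative stand-in: PaDT's decoder is learned; this is plain geometry.


def patch_to_box(patch_index, grid_width, patch_size):
    """Return the (x0, y0, x1, y1) pixel rectangle covered by one patch.

    Assumes patches are numbered row by row, left to right.
    """
    row, col = divmod(patch_index, grid_width)
    x0, y0 = col * patch_size, row * patch_size
    return (x0, y0, x0 + patch_size, y0 + patch_size)


def enclosing_box(patch_indices, grid_width, patch_size):
    """Tightest pixel box that encloses every referenced patch."""
    if not patch_indices:
        raise ValueError("need at least one patch index")
    boxes = [patch_to_box(i, grid_width, patch_size) for i in patch_indices]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# A 4x4 grid of 14-pixel patches; the model referenced patches 5, 6, and 10.
print(enclosing_box([5, 6, 10], grid_width=4, patch_size=14))  # (14, 14, 42, 42)
```

A learned decoder can do better than this heuristic because it can place box edges inside a patch rather than snapping to patch boundaries, and it can produce masks as well as boxes.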

Maintenance & Community

The project's authors are those listed in the README's citation of the ICLR 2026 paper. As of October 31, 2025, the evaluation scripts were still being updated. The README provides no community channels (e.g., Discord, Slack) or roadmap links.

Licensing & Compatibility

PaDT is released under the Apache 2.0 license. This permissive license allows commercial use and use within closed-source projects, provided the license text and attribution notices are retained.

Limitations & Caveats

The README notes that the evaluation scripts are still being updated, so the evaluation pipeline may change. The project accompanies an ICLR 2026 paper and is recent, so the code and its interfaces may also evolve. The README lists no other limitations.

Health Check

Last Commit: 4 months ago
Responsiveness: Inactive
Star History: 3 stars in the last 30 days