Gorilla-Lab-SCUT
Research implementation for unified MLLM visual output
Patch-as-Decodable-Token (PaDT) addresses the limitations of Multimodal Large Language Models (MLLMs) in directly generating visual outputs and performing semantic reasoning on visual tasks. It introduces Visual Reference Tokens (VRTs) so that an MLLM can process and generate visual information alongside text, offering a unified paradigm for tasks such as object detection, segmentation, and image captioning. The project reports state-of-the-art performance and gives MLLMs more direct visual reasoning capabilities.
How It Works
PaDT's core innovation lies in Visual Reference Tokens (VRTs), which allow MLLMs to treat visual patches as decodable tokens, bypassing the need for less semantic text-based coordinates. This enables native vision-language alignment and a unified token space for seamless multimodal reasoning. A lightweight, plug-and-play decoder then translates these predicted visual tokens into precise low-level outputs such as bounding boxes or segmentation maps. This design preserves the LLM's semantic reasoning while adding robust spatial output capabilities.
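The idea of treating patches as decodable tokens can be illustrated with a toy sketch. This is not PaDT's implementation: the vocabulary size, grid size, patch size, dot-product scoring, and the helper names (`build_output_table`, `decode_step`, `patches_to_box`) are all assumptions chosen for illustration. In particular, the last step simply takes the box enclosing the referenced patches, whereas PaDT is described as using a learned lightweight decoder.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_output_table(text_embeds: List[Vector], patch_embeds: List[Vector]) -> List[Vector]:
    """Append this image's patch embeddings to the text vocabulary table (hypothetical helper)."""
    return text_embeds + patch_embeds


def decode_step(hidden: Vector, table: List[Vector], vocab_size: int) -> Tuple[str, int]:
    """Greedy pick over text tokens and patch tokens in one shared space."""
    scores = [dot(hidden, row) for row in table]
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < vocab_size:
        return ("text", best)
    # Indices past the text vocabulary refer to image patches.
    return ("patch", best - vocab_size)


def patches_to_box(patch_ids: List[int], grid_w: int, patch_px: int) -> Tuple[int, int, int, int]:
    """Stand-in for the learned decoder: pixel box (x0, y0, x1, y1) enclosing the patches."""
    rows = [p // grid_w for p in patch_ids]
    cols = [p % grid_w for p in patch_ids]
    return (min(cols) * patch_px, min(rows) * patch_px,
            (max(cols) + 1) * patch_px, (max(rows) + 1) * patch_px)


# Toy data (assumed): 3 text tokens and a 2x2 patch grid, all with 2-d embeddings.
text_embeds = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_embeds = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [0.0, -1.0]]
table = build_output_table(text_embeds, patch_embeds)

kind, index = decode_step([0.0, 1.0], table, vocab_size=len(text_embeds))
box = patches_to_box([index], grid_w=2, patch_px=14)
print(kind, index, box)
```

In this toy setup the model's hidden state scores highest against the second patch, so the step emits a patch reference instead of a text token, and the stand-in decoder maps that patch to pixel coordinates. The point of the sketch is only that text and patch candidates compete in a single token space.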
Quick Start & Requirements
Installation involves cloning the repository, creating a Conda environment with Python 3.11, activating it, and running bash setup.sh. The project requires GPU acceleration, as indicated by the demo code using torch.bfloat16 and device_map={"": 0}. Key dependencies include PyTorch and the Transformers library.
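Before running the demo, it may help to confirm that the environment matches these requirements. The following pre-flight check is a minimal sketch and not part of the repository; the function name `check_environment` and the report keys are assumptions. It relies on the fact that `device_map={"": 0}` places the whole model on the first CUDA device, so a visible GPU with bfloat16 support is presumably needed.

```python
import sys


def check_environment() -> dict:
    """Report whether the interpreter and GPU match what the demo appears to expect."""
    report = {
        "python_3_11": sys.version_info[:2] == (3, 11),
        "torch_installed": False,
        "cuda_available": False,
        "bf16_supported": False,
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch nothing else can be checked.
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Only query bfloat16 support when a CUDA device is actually present.
        report["bf16_supported"] = torch.cuda.is_bf16_supported()
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {ok}")
```

Any `False` entry points at a step worth revisiting, such as recreating the Conda environment or rerunning `bash setup.sh`.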
Repository: https://github.com/Gorilla-Lab-SCUT/PaDT.git
Resources: PaDT-MLLM/COCO, PaDT-MLLM/RefCOCO, PaDT-MLLM/ReferringImageCaptioning
Paper: https://arxiv.org/abs/2510.01954
Demo script: eval/test_demo.py
Highlighted Details
The PaDT_Pro model switches between tasks via prompts and, when trained jointly across tasks, outperforms single-task models.
Maintenance & Community
The project is associated with the authors listed in its ICLR 2026 paper citation. As of October 31, 2025, evaluation scripts were being updated. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
Licensing & Compatibility
PaDT is licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and linking within closed-source projects.
Limitations & Caveats
The README indicates that evaluation scripts are still being updated, so the evaluation pipeline may change. Because the project accompanies an ICLR 2026 paper, it is recent work, and users should expect the code to evolve. No other explicit limitations are detailed.