Comprehensive region-level visual understanding for images and videos
Perceive Anything Model (PAM) is a region-level vision-language model (VLM) designed for comprehensive understanding of images and videos. It extends SAM 2 by integrating LLMs to perform simultaneous object segmentation and generate diverse, region-specific semantic outputs like categories, definitions, functional explanations, and detailed captions. This framework targets researchers and developers working on advanced visual understanding tasks.
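To make the output types listed above concrete, the sketch below models a per-region result as a simple data structure. The field names follow the description in this section, but the class itself (RegionResult) is hypothetical and not part of the PAM codebase.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RegionResult:
    """Hypothetical container for PAM's per-region outputs (illustration only, not the actual API)."""
    mask: np.ndarray                             # segmentation mask from the SAM 2 branch
    category: str                                # object category for the prompted region
    definition: str                              # brief definition of that category
    function: str                                # functional explanation of the object
    caption: str                                 # detailed region-specific caption
    stream_captions: Optional[list[str]] = None  # streaming captions for video regions, if any
```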
How It Works
PAM efficiently transforms SAM 2's rich visual features into multi-modal tokens comprehensible by LLMs. This approach leverages SAM 2's inherent general vision, localization, and semantic priors. The model is supported by a dedicated data pipeline that refines and augments annotations, including novel region-level streaming video caption data, enabling robust multi-granularity understanding.
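The description above implies a projection step that maps SAM 2's region-level visual features into the LLM's embedding space. The sketch below illustrates that idea with a hypothetical two-layer MLP projector; the module name, layer sizes, and token count are assumptions for illustration, not the actual PAM implementation.

```python
import torch
import torch.nn as nn

class RegionTokenProjector(nn.Module):
    """Hypothetical projector mapping SAM 2 region features to LLM token embeddings."""

    def __init__(self, sam_dim: int = 256, llm_dim: int = 4096, num_tokens: int = 16):
        super().__init__()
        # Pool variable-length region features into a fixed number of tokens.
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        # Two-layer MLP, a common vision-to-LLM projection choice (assumed here).
        self.mlp = nn.Sequential(
            nn.Linear(sam_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, seq_len, sam_dim) features for one prompted region.
        x = self.pool(region_feats.transpose(1, 2)).transpose(1, 2)  # (batch, num_tokens, sam_dim)
        return self.mlp(x)                                           # (batch, num_tokens, llm_dim)

# Example: project dummy SAM 2 features for one region into LLM-sized tokens.
feats = torch.randn(1, 196, 256)
tokens = RegionTokenProjector()(feats)
print(tokens.shape)  # torch.Size([1, 16, 4096])
```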
Quick Start & Requirements
Setup requires creating a conda environment (conda create -n PAM python=3.10), activating it, and installing dependencies with pip install -e ".[train]". Installation steps for SAM 2 and Flash-Attention are also detailed. Inference examples are provided in image_infer_example.ipynb and video_infer_example.ipynb.
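For orientation, a minimal image-inference sketch in the spirit of image_infer_example.ipynb might look like the following. The import path, class name, loader, and infer signature are all assumptions for illustration; consult the notebook for the real API.

```python
# Hypothetical usage sketch; the actual API lives in image_infer_example.ipynb.
from PIL import Image
from pam import PerceiveAnythingModel  # hypothetical import path

# Assumed loader and checkpoint location; the notebook documents the real ones.
model = PerceiveAnythingModel.from_pretrained("checkpoints/PAM")
image = Image.open("example.jpg")

# Prompt a region with a point (or box), as in SAM 2-style prompting.
result = model.infer(image, points=[[320, 240]])
print(result.category, result.caption)
```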
Highlighted Details
Maintenance & Community
The project is associated with multiple institutions (CUHK, HKU, PolyU, PekingU). Model weights and datasets were released on June 8, 2025. Links to a project website, paper, model downloads, dataset, and citation are provided.
Licensing & Compatibility
Licensed under Apache 2.0.
Limitations & Caveats
Source images for the dataset are not directly provided, but download links or official website addresses are given. A local Gradio demo is in progress.