PAM by Perceive-Anything

Comprehensive region-level visual understanding for images and videos

created 2 months ago
272 stars

Top 94.7% on SourcePulse

Project Summary

Perceive Anything Model (PAM) is a region-level vision-language model (VLM) for comprehensive understanding of images and videos. It extends SAM 2 with an LLM so that, while segmenting an object, it also generates diverse region-specific semantic outputs: categories, definitions, functional explanations, and detailed captions. The framework targets researchers and developers working on advanced visual understanding tasks.

How It Works

PAM transforms SAM 2's rich visual features into multi-modal tokens that an LLM can consume, reusing SAM 2's general vision, localization, and semantic priors rather than learning them from scratch. The model is supported by a dedicated data pipeline that refines and augments annotations, including new region-level streaming video caption data, which enables understanding at multiple levels of granularity.
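The idea of turning region features into LLM-readable tokens can be illustrated with a toy sketch. The summary above does not describe PAM's actual mechanism, so everything here is an assumption made for illustration: the sketch averages per-position feature vectors inside a region mask and applies a linear projection to the LLM embedding width. The function names (masked_average_pool, project, region_to_token) are hypothetical and do not come from the PAM codebase.

```python
from typing import List, Sequence


def masked_average_pool(
    features: Sequence[Sequence[float]], mask: Sequence[int]
) -> List[float]:
    """Average the feature vectors at positions where the mask is non-zero.

    `features` holds one vector per spatial position (flattened H*W);
    `mask` holds one 0/1 flag per position marking the region of interest.
    """
    if len(features) != len(mask):
        raise ValueError("features and mask must have the same length")
    selected = [vec for vec, flag in zip(features, mask) if flag]
    if not selected:
        raise ValueError("mask selects no positions")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


def project(vec: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear projection; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def region_to_token(
    features: Sequence[Sequence[float]],
    mask: Sequence[int],
    weight: Sequence[Sequence[float]],
) -> List[float]:
    """Pool the masked region, then map it to the (assumed) LLM embedding width."""
    return project(masked_average_pool(features, mask), weight)


# Four positions with 2-dim features; the region covers positions 0 and 2.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = [1, 0, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # projects 2 dims to 3 dims
print(region_to_token(features, mask, weight))  # [3.0, 4.0, 7.0]
```

A real model would use learned weights and tensor operations on a GPU; plain lists are used here only to keep the sketch free of dependencies.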

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n PAM python=3.10), activate it, and install dependencies with pip install -e ".[train]". The repository also documents how to install SAM 2 and Flash-Attention.
  • Prerequisites: Python 3.10, CUDA (for Flash-Attention), SAM2.1-h-large checkpoint.
  • Resources: Model weights (1.5B and 3B variants) and datasets are available for download.
  • Demos: Examples are provided in image_infer_example.ipynb and video_infer_example.ipynb.
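The prerequisites listed above can be checked before installation with a short preflight script. This is a minimal sketch and not part of the PAM repository: the function name preflight and the checkpoint path passed to it are hypothetical, and looking for nvcc on the PATH is only a rough heuristic for a usable CUDA toolkit.

```python
import shutil
import sys
from pathlib import Path
from typing import Dict


def preflight(checkpoint_path: str) -> Dict[str, bool]:
    """Report whether the documented prerequisites appear to be met.

    `checkpoint_path` is wherever you saved the SAM 2.1 checkpoint;
    the location is your choice and is not prescribed by this sketch.
    """
    return {
        # The summary specifies Python 3.10 for the conda environment.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Heuristic: building Flash-Attention needs the CUDA toolkit,
        # and nvcc on the PATH suggests one is installed.
        "cuda_toolkit_on_path": shutil.which("nvcc") is not None,
        # The checkpoint must exist as a regular file.
        "checkpoint_found": Path(checkpoint_path).is_file(),
    }


# Hypothetical path, used only to demonstrate the call.
for name, ok in preflight("checkpoints/sam2_checkpoint.pt").items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Running the script inside the activated PAM environment gives a quick indication of which prerequisite still needs attention.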

Highlighted Details

  • End-to-end region-level VLM framework.
  • Integrates LLMs with SAM 2 for segmentation and semantic understanding.
  • Supports image, video, and video stream processing.
  • Includes a refined dataset with region-level streaming video caption data.

Maintenance & Community

The project is affiliated with several institutions (CUHK, HKU, PolyU, and Peking University). Model weights and datasets were released on June 8, 2025. The repository links to a project website, the paper, model downloads, the dataset, and citation information.

Licensing & Compatibility

Licensed under Apache 2.0.

Limitations & Caveats

The dataset does not include its source images directly; download links or official website addresses are provided instead. A local Gradio demo is still in progress.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 27 stars in the last 30 days
