ZegCLIP by ZiqinZhou66

CLIP-based zero-shot semantic segmentation

Created 3 years ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

ZegCLIP addresses zero-shot semantic segmentation with an efficient one-stage adaptation of CLIP, moving beyond complex two-stage pipelines. It extends CLIP's image-level zero-shot capability directly to the pixel level, giving computer vision researchers and practitioners a simpler, faster, and stronger-performing solution.

How It Works

ZegCLIP uses a one-stage strategy that directly compares text embeddings with patch embeddings extracted from CLIP. To mitigate overfitting on seen classes and improve generalization to unseen classes, the project introduces three effective design modifications. This retains CLIP's inherent zero-shot capacity while significantly improving pixel-level performance, and it avoids the computational overhead of the multi-encoder architectures used in prior two-stage methods.
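
The gist of the one-stage comparison can be illustrated with a minimal sketch: class text embeddings from CLIP's text encoder are scored against patch embeddings from its image encoder, and the resulting similarity map is upsampled into per-pixel predictions. The shapes, random inputs, and upsampling below are illustrative assumptions; ZegCLIP's actual decoder and its three design modifications are not reproduced here.

```python
# Minimal sketch of the one-stage idea: score every image patch against the
# class text embeddings and upsample the similarity map to per-pixel predictions.
# Shapes assume a ViT-B/16 backbone on a 512x512 input (32x32 patch grid, dim 512);
# random tensors stand in for the real CLIP encoders.
import torch
import torch.nn.functional as F

num_classes, grid, dim = 21, 32, 512
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)    # CLIP text encoder output
patch_emb = F.normalize(torch.randn(grid * grid, dim), dim=-1)   # CLIP patch tokens

logits = patch_emb @ text_emb.t()                                # (1024, 21) cosine similarities
logits = logits.view(1, grid, grid, num_classes).permute(0, 3, 1, 2)
seg = F.interpolate(logits, size=(512, 512), mode="bilinear", align_corners=False)
pred = seg.argmax(dim=1)                                         # (1, 512, 512) per-pixel classes
print(pred.shape)
```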

Quick Start & Requirements

Installation can be achieved via Conda/Pip or Docker.

  • Conda/Pip: Requires specific versions: PyTorch 1.10.1, torchvision 0.11.2, torchaudio 0.10.1, cudatoolkit 10.2, mmcv-full 1.4.4, mmsegmentation 0.24.0, SciPy, and timm 0.3.2.
  • Docker: Utilize the ziqinzhou/zegclip:latest image.
  • Datasets: Prepare data following the MMSegmentation guidelines.
  • Pretrained CLIP: A CLIP ViT-B/16 checkpoint is essential, downloadable from https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt (see the environment check sketched after this list).
  • Hardware: Experiments are typically run on a single GPU (e.g., a 1080 Ti).
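
A quick sanity check of the pinned environment and the downloaded checkpoint might look like the sketch below. The version prints simply echo the README's pinned versions, and the local path ViT-B-16.pt is an assumed location for the checkpoint fetched from the URL above.

```python
# Quick sanity check of the pinned environment and the downloaded CLIP checkpoint.
# "ViT-B-16.pt" below is an assumed local path to the file fetched from the URL above.
import torch
import torchvision
import mmcv
import mmseg
import timm

print("torch:", torch.__version__)              # README pins 1.10.1
print("torchvision:", torchvision.__version__)  # README pins 0.11.2
print("mmcv-full:", mmcv.__version__)           # README pins 1.4.4
print("mmsegmentation:", mmseg.__version__)     # README pins 0.24.0
print("timm:", timm.__version__)                # README pins 0.3.2

# OpenAI distributes the ViT-B/16 weights as a TorchScript archive, so torch.jit.load works.
clip_model = torch.jit.load("ViT-B-16.pt", map_location="cpu").eval()
print("CLIP checkpoint loaded:", type(clip_model).__name__)
```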

Highlighted Details

  • Achieves state-of-the-art performance on PASCAL VOC 2012 and COCO Stuff 164K benchmarks across inductive and transductive zero-shot settings.
  • Delivers approximately 5x faster inference speeds compared to previous two-stage methods.
  • Models are relatively lightweight, featuring ~13.8-14.6M parameters and ~110.4-123.9G FLOPs.
  • Provides comprehensive benchmark results, including pAcc, mIoU(S), mIoU(U), and hIoU (see the sketch after this list).
  • Pretrained model zoos are accessible via Google Drive links detailed in the README.
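
For reference, the hIoU reported alongside mIoU(S) and mIoU(U) is the standard harmonic mean of the seen and unseen scores; the small helper below makes that relationship explicit, with placeholder numbers rather than results from the README.

```python
# hIoU is the harmonic mean of mIoU over seen (S) and unseen (U) classes.
def hiou(miou_seen: float, miou_unseen: float) -> float:
    """Harmonic mean of seen/unseen mIoU (both in percent)."""
    if miou_seen + miou_unseen == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

print(round(hiou(90.0, 80.0), 1))  # 84.7 -- placeholder values, not README results
```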

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license.

Limitations & Caveats

The README does not detail specific limitations or known issues. The core approach is designed to address challenges related to generalization and overfitting inherent in adapting CLIP to pixel-level tasks.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jianwei Yang (Research Scientist at Meta Superintelligence Lab), Jiaming Song (Chief Scientist at Luma AI), and 5 more.

  • X-Decoder by microsoft (1k stars): Generalized decoding model for pixel, image, and language tasks. Created 3 years ago, updated 2 years ago.