X-Decoder  by microsoft

Generalized decoding model for pixel, image, and language tasks

created 2 years ago
1,324 stars

Top 30.9% on sourcepulse

Project Summary

X-Decoder is a generalized decoding model designed for unified pixel-level segmentation and token-level text generation across various vision and language tasks. It offers state-of-the-art performance on open-vocabulary and referring segmentation, and can be flexibly finetuned for tasks like image captioning, retrieval, and visual question answering.

How It Works

X-Decoder uses a single architecture that combines pixel-level and text decoding. Built on Mask2Former, it handles semantic, instance, and panoptic segmentation as well as image captioning and retrieval with one set of pretrained parameters. This design also allows zero-shot task composition, enabling applications such as region retrieval and image editing.
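
As a rough illustration of how one decoder can serve both pixel-level and text-level outputs, the toy sketch below follows the Mask2Former-style recipe of scoring every pixel feature against a query embedding to form a mask, then labels each query by comparing it with text embeddings. The function names (`predict_masks`, `label_queries`), the hand-picked 2-D embeddings, and the 0.5 threshold are all assumptions made for this example; it is not X-Decoder's code or API.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_masks(queries, pixel_feats, threshold=0.5):
    """Return one binary mask per query: sigmoid(query . pixel) > threshold."""
    return [
        [[sigmoid(dot(q, px)) > threshold for px in row] for row in pixel_feats]
        for q in queries
    ]


def label_queries(queries, text_embeds):
    """Give each query the text label with the highest dot-product similarity."""
    return [
        max(text_embeds, key=lambda name: dot(q, text_embeds[name]))
        for q in queries
    ]


# Toy 2x2 "image": each pixel carries a hypothetical 2-D feature vector.
pixel_feats = [
    [[2.0, -2.0], [2.0, -2.0]],   # top row
    [[-2.0, 2.0], [-2.0, 2.0]],   # bottom row
]
queries = [[1.0, 0.0], [0.0, 1.0]]                       # hypothetical query embeddings
text_embeds = {"sky": [1.0, 0.0], "grass": [0.0, 1.0]}   # hypothetical vocabulary

masks = predict_masks(queries, pixel_feats)
labels = label_queries(queries, text_embeds)
for label, mask in zip(labels, masks):
    print(label, mask)
```

Because the label set is just a dictionary of text embeddings, swapping in a different vocabulary changes the labels without touching the mask step, which is the intuition behind open-vocabulary segmentation in this toy setting.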

Quick Start & Requirements

  • Install: git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh
  • Prerequisites: Python, PyTorch. Specific requirements detailed in INSTALL.md.
  • Resources: Model checkpoints and comprehensive user guides are available.
  • Demos: HuggingFace All-in-One Demo, HuggingFace Instruct Demo

Highlighted Details

  • Achieves state-of-the-art results on open-vocabulary segmentation and referring segmentation across eight datasets.
  • Supports zero-shot task composition for region retrieval, referring captioning, and image editing.
  • Offers unified pretrained parameters for semantic, instance, panoptic segmentation, image captioning, and image-text retrieval.
  • Includes companion models like SEEM (Segment Everything Everywhere All At Once) for interactive segmentation.
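
Task composition in this sense means chaining capabilities without extra training. The sketch below shows the idea for region retrieval under simplifying assumptions: image-text retrieval picks the best-matching image, then referring segmentation keeps the pixels that match the same phrase. The helpers `retrieve_image`, `refer_segment`, and `region_retrieval`, the file names, and the embeddings are hypothetical stand-ins, not functions or data from the repository.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def retrieve_image(text_embed, image_embeds):
    """Retrieval step: id of the image whose embedding best matches the text."""
    return max(image_embeds, key=lambda k: cosine(text_embed, image_embeds[k]))


def refer_segment(text_embed, pixel_feats, threshold=0.5):
    """Referring-segmentation step: keep pixels whose feature matches the text."""
    return [[cosine(text_embed, px) > threshold for px in row] for row in pixel_feats]


def region_retrieval(text_embed, image_embeds, pixel_feats_by_image):
    """Compose the two steps: pick an image, then segment the phrase inside it."""
    image_id = retrieve_image(text_embed, image_embeds)
    return image_id, refer_segment(text_embed, pixel_feats_by_image[image_id])


# Hypothetical embeddings that are assumed to share one space with the text.
image_embeds = {"beach.jpg": [0.9, 0.1], "forest.jpg": [0.1, 0.9]}
pixel_feats_by_image = {
    "beach.jpg": [[[1.0, 0.0], [0.0, 1.0]]],    # 1x2 image
    "forest.jpg": [[[0.0, 1.0], [0.0, 1.0]]],   # 1x2 image
}
query = [1.0, 0.0]  # stands in for an encoded phrase

image_id, mask = region_retrieval(query, image_embeds, pixel_feats_by_image)
print(image_id, mask)
```

The composition only works in this sketch because both steps score against the same text embedding, which mirrors the summary's point about sharing one set of pretrained parameters across tasks.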

Maintenance & Community

The project is associated with CVPR 2023. Updates noted in the README include the release of training and evaluation code and new checkpoints. The README also points to related projects such as OpenSeeD and X-GPT.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

Without a stated license, commercial use or integration into closed-source projects may be restricted. The project is presented as the official implementation of a CVPR 2023 paper, which suggests a focus on research and academic use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 90 days
