PixelLM by MaverickRen

LMM for pixel-level image reasoning and segmentation

Created 2 years ago
254 stars

Top 99.0% on SourcePulse

View on GitHub
Project Summary

Large Multimodal Models (LMMs) often struggle with pixel-level reasoning and understanding, especially for tasks involving arbitrary numbers of open-set targets. PixelLM addresses this by providing an effective and efficient LMM solution for pixel-level reasoning and understanding. It enables precise mask generation for complex segmentation tasks without requiring additional, costly segmentation models, thereby enhancing efficiency and transferability to diverse applications.

How It Works

PixelLM integrates a novel, lightweight pixel decoder and a comprehensive segmentation codebook into a standard LMM architecture. This design allows it to efficiently produce masks from the hidden embeddings of codebook tokens, which encode detailed target-relevant information. This approach avoids the need for separate, computationally expensive segmentation models. Additionally, a target refinement loss is incorporated to improve the model's capability to differentiate between multiple targets, leading to higher quality masks.
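The decoder-plus-codebook idea can be illustrated with a minimal NumPy sketch: each codebook token's hidden embedding acts as a query that a lightweight decoder correlates with the image feature map to produce one mask per target. All shapes, names, and the single linear projection below are illustrative assumptions, not the actual PixelLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16            # LMM hidden size (illustrative)
C = 8             # image feature channels (illustrative)
H = W = 4         # spatial size of the feature map (illustrative)
num_targets = 3   # arbitrary number of open-set targets

# Hidden embeddings of the segmentation-codebook tokens, one per target
token_embeds = rng.normal(size=(num_targets, D))

# Image feature map from the vision encoder (e.g., a CLIP vision tower)
img_feats = rng.normal(size=(C, H, W))

# "Lightweight decoder" stand-in: project each token embedding into the
# image feature space, then correlate it with every spatial location.
W_proj = rng.normal(size=(D, C))
queries = token_embeds @ W_proj                      # (num_targets, C)
logits = np.einsum("tc,chw->thw", queries, img_feats)

# Sigmoid + threshold yields one binary mask per target
masks = 1.0 / (1.0 + np.exp(-logits)) > 0.5          # (num_targets, H, W)
print(masks.shape)
```

Because masks come straight from token embeddings, the number of targets is not fixed in advance, which is what lets the model handle an arbitrary number of open-set targets without a separate segmentation model.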

Quick Start & Requirements

  • Installation: pip install -r requirements.txt.
  • Prerequisites: Requires LLaVA pre-trained weights (e.g., LLaVA-Lightning-7B-v1-1 for PixelLM-7B, liuhaotian/llava-llama-2-13b-chat-lightning-preview for PixelLM-13B). Extensive dataset preparation is necessary, including the custom MUSE dataset and potentially COCO, ADE20K, and others, organized as specified in the README.
  • Hardware: Training commands use deepspeed, indicating multi-GPU distributed training. Inference runs via chat.py. The vision tower openai/clip-vit-large-patch14-336 is specified.
  • Links: The README mentions direct links for Paper, Models, Training, Inference, Dataset, and Project Page, but the URLs themselves are not provided in the text.

Highlighted Details

  • Accepted to CVPR 2024.
  • Introduces MUSE, a novel, high-quality multi-target reasoning segmentation dataset curated using a GPT-4V-aided pipeline, featuring 246k question-answer pairs and 0.9 million instances.
  • Achieves new state-of-the-art results on various pixel-level reasoning benchmarks, including MUSE and both single- and multi-referring segmentation.
  • Features a unique, lightweight pixel decoder and segmentation codebook for efficient mask generation.

Maintenance & Community

The README does not provide specific details on community channels (like Discord/Slack), a public roadmap, or dedicated maintainer information beyond institutional affiliations (Beijing Jiaotong University, University of Science and Technology Beijing, ByteDance, Peng Cheng Laboratory).

Licensing & Compatibility

The project's license is not explicitly stated in the README. This absence prevents an assessment of its compatibility for commercial use or integration within closed-source projects.

Limitations & Caveats

The setup process requires significant effort in data preparation and dependency management, including integration with LLaVA and potentially large datasets. The unspecified license is a notable adoption blocker. The project builds upon LLaVA and LISA, implying potential inheritance of their respective limitations or dependencies.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

