OMG-Seg by lxtGH

Vision model research combining visual perception, reasoning, and multi-modal language tasks

created 1 year ago
1,314 stars

Top 31.1% on sourcepulse

Project Summary

This repository provides codebases for OMG-LLaVA and OMG-Seg, aiming to unify multiple visual perception and reasoning tasks into single models. OMG-LLaVA bridges image, object, and pixel-level understanding for multimodal large language model applications, while OMG-Seg offers a universal solution for over ten distinct segmentation tasks, including image, video, and open-vocabulary settings. The target audience includes researchers and practitioners in computer vision and multimodal AI seeking efficient, high-performance, and unified solutions.

How It Works

OMG-LLaVA integrates a universal segmentation method as a visual encoder, converting visual information and prompts into tokens for a large language model (LLM). The LLM handles text instructions, generating text responses and pixel-level segmentation outputs. This approach enables end-to-end training of a single encoder, decoder, and LLM for diverse multimodal tasks. OMG-Seg employs a transformer-based encoder-decoder architecture with task-specific queries and outputs, allowing a single model to handle numerous segmentation tasks efficiently with reduced computational overhead.
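
To make the OMG-LLaVA data flow concrete, here is a minimal, self-contained PyTorch sketch of the idea described above. It is not the repository's actual modules or API; all class names, dimensions, and heads are illustrative assumptions. A segmentation-style encoder produces visual tokens, a toy "LLM" consumes them together with the text instruction, and separate heads read text logits and a coarse mask back out of the hidden states.

```python
# Illustrative sketch of the OMG-LLaVA-style pipeline (assumed names/shapes,
# not the repo's code): visual tokens + text tokens -> LLM -> text and mask.
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Stand-in for the universal segmentation encoder that emits visual tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.pool = nn.AdaptiveAvgPool2d(8)              # 8 * 8 = 64 tokens

    def forward(self, image):                            # (B, 3, H, W)
        feats = self.pool(self.proj(image))              # (B, dim, 8, 8)
        return feats.flatten(2).transpose(1, 2)          # (B, 64, dim)

class ToyOMGLLaVA(nn.Module):
    """Single encoder + LLM + heads, trained end to end (conceptually)."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.visual = VisualTokenizer(dim)
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)   # toy "LLM"
        self.text_head = nn.Linear(dim, vocab)           # text response logits
        self.mask_head = nn.Linear(dim, 32 * 32)          # coarse pixel-level mask

    def forward(self, image, text_ids):
        vis = self.visual(image)                          # visual tokens
        txt = self.embed(text_ids)                        # instruction tokens
        hidden = self.llm(torch.cat([vis, txt], dim=1))   # joint sequence
        txt_hidden = hidden[:, vis.size(1):]
        text_logits = self.text_head(txt_hidden)
        # Treat the final text position as a "[SEG]"-style output token and
        # decode a coarse mask from it (illustrative stand-in for a mask decoder).
        masks = self.mask_head(txt_hidden[:, -1]).view(-1, 1, 32, 32)
        return text_logits, masks

model = ToyOMGLLaVA()
logits, masks = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(logits.shape, masks.shape)   # (1, 12, 1000) and (1, 1, 32, 32)
```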

Quick Start & Requirements

  • OMG-Seg: Training code and models are released. Reproduction is possible on a single 32GB V100 or 40GB A100 machine.
  • OMG-LLaVA: Test code and 7B models are released; the full code is available.
  • Hugging Face integration is supported (see the checkpoint-download sketch after this list).
  • Official project pages and arXiv links are provided for detailed information.
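
Since checkpoints are published on the Hugging Face Hub, the snippet below sketches how they could be fetched with huggingface_hub. The repo id is a placeholder, not a confirmed model name from the README.

```python
# Hedged sketch: download released checkpoints from the Hugging Face Hub.
# The repo id is a placeholder -- substitute the model name listed on the
# project README / Hugging Face page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/<omg-seg-model>")  # placeholder id
print("Checkpoints downloaded to:", local_dir)
```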

Highlighted Details

  • OMG-LLaVA achieves image, object, and pixel-level reasoning in one model, matching or surpassing specialized methods.
  • OMG-Seg supports over ten segmentation tasks (semantic, instance, panoptic, video, open-vocabulary, and interactive) with a unified architecture and roughly 70M trainable parameters (see the decoder sketch after this list).
  • Both codebases are open-sourced, including training, inference, and demo scripts.
  • OMG-LLaVA was accepted to NeurIPS 2024; OMG-Seg was accepted to CVPR 2024.
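
The following sketch illustrates the shared query-based decoding idea behind OMG-Seg's unified architecture. It is an illustration under assumed names and dimensions, not the repository's actual code: one set of learned object queries cross-attends to image or video features, and every task reads its answer from the same per-query class and mask predictions.

```python
# Illustrative sketch of a shared query-based decoder (assumed shapes/names):
# one decoder and one set of heads serve all segmentation tasks.
import torch
import torch.nn as nn

class SharedQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=133):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)              # dotted with pixel features

    def forward(self, pixel_feats):                        # (B, HW, dim)
        B = pixel_feats.size(0)
        q = self.decoder(self.queries.expand(B, -1, -1), pixel_feats)
        cls_logits = self.cls_head(q)                       # (B, Q, C + 1)
        masks = torch.einsum("bqd,bpd->bqp", self.mask_embed(q), pixel_feats)
        # Panoptic/instance/semantic/video tasks post-process these same
        # per-query outputs differently instead of using separate models.
        return cls_logits, masks

decoder = SharedQueryDecoder()
cls_logits, masks = decoder(torch.randn(2, 32 * 32, 256))
print(cls_logits.shape, masks.shape)   # (2, 100, 134) and (2, 100, 1024)
```

Reusing one decoder and one set of heads across tasks is what makes a comparatively small trainable parameter budget plausible when the heavy feature backbone is shared or frozen.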

Maintenance & Community

The project has released code and checkpoints for both OMG-LLaVA and OMG-Seg. Hugging Face support is available, and links to the project pages and arXiv papers are provided.

Licensing & Compatibility

  • OMG-Seg is released under the S-Lab License.
  • OMG-LLaVA is released under the Apache-2.0 license, following LLaVA and XTuner.

Limitations & Caveats

The README lists adding more easy-to-use tutorials as a to-do item, and it does not exhaustively detail performance benchmarks for every supported task.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 37 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

  • LLaVA by haotian-liu: Multimodal assistant with GPT-4 level capabilities. 23k stars, top 0.3%, created 2 years ago, updated 11 months ago.