OMG-Seg by lxtGH

Vision model research combining visual perception, reasoning, and multi-modal language tasks

Created 1 year ago
1,327 stars

Top 30.3% on SourcePulse

Project Summary

This repository provides the codebases for OMG-LLaVA and OMG-Seg, two models that each unify multiple visual perception and reasoning tasks. OMG-LLaVA bridges image-, object-, and pixel-level understanding for multimodal large language model applications, while OMG-Seg offers a universal solution for over ten distinct segmentation tasks across image, video, and open-vocabulary settings. The target audience is researchers and practitioners in computer vision and multimodal AI who want efficient, high-performance, unified solutions.

How It Works

OMG-LLaVA integrates a universal segmentation method as a visual encoder, converting visual information and prompts into tokens for a large language model (LLM). The LLM handles text instructions, generating text responses and pixel-level segmentation outputs. This approach enables end-to-end training of a single encoder, decoder, and LLM for diverse multimodal tasks. OMG-Seg employs a transformer-based encoder-decoder architecture with task-specific queries and outputs, allowing a single model to handle numerous segmentation tasks efficiently with reduced computational overhead.
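The snippet below is a minimal, self-contained PyTorch sketch of the OMG-LLaVA data flow described above: a segmentation-style encoder tokenizes the image, a language model consumes the visual and text tokens together, and a pixel decoder turns a [SEG]-style hidden state into a mask. All class names, shapes, and the stand-in transformer are illustrative assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Stands in for the universal segmentation encoder: maps an image
    to a short sequence of visual/object tokens."""
    def __init__(self, dim=256, num_tokens=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.num_tokens = num_tokens

    def forward(self, image):                      # image: (B, 3, H, W)
        feats = self.backbone(image)               # (B, dim, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim)
        return tokens[:, : self.num_tokens]        # truncate for the sketch

class PixelDecoder(nn.Module):
    """Turns a [SEG]-style hidden state from the LLM into mask logits."""
    def __init__(self, dim=256, mask_hw=(64, 64)):
        super().__init__()
        self.proj = nn.Linear(dim, mask_hw[0] * mask_hw[1])
        self.mask_hw = mask_hw

    def forward(self, seg_hidden):                 # seg_hidden: (B, dim)
        h, w = self.mask_hw
        return self.proj(seg_hidden).view(-1, 1, h, w)

dim = 256
# Toy stand-in for the LLM: any decoder over visual + text tokens fits here.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
)
tokenizer = VisualTokenizer(dim)
decoder = PixelDecoder(dim)

image = torch.randn(1, 3, 256, 256)
text_tokens = torch.randn(1, 16, dim)              # embedded text instruction
visual_tokens = tokenizer(image)

hidden = llm(torch.cat([visual_tokens, text_tokens], dim=1))
seg_hidden = hidden[:, -1]                         # pretend the last position is a [SEG] token
masks = decoder(seg_hidden)
print(masks.shape)                                 # torch.Size([1, 1, 64, 64])
```

In the actual model, the toy transformer is replaced by a full 7B LLM and the tokenizer by the universal segmentation encoder; the sketch only shows how the pieces connect.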

Quick Start & Requirements

  • OMG-Seg: Training code and models are released. Reproduction is possible on a single 32GB V100 or 40GB A100 machine.
  • OMG-LLaVA: Test code and 7B model checkpoints are released; the full codebase is available.
  • Supports HuggingFace integration (a download sketch follows this list).
  • Official project pages and arXiv links are provided for detailed information.
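
Given the HuggingFace integration, fetching released checkpoints might look like the sketch below. The repo id and file patterns are hypothetical placeholders; check the project README for the real model identifiers before running it.

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="LXT/OMG_Seg",               # hypothetical id; replace with the published one
    allow_patterns=["*.pth", "*.json"],  # only pull weights and configs
)
print("checkpoints downloaded to:", local_dir)
```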

Highlighted Details

  • OMG-LLaVA achieves image, object, and pixel-level reasoning in one model, matching or surpassing specialized methods.
  • OMG-Seg supports over ten segmentation tasks (semantic, instance, panoptic, video, open-vocabulary, interactive) with a unified architecture and ~70M trainable parameters (see the sketch after this list).
  • Both codebases are open-sourced, including training, inference, and demo scripts.
  • OMG-LLaVA was accepted at NeurIPS 2024; OMG-Seg was accepted at CVPR 2024.
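
The sketch below illustrates, in generic PyTorch, the kind of query-based head behind the unified architecture: learned queries attend to image features and are decoded into class and mask predictions. In OMG-Seg the queries and outputs are task-specific while the encoder-decoder is shared; the query and class counts, module names, and shapes here are illustrative assumptions, not the repository's implementation.

```python
import torch
import torch.nn as nn

class UnifiedSegHead(nn.Module):
    """Query-based mask head: learned queries attend to image features
    and each query is decoded into a class prediction plus a mask."""
    def __init__(self, dim=256, num_queries=100, num_classes=133):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, feats):                  # feats: (B, HW, dim) flattened features
        b = feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.cross_attn(q, feats, feats)
        logits = self.cls_head(q)              # (B, Q, num_classes + 1)
        masks = torch.einsum("bqd,bnd->bqn", self.mask_embed(q), feats)
        return logits, masks                   # masks reshape to (B, Q, H, W) downstream

head = UnifiedSegHead()
feats = torch.randn(2, 32 * 32, 256)           # e.g. a 32x32 feature map, flattened
logits, masks = head(feats)
print(logits.shape, masks.shape)               # (2, 100, 134) (2, 100, 1024)
```

Because the encoder-decoder weights are shared across tasks, adding a task mostly means adding queries and an output mapping, which is consistent with the small trainable-parameter budget highlighted above.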

Maintenance & Community

Code and checkpoints are released for both OMG-LLaVA and OMG-Seg, with HuggingFace support available. Project pages and arXiv papers are linked for further detail.

Licensing & Compatibility

  • OMG-Seg follows the S-Lab LICENSE.
  • OMG-LLaVA follows the Apache-2.0 license, consistent with the LLaVA and XTuner projects it builds on.

Limitations & Caveats

The README lists a to-do item for adding more easy-to-use tutorials, and it does not exhaustively detail performance benchmarks for every supported task.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days
