OMG-Seg by lxtGH

Vision model research combining visual perception, reasoning, and multi-modal language tasks

Created 1 year ago
1,322 stars

Top 30.3% on SourcePulse

View on GitHub
Project Summary

This repository provides codebases for OMG-LLaVA and OMG-Seg, aiming to unify multiple visual perception and reasoning tasks into single models. OMG-LLaVA bridges image, object, and pixel-level understanding for multimodal large language model applications, while OMG-Seg offers a universal solution for over ten distinct segmentation tasks, including image, video, and open-vocabulary settings. The target audience includes researchers and practitioners in computer vision and multimodal AI seeking efficient, high-performance, and unified solutions.

How It Works

OMG-LLaVA integrates a universal segmentation method as a visual encoder, converting visual information and prompts into tokens for a large language model (LLM). The LLM handles text instructions, generating text responses and pixel-level segmentation outputs. This approach enables end-to-end training of a single encoder, decoder, and LLM for diverse multimodal tasks. OMG-Seg employs a transformer-based encoder-decoder architecture with task-specific queries and outputs, allowing a single model to handle numerous segmentation tasks efficiently with reduced computational overhead.
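
The query-based design can be illustrated with a short, self-contained sketch. This is not the repository's actual code: the module name, dimensions, and query count below are assumptions made for illustration, and it only shows the general pattern of shared image features, learned queries, and per-query class/mask heads that a unified segmentation model reuses across tasks.

```python
import torch
import torch.nn as nn

class UnifiedSegDecoder(nn.Module):
    """Illustrative query-based decoder: one set of learned queries attends to
    shared image features and yields per-query class and mask predictions that
    different segmentation tasks can reuse."""

    def __init__(self, dim=256, num_queries=100, num_classes=133):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for a "no object" class
        self.mask_head = nn.Linear(dim, dim)                # projects queries to mask embeddings

    def forward(self, pixel_feats):
        # pixel_feats: (B, H*W, dim) flattened features from a shared backbone
        b = pixel_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q = self.decoder(q, pixel_feats)                    # queries attend to image features
        class_logits = self.class_head(q)                   # (B, Q, num_classes + 1)
        mask_embed = self.mask_head(q)                      # (B, Q, dim)
        mask_logits = torch.einsum("bqc,bpc->bqp", mask_embed, pixel_feats)  # (B, Q, H*W)
        return class_logits, mask_logits

# Toy forward pass: semantic, instance, panoptic, or video settings would all
# consume the same (class_logits, mask_logits) pair with task-specific post-processing.
feats = torch.randn(2, 64 * 64, 256)
cls, masks = UnifiedSegDecoder()(feats)
print(cls.shape, masks.shape)  # torch.Size([2, 100, 134]) torch.Size([2, 100, 4096])
```

Under this pattern, the different tasks differ mainly in how the queries are seeded (e.g. interactive prompts) and how the shared outputs are post-processed, which is roughly what allows one set of weights to cover many settings.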

Quick Start & Requirements

  • OMG-Seg: Training code and models are released. Reproduction is possible on a single 32GB V100 or 40GB A100 machine.
  • OMG-LLaVA: Test code and 7B models are released; the full codebase is also available.
  • Supports HuggingFace integration (see the checkpoint-download sketch after this list).
  • Official project pages and arXiv links are provided for detailed information.
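
Because the released checkpoints are hosted on HuggingFace, one way to fetch them is via the Hub's `snapshot_download`. This is a minimal sketch, not the project's documented workflow: the repo id and file patterns below are placeholders, so substitute the identifiers linked from the README.

```python
# Minimal sketch of fetching released checkpoints from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="LXT/OMG_Seg",                # placeholder id; use the one from the project page
    allow_patterns=["*.pth", "*.json"],   # illustrative patterns to limit the download
)
print("Checkpoints downloaded to:", local_dir)
```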

Highlighted Details

  • OMG-LLaVA achieves image, object, and pixel-level reasoning in one model, matching or surpassing specialized methods.
  • OMG-Seg supports over ten segmentation tasks (semantic, instance, panoptic, video, open-vocabulary, interactive) with a unified architecture and ~70M trainable parameters.
  • Both codebases are open-sourced, including training, inference, and demo scripts.
  • OMG-LLaVA was accepted at NeurIPS 2024; OMG-Seg was accepted at CVPR 2024.

Maintenance & Community

The project has released code and checkpoints for both OMG-LLaVA and OMG-Seg. HuggingFace support is available. Links to project pages and arXiv papers are provided.

Licensing & Compatibility

  • OMG-Seg follows the S-Lab LICENSE.
  • OMG-LLaVA follows the Apache-2.0 license, consistent with LLaVA and XTuner.

Limitations & Caveats

The README's to-do list still includes adding more easy-to-use tutorials, and performance benchmarks for the supported tasks are not exhaustively detailed in the main README.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding

Top 0.1% on SourcePulse · 5k stars
Created 9 months ago · Updated 6 months ago