Vision model research combining visual perception, reasoning, and multi-modal language tasks
This repository provides codebases for OMG-LLaVA and OMG-Seg, which aim to unify multiple visual perception and reasoning tasks in single models. OMG-LLaVA bridges image-level, object-level, and pixel-level understanding for multimodal large language model applications, while OMG-Seg offers a universal solution for over ten distinct segmentation tasks spanning image, video, and open-vocabulary settings. The target audience includes researchers and practitioners in computer vision and multimodal AI seeking efficient, high-performance, and unified solutions.
How It Works
OMG-LLaVA uses a universal segmentation model as its visual encoder, converting visual information and visual prompts into tokens for a large language model (LLM). The LLM takes text instructions together with these tokens and produces both text responses and segmentation-token outputs, which a decoder converts into pixel-level masks. This design allows a single encoder, decoder, and LLM to be trained end-to-end for diverse multimodal tasks. OMG-Seg employs a transformer-based encoder-decoder architecture with task-specific queries and outputs, allowing one model to handle numerous segmentation tasks with reduced computational overhead.
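The sketch below illustrates this token flow in PyTorch: visual tokens from a (frozen) segmentation encoder are projected into the LLM space, concatenated with text embeddings and learnable segmentation queries, and the query hidden states are mapped back for mask decoding. This is a minimal conceptual sketch, not the authors' implementation; the class name `OMGStyleBridge`, dimensions, and the toy LLM/mask decoder are all illustrative placeholders.

```python
# Conceptual sketch (not the official implementation) of bridging a
# segmentation encoder and an LLM with segmentation-query tokens.
import torch
import torch.nn as nn


class OMGStyleBridge(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, num_seg_queries=4):
        super().__init__()
        # Projects encoder features (pixel/object tokens) into the LLM space.
        self.visual_proj = nn.Linear(vis_dim, llm_dim)
        # Learnable queries whose LLM hidden states drive pixel-level output.
        self.seg_queries = nn.Parameter(torch.randn(num_seg_queries, llm_dim))
        # Maps query hidden states back to the mask-decoder feature space.
        self.seg_proj = nn.Linear(llm_dim, vis_dim)

    def forward(self, visual_feats, text_embeds, llm, mask_decoder):
        # visual_feats: (B, N_vis, vis_dim) tokens from the segmentation encoder
        # text_embeds:  (B, N_txt, llm_dim) embedded instruction tokens
        b = visual_feats.size(0)
        vis_tokens = self.visual_proj(visual_feats)               # (B, N_vis, llm_dim)
        queries = self.seg_queries.unsqueeze(0).expand(b, -1, -1)
        llm_input = torch.cat([vis_tokens, text_embeds, queries], dim=1)
        hidden = llm(llm_input)                                   # (B, N_total, llm_dim)
        # Hidden states of the trailing query tokens become mask embeddings.
        query_hidden = hidden[:, -queries.size(1):]
        masks = mask_decoder(self.seg_proj(query_hidden), visual_feats)
        return hidden, masks


if __name__ == "__main__":
    bridge = OMGStyleBridge()
    # Stand-in for the LLM: a single transformer encoder layer.
    dummy_llm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=1,
    )

    # Toy mask decoder: similarity between mask embeddings and visual tokens.
    def mask_decoder(query_embeds, visual_feats):
        return torch.einsum("bqd,bnd->bqn", query_embeds, visual_feats)

    vis = torch.randn(2, 100, 256)   # 100 visual tokens per image
    txt = torch.randn(2, 20, 512)    # 20 embedded instruction tokens
    hidden, masks = bridge(vis, txt, dummy_llm, mask_decoder)
    print(hidden.shape, masks.shape)  # (2, 124, 512) (2, 4, 100)
```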
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Code and checkpoints for both OMG-LLaVA and OMG-Seg have been released, with Hugging Face support available. Links to the project pages and arXiv papers are provided.
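Since checkpoints are hosted on the Hugging Face Hub, they can be fetched with `huggingface_hub`. The sketch below is a hypothetical example: the repository id is a placeholder, and the actual model repositories and file layout should be taken from the project pages.

```python
# Hypothetical example of downloading released checkpoints from the
# Hugging Face Hub; replace the placeholder repo_id with the real one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/OMG-Seg")  # placeholder repo id
print("Checkpoints downloaded to:", local_dir)
```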
Licensing & Compatibility
Limitations & Caveats
The upstream README lists more easy-to-use tutorials as a to-do item, and performance benchmarks for all supported tasks are not exhaustively detailed in the main README.