MLLM for grounded visual understanding using localized visual tokenization
Groma is a multimodal large language model (MLLM) designed for advanced region understanding and visual grounding. It enables users to input specific image regions (via bounding boxes) and receive detailed, contextually grounded responses, offering a novel approach to MLLMs by integrating localized visual tokenization.
How It Works
Groma performs localization through a dedicated visual tokenizer, which distinguishes it from MLLMs that delegate localization to the LLM itself or to an external detection module. Image regions are tokenized directly into region tokens, allowing more precise grounding and enabling the model to generate long-form, contextually relevant outputs tied to specific image areas.
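As a rough conceptual sketch of that idea (all names below are hypothetical and do not reflect Groma's actual code), user-supplied bounding boxes can be mapped to dedicated region tokens that are interleaved with the text prompt, so the model refers to regions symbolically instead of emitting coordinates as text:

# Conceptual sketch of localized visual tokenization; names are hypothetical,
# not Groma's real interface.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class RegionToken:
    index: int  # slot in a region-token vocabulary: <r0>, <r1>, ...
    box: Box

def tokenize_regions(boxes: List[Box]) -> List[RegionToken]:
    # Each user-provided box becomes a discrete token the LLM can cite.
    return [RegionToken(index=i, box=b) for i, b in enumerate(boxes)]

def build_prompt(query: str, regions: List[RegionToken]) -> str:
    # Region tokens sit alongside the image placeholder and the text query.
    region_str = " ".join(f"<r{r.index}>" for r in regions)
    return f"<image> {region_str} {query}"

boxes = [(0.10, 0.20, 0.45, 0.60)]
print(build_prompt("What is the person in this region doing?", tokenize_regions(boxes)))
# -> <image> <r0> What is the person in this region doing?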
Quick Start & Requirements
Create a conda environment (conda create -n groma python=3.9), activate it, install PyTorch with CUDA 11.8 support, and then install Groma and MMCV with ops. Flash-attention is recommended for training. Data preparation is described in DATA.md.
Run inference with:
python -m groma.eval.run_groma --model-name {path_to_groma_7b_finetune} --image-file {path_to_img} --query {user_query} --quant_type 'none'
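The same CLI can be driven from Python for batch runs; a minimal sketch, assuming only the command and flags shown above (all paths are placeholders):

# Minimal sketch: loop the groma.eval.run_groma command above over a folder of
# images. Paths are placeholders; only the flags shown above are assumed.
import subprocess
from pathlib import Path

model_path = "/path/to/groma_7b_finetune"  # placeholder
image_dir = Path("/path/to/images")        # placeholder
query = "Describe the highlighted region."

for image_file in sorted(image_dir.glob("*.jpg")):
    subprocess.run(
        [
            "python", "-m", "groma.eval.run_groma",
            "--model-name", model_path,
            "--image-file", str(image_file),
            "--query", query,
            "--quant_type", "none",
        ],
        check=True,  # raise if any invocation fails
    )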
Highlighted Details
Inference supports four quantization settings via --quant_type: none, fp16, 8bit, and 4bit.
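These settings imply the usual memory/precision trade-off. As a purely hypothetical illustration (not Groma's implementation), such a flag is often translated into Hugging Face-style model-loading arguments like this:

# Hypothetical illustration only: a common way to map a --quant_type flag to
# Hugging Face-style from_pretrained() kwargs. This is not Groma's code.
import torch

def quant_kwargs(quant_type: str) -> dict:
    if quant_type == "none":
        return {}                              # full precision
    if quant_type == "fp16":
        return {"torch_dtype": torch.float16}  # half precision
    if quant_type == "8bit":
        return {"load_in_8bit": True}          # bitsandbytes 8-bit weights
    if quant_type == "4bit":
        return {"load_in_4bit": True}          # bitsandbytes 4-bit weights
    raise ValueError(f"unknown quant_type: {quant_type}")

print(quant_kwargs("8bit"))  # {'load_in_8bit': True}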
Maintenance & Community
The project comes from authors affiliated with FoundationVision and is accompanied by an ECCV 2024 paper. Links to community channels or a roadmap are not explicitly provided in the README.
Licensing & Compatibility
Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README focuses on installation and training from scratch, with limited details on pre-trained model usage for quick inference beyond the provided command. Specific hardware requirements for optimal performance are not detailed.