Groma by FoundationVision

MLLM for grounded visual understanding using localized visual tokenization

created 1 year ago
577 stars

Top 56.8% on sourcepulse

View on GitHub
Project Summary

Groma is a multimodal large language model (MLLM) designed for advanced region understanding and visual grounding. It enables users to input specific image regions (via bounding boxes) and receive detailed, contextually grounded responses, offering a novel approach to MLLMs by integrating localized visual tokenization.

How It Works

Groma takes a distinctive "visual tokenizer" approach to localization: rather than relying on the LLM itself or on external modules to localize objects, it tokenizes visual regions directly. This enables more precise grounding and lets the model generate long-form, contextually relevant outputs tied to specific image areas.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n groma python=3.9), activate it, install PyTorch with CUDA 11.8 support, then install Groma and MMCV with ops; flash-attn is recommended for training. See the install sketch after this list.
  • Prerequisites: Python 3.9, PyTorch 2.1.0, torchvision 0.16.0, torchaudio 2.1.0, CUDA 11.8, MMCV with ops, Ninja, flash-attn.
  • Model Weights: Downloadable from Hugging Face.
  • Data: Datasets for detection, alignment, and instruction finetuning are detailed in DATA.md.
  • Inference: python -m groma.eval.run_groma --model-name {path_to_groma_7b_finetune} --image-file {path_to_img} --query {user_query} --quant_type 'none'
  • Resources: Requires significant disk space for datasets and model weights. Training requires substantial GPU resources.
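
A minimal install-and-run sketch following the steps above; the repository URL, the editable install, and the exact package flags are assumptions here, so defer to the repository README for the authoritative commands:

    # Clone and set up the environment (versions taken from the prerequisites above)
    git clone https://github.com/FoundationVision/Groma.git
    cd Groma
    conda create -n groma python=3.9 -y
    conda activate groma

    # PyTorch with CUDA 11.8 support
    pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

    # Groma itself, plus build tooling; flash-attn is recommended for training
    pip install -e .
    pip install ninja
    pip install flash-attn --no-build-isolation
    # MMCV with ops: install the build matching your CUDA/PyTorch combination per the MMCV docs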

Highlighted Details

  • State-of-the-art performance on referring expression comprehension (REC) benchmarks among MLLMs.
  • Novel localized visual tokenization paradigm for enhanced region understanding.
  • Built upon LLaVA and GPT4RoI.
  • Supports multiple quantization types for inference (none, fp16, 8bit, 4bit); see the example below.
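
For example, the inference command from Quick Start can be run with 8-bit quantization; the model and image paths below are placeholders:

    # 8-bit quantized inference (paths and query are placeholders)
    python -m groma.eval.run_groma \
        --model-name checkpoints/groma-7b-finetune \
        --image-file examples/demo.jpg \
        --query "What is in the highlighted region?" \
        --quant_type '8bit'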

Maintenance & Community

The project comes from authors at FoundationVision and is accompanied by an ECCV 2024 paper. The README does not explicitly link to community channels or a roadmap.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README focuses on installation and training from scratch, with limited detail on using the pre-trained models for quick inference beyond the command above. Specific hardware requirements for optimal performance are not documented.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 19 stars in the last 90 days
