MLLM for grounded visual understanding using localized visual tokenization
Groma is a multimodal large language model (MLLM) designed for advanced region understanding and visual grounding. It enables users to input specific image regions (via bounding boxes) and receive detailed, contextually grounded responses, offering a novel approach to MLLMs by integrating localized visual tokenization.
How It Works
Groma performs localization through a dedicated visual tokenizer, which distinguishes it from MLLMs that delegate localization to the LLM itself or to an external detection module. Image regions are tokenized directly into region tokens, allowing more precise grounding and enabling the model to generate long-form, contextually relevant outputs tied to specific image areas.
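As a rough conceptual sketch of that idea (all names below are hypothetical and do not reflect Groma's actual code), user-supplied bounding boxes can be mapped to dedicated region tokens that are interleaved with the text prompt, so the model refers to regions symbolically instead of emitting coordinates as text:

# Conceptual sketch of localized visual tokenization; names are hypothetical,
# not Groma's real interface.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class RegionToken:
    index: int  # slot in a region-token vocabulary: <r0>, <r1>, ...
    box: Box

def tokenize_regions(boxes: List[Box]) -> List[RegionToken]:
    # Each user-provided box becomes a discrete token the LLM can cite.
    return [RegionToken(index=i, box=b) for i, b in enumerate(boxes)]

def build_prompt(query: str, regions: List[RegionToken]) -> str:
    # Region tokens sit alongside the image placeholder and the text query.
    region_str = " ".join(f"<r{r.index}>" for r in regions)
    return f"<image> {region_str} {query}"

boxes = [(0.10, 0.20, 0.45, 0.60)]
print(build_prompt("What is the person in this region doing?", tokenize_regions(boxes)))
# -> <image> <r0> What is the person in this region doing?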
Quick Start & Requirements
Create a conda environment (conda create -n groma python=3.9), activate it, install PyTorch with CUDA 11.8 support, and then install Groma and MMCV with ops. Flash-attention is recommended for training. Data preparation is described in DATA.md.
Run inference with:
python -m groma.eval.run_groma --model-name {path_to_groma_7b_finetune} --image-file {path_to_img} --query {user_query} --quant_type 'none'
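The same CLI can be driven from Python for batch runs; a minimal sketch, assuming only the command and flags shown above (all paths are placeholders):

# Minimal sketch: loop the groma.eval.run_groma command above over a folder of
# images. Paths are placeholders; only the flags shown above are assumed.
import subprocess
from pathlib import Path

model_path = "/path/to/groma_7b_finetune"  # placeholder
image_dir = Path("/path/to/images")        # placeholder
query = "Describe the highlighted region."

for image_file in sorted(image_dir.glob("*.jpg")):
    subprocess.run(
        [
            "python", "-m", "groma.eval.run_groma",
            "--model-name", model_path,
            "--image-file", str(image_file),
            "--query", query,
            "--quant_type", "none",
        ],
        check=True,  # raise if any invocation fails
    )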
Highlighted Details
Inference supports four quantization settings via --quant_type: none, fp16, 8bit, and 4bit.
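These settings imply the usual memory/precision trade-off. As a purely hypothetical illustration (not Groma's implementation), such a flag is often translated into Hugging Face-style model-loading arguments like this:

# Hypothetical illustration only: a common way to map a --quant_type flag to
# Hugging Face-style from_pretrained() kwargs. This is not Groma's code.
import torch

def quant_kwargs(quant_type: str) -> dict:
    if quant_type == "none":
        return {}                              # full precision
    if quant_type == "fp16":
        return {"torch_dtype": torch.float16}  # half precision
    if quant_type == "8bit":
        return {"load_in_8bit": True}          # bitsandbytes 8-bit weights
    if quant_type == "4bit":
        return {"load_in_4bit": True}          # bitsandbytes 4-bit weights
    raise ValueError(f"unknown quant_type: {quant_type}")

print(quant_kwargs("8bit"))  # {'load_in_8bit': True}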
Maintenance & Community
The project comes from authors affiliated with FoundationVision and is accompanied by an ECCV 2024 paper. Links to community channels or a roadmap are not explicitly provided in the README.
Licensing & Compatibility
Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README focuses on installation and training from scratch, with limited details on pre-trained model usage for quick inference beyond the provided command. Specific hardware requirements for optimal performance are not detailed.