groundingLMM by mbzuai-oryx

LMM for grounded conversation generation

created 1 year ago
901 stars

Top 41.2% on sourcepulse

Project Summary

GLaMM (Grounding Large Multimodal Model), introduced in a CVPR 2024 paper, presents a novel approach to grounded conversation generation, integrating natural language responses with object segmentation masks. It targets researchers and developers working on advanced vision-language tasks, offering a unified framework for phrase grounding, referring expression segmentation, and conversational AI with detailed visual understanding.

How It Works

GLaMM is an end-to-end trained Large Multimodal Model (LMM) designed to process both image-level and region-specific visual inputs alongside text. Its core innovation lies in generating natural language responses that are directly linked to precise object segmentation masks, enabling a new task called Grounded Conversation Generation (GCG). This approach allows for interaction at multiple granularities, from whole images to specific regions, facilitating detailed visual grounding and reasoning.
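
To make the phrase-to-mask linkage concrete, below is a minimal, runnable sketch of what a decoded GCG-style response could look like. The `<p>...</p>` phrase tags, mask shapes, and variable names are illustrative assumptions for this summary, not the repo's actual output specification.

```python
# Illustrative sketch of a decoded Grounded Conversation Generation (GCG)
# response: natural language in which each grounded phrase is tied to a
# binary segmentation mask. Tag format and shapes are assumptions here,
# not GLaMM's actual API.
import re
import numpy as np

# Mocked model output: grounded phrases wrapped in <p>...</p> tags,
# with one H x W mask per phrase, in order of appearance.
text = "A <p>man</p> rides a <p>bicycle</p> down a <p>street</p>."
H, W = 480, 640
masks = [np.zeros((H, W), dtype=bool) for _ in range(3)]  # placeholder masks

# Pair each grounded phrase with its corresponding mask.
phrases = re.findall(r"<p>(.*?)</p>", text)
groundings = list(zip(phrases, masks))

for phrase, mask in groundings:
    print(f"{phrase!r} -> mask {mask.shape}, {int(mask.sum())} foreground pixels")
```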

Quick Start & Requirements

  • Installation: Setup involves creating a conda environment. Detailed guides are provided for training, evaluation, and running a local demo.
  • Prerequisites: Requires specific datasets (GranD, GranD-f) and pretrained checkpoints, which are available for download; a fetch sketch follows this list.
  • Resources: Links to official guides for installation, datasets, model zoo, training, evaluation, and a demo are available.
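
As a sketch of the checkpoint step above, the snippet below fetches pretrained weights with `huggingface_hub`, assuming the checkpoints are published on the Hugging Face Hub; the `repo_id` shown is illustrative, so consult the project's model zoo guide for the actual identifiers.

```python
# Hedged sketch: download a pretrained GLaMM checkpoint from the Hugging Face
# Hub. The repo_id is an assumption; verify it against the project's model zoo.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="MBZUAI/GLaMM-FullScope",   # assumed identifier
    local_dir="checkpoints/glamm",      # where to place the weights locally
)
print(f"Checkpoint downloaded to: {ckpt_dir}")
```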

Highlighted Details

  • Introduces the novel Grounded Conversation Generation (GCG) task and its evaluation protocol.
  • Presents the GranD dataset, featuring 7.5M unique concepts grounded in 810M regions with segmentation masks.
  • Demonstrates strong performance in referring expression segmentation, region-level captioning, and conversational QA.
  • Offers pretrained checkpoints for GLaMM's various capabilities.

Maintenance & Community

The project is associated with multiple universities and Google Research. Updates include the release of VideoGLaMM and the GranD dataset with an automated annotation pipeline. The paper is available on arXiv.

Licensing & Compatibility

The README does not explicitly state the license type or any restrictions for commercial use or closed-source linking.

Limitations & Caveats

The project is presented as a research artifact from CVPR 2024; the provided README does not detail ongoing maintenance plans, community support channels, or known limitations.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 36 stars in the last 90 days
