LMM for grounded conversation generation
GLaMM (Grounding Large Multimodal Model), introduced in a CVPR 2024 paper, is a model for grounded conversation generation that integrates natural language responses with object segmentation masks. It targets researchers and developers working on advanced vision-language tasks, offering a unified framework for phrase grounding, referring expression segmentation, and conversational AI with detailed visual understanding.
How It Works
GLaMM is an end-to-end trained Large Multimodal Model (LMM) designed to process both image-level and region-specific visual inputs alongside text. Its core innovation lies in generating natural language responses that are directly linked to precise object segmentation masks, enabling a new task called Grounded Conversation Generation (GCG). This approach allows for interaction at multiple granularities, from whole images to specific regions, facilitating detailed visual grounding and reasoning.
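To make the GCG idea concrete, the sketch below shows one way a grounded response could be represented and parsed: phrases in the generated text are paired with per-phrase segmentation masks. This is an illustrative sketch only; the <p>...</p> / [SEG] tagging scheme, the GroundedPhrase container, and parse_grounded_response are assumptions for demonstration, not the model's actual output format or API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class GroundedPhrase:
    """A phrase from the generated response tied to a binary segmentation mask."""
    text: str                   # grounded noun phrase, e.g. "a brown dog"
    mask: Optional[np.ndarray]  # H x W boolean mask for the referenced region


def parse_grounded_response(raw_text: str, masks: List[np.ndarray]) -> List[GroundedPhrase]:
    """Pair each tagged phrase in the generated text with its predicted mask.

    Assumes (hypothetically) that the model emits phrases wrapped in <p>...</p>
    followed by a [SEG] placeholder, and that `masks` holds one mask per [SEG]
    in reading order.
    """
    phrases = re.findall(r"<p>(.*?)</p>\s*\[SEG\]", raw_text)
    paired = []
    for i, phrase in enumerate(phrases):
        mask = masks[i] if i < len(masks) else None
        paired.append(GroundedPhrase(text=phrase.strip(), mask=mask))
    return paired


if __name__ == "__main__":
    # Toy example: two tagged phrases, two dummy 4x4 masks.
    text = "The image shows <p>a brown dog</p> [SEG] chasing <p>a red ball</p> [SEG]."
    dummy_masks = [np.zeros((4, 4), dtype=bool), np.ones((4, 4), dtype=bool)]
    for item in parse_grounded_response(text, dummy_masks):
        print(item.text, "-> mask shape:", None if item.mask is None else item.mask.shape)
```

In this representation, a single generated sentence can ground several regions at once, which matches the paper's description of responses linked to multiple object masks at different granularities.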
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with multiple universities and Google Research. Updates include the release of VideoGLaMM and the GranD dataset with an automated annotation pipeline. The paper is available on arXiv.
Licensing & Compatibility
The README does not explicitly state the license type or any restrictions for commercial use or closed-source linking.
Limitations & Caveats
The project is a research artifact from CVPR 2024; the provided README does not detail ongoing maintenance plans, community support channels, or known limitations.