GroundingGPT by lzw-lzw

Multimodal grounding model (research paper)

Created 1 year ago
336 stars

Top 81.8% on SourcePulse

View on GitHub
Project Summary

GroundingGPT is a multimodal grounding model designed for accurate comprehension and robust grounding across images, audio, and video. It targets researchers and developers working on advanced multimodal AI, offering a unified approach to complex grounding tasks and providing a valuable, diverse training dataset to advance the field.

How It Works

GroundingGPT employs a language-enhanced architecture to integrate multimodal inputs. It leverages pre-trained models like ImageBind and BLIP-2, fine-tuning them for enhanced spatial and temporal understanding. This approach aims to improve accuracy and robustness in grounding by effectively combining linguistic context with visual and auditory information.
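
As a rough illustration of that pattern (not GroundingGPT's actual code), the sketch below shows frozen-encoder features being projected into an LLM's embedding space and concatenated with text token embeddings. Every class name, tensor dimension, and the simple linear-projection design are assumptions made for this example; the real model's encoders would be ImageBind and a BLIP-2-style visual backbone.

    # Illustrative sketch only -- not GroundingGPT's implementation.
    # Frozen modality encoders produce features; small projection layers map
    # them into the language model's embedding space, where they are
    # concatenated with text token embeddings for grounded generation.
    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Maps a frozen encoder's output into the LLM embedding space."""
        def __init__(self, encoder_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(encoder_dim, llm_dim)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)

    class ToyMultimodalFusion(nn.Module):
        """Stand-in for the fusion step; dimensions here are hypothetical."""
        def __init__(self, image_dim=1408, audio_dim=1024, llm_dim=4096):
            super().__init__()
            self.image_proj = ModalityProjector(image_dim, llm_dim)
            self.audio_proj = ModalityProjector(audio_dim, llm_dim)

        def forward(self, image_feats, audio_feats, text_embeds):
            # Projected modality tokens are prepended to the text embeddings;
            # the resulting sequence would be fed to the language model.
            return torch.cat(
                [self.image_proj(image_feats),
                 self.audio_proj(audio_feats),
                 text_embeds],
                dim=1,
            )

    # Smoke test with random tensors standing in for real encoder outputs.
    fusion = ToyMultimodalFusion()
    fused = fusion(
        torch.randn(1, 32, 1408),  # 32 visual tokens (hypothetical shape)
        torch.randn(1, 8, 1024),   # 8 audio tokens (hypothetical shape)
        torch.randn(1, 16, 4096),  # 16 text token embeddings (hypothetical)
    )
    print(fused.shape)  # torch.Size([1, 56, 4096])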

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n groundinggpt python=3.10), activate it, and install dependencies (pip install -r requirements.txt); a consolidated command sketch follows this list.
  • Prerequisites: Download the ImageBind (imagebind_huge.pth) and BLIP-2 (blip2_pretrained_flant5xxl.pth) checkpoints and place them in the ./ckpt/ directory. Several datasets (LLaVA, COCO, GQA, etc.) are also required; follow their respective repositories for download instructions.
  • Demo: Launch a Gradio web demo with python3 lego/serve/gradio_web_server.py after downloading the GroundingGPT-7B model and updating the model path.
  • Inference: Run inference using python3 lego/serve/cli.py after downloading the GroundingGPT-7B model and updating the model path.
  • Links: Project page: https://github.com/lzw-lzw/GroundingGPT
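
The bullets above boil down to roughly the following commands (a sketch: the clone URL comes from the project page, and the mkdir/cp lines for checkpoint placement are illustrative placeholders; defer to the upstream README for the authoritative steps):

    # Environment setup
    git clone https://github.com/lzw-lzw/GroundingGPT.git
    cd GroundingGPT
    conda create -n groundinggpt python=3.10 -y
    conda activate groundinggpt
    pip install -r requirements.txt

    # Place the downloaded checkpoints in ./ckpt/ (placeholder paths)
    mkdir -p ckpt
    # cp /path/to/imagebind_huge.pth ckpt/
    # cp /path/to/blip2_pretrained_flant5xxl.pth ckpt/

    # Gradio web demo (after downloading GroundingGPT-7B and updating the model path)
    python3 lego/serve/gradio_web_server.py

    # Or command-line inference
    python3 lego/serve/cli.py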

Highlighted Details

  • End-to-end multimodal grounding model.
  • Supports images, audio, and video inputs.
  • Includes a diverse, high-quality multimodal training dataset with spatial and temporal information.
  • Accepted to ACL 2024.

Maintenance & Community

The project is associated with the ACL 2024 conference. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Given the academic nature (ACL 2024) and reliance on other models, users should verify licensing for commercial use and closed-source integration.

Limitations & Caveats

The setup requires downloading multiple large checkpoints and datasets, which can be time-consuming and resource-intensive. The project is research-oriented, and production deployment will likely require further adaptation.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

Otter by EvolvingLMMs-Lab

0.0%
3k
Multimodal model for improved instruction following and in-context learning
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago