bubogpt by magic-research

Multi-modal LLM for joint text, vision, and audio understanding

created 2 years ago
511 stars

Top 62.0% on sourcepulse

View on GitHub
Project Summary

BuboGPT is a multi-modal large language model designed for joint understanding of text, vision, and audio, with a focus on grounding knowledge into visual objects. It targets researchers and developers working on advanced AI systems that require sophisticated cross-modal reasoning and interaction. The project enables more intuitive and context-aware AI applications by integrating audio and visual perception with language processing.

How It Works

BuboGPT builds upon existing multi-modal architectures, integrating vision and audio processing capabilities with a large language model. It leverages pre-trained models for vision (e.g., BLIP-2, GroundingDINO, SAM) and audio understanding, and fine-tunes them for joint multi-modal instruction following. This approach allows for a unified model that can process and reason across different data modalities, enabling tasks like audio-visual grounding and cross-modal question answering.
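The pipeline described above can be sketched in miniature. This is an illustrative toy, not BuboGPT's actual code: the real system uses frozen pre-trained encoders (e.g., BLIP-2 for vision) whose features are projected into the language model's token-embedding space, and every name below is hypothetical.

```python
from typing import List

EMBED_DIM = 4  # toy embedding size standing in for the LLM's hidden size


def project(features: List[float], dim: int = EMBED_DIM) -> List[float]:
    """Linear-projection stand-in: pad or truncate features to the LLM dim."""
    return (features + [0.0] * dim)[:dim]


def encode_vision(pixels: List[float]) -> List[float]:
    """Stand-in for a frozen vision encoder (summary statistics as 'features')."""
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels), min(pixels)]


def encode_audio(samples: List[float]) -> List[float]:
    """Stand-in for a frozen audio encoder (signal energy as a 'feature')."""
    energy = sum(s * s for s in samples) / len(samples)
    return [energy]


def build_prompt(text_tokens: List[str], image=None, audio=None) -> List[object]:
    """Interleave projected modality embeddings with text tokens, mirroring
    how a multi-modal prompt is assembled before the LLM consumes it."""
    seq: List[object] = []
    if image is not None:
        seq.append(("<img>", project(encode_vision(image))))
    if audio is not None:
        seq.append(("<aud>", project(encode_audio(audio))))
    seq.extend(text_tokens)
    return seq


prompt = build_prompt(["What", "is", "making", "this", "sound", "?"],
                      image=[0.1, 0.9, 0.5], audio=[0.2, -0.2, 0.4])
```

The key idea the sketch preserves is that each modality gets its own encoder, and a projection layer maps every encoder's output into one shared sequence, so the language model can attend across text, image, and audio tokens uniformly.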

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip3 install -r pre-requirements.txt and pip3 install -r requirements.txt.
  • Prerequisites: Python 3.9, CUDA 11.7, PyTorch 2.0.1. Requires downloading pre-trained checkpoints for Vicuna, BLIP-2, RAM, GroundingDINO, SAM, and BuboGPT itself.
  • Setup: Requires preparing specific datasets for different training stages.
  • Demo: Run python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0.
  • Links: Project Page, arXiv, Demo Video, Gradio, Data, Model
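The install and demo steps above amount to the following shell session. The repository URL is an assumption inferred from the project and author names; the pip and demo commands are taken from the README.

```shell
# Clone the repository (URL assumed from the project/author names)
git clone https://github.com/magic-research/bubogpt.git
cd bubogpt

# Install dependencies in two stages, as the README specifies
pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt

# After downloading the Vicuna, BLIP-2, RAM, GroundingDINO, SAM, and
# BuboGPT checkpoints, launch the demo on GPU 0
python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0
```

Note that the demo will only start once the pre-trained checkpoints listed under Prerequisites are in place and the paths in eval_configs/mmgpt4_eval.yaml point to them.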

Highlighted Details

  • Supports joint understanding of text, vision, and audio.
  • Enables visual grounding of knowledge into specific objects.
  • Demonstrates capabilities in image understanding, audio understanding, and aligned audio-image understanding.
  • Codebase is based on MiniGPT-4, GroundingDINO, Recognize Anything, and Segment Anything.

Maintenance & Community

  • Project lead: Bingyi Kang.
  • Hugging Face demo released July 21, 2023.
  • Links to external resources, including the arXiv paper and the Hugging Face demo, are provided.

Licensing & Compatibility

  • The README does not state a license. Because BuboGPT builds on MiniGPT-4, GroundingDINO, Recognize Anything, and Segment Anything, the licenses of those projects may also apply. Suitability for commercial use is not specified.

Limitations & Caveats

The setup process requires downloading multiple large pre-trained checkpoints and preparing stage-specific training datasets, which can be resource-intensive. Because the project builds on several other repositories, it also inherits their dependencies and maintenance burden.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

  • Top 0.1%, 10k stars
  • Audio processing and generation research project
  • Created 2 years ago, updated 1 year ago