bubogpt by magic-research

Multi-modal LLM for joint text, vision, and audio understanding

Created 2 years ago
509 stars

Top 61.4% on SourcePulse

Project Summary

BuboGPT is a multi-modal large language model designed for joint understanding of text, vision, and audio, with a focus on grounding knowledge into visual objects. It targets researchers and developers working on advanced AI systems that require sophisticated cross-modal reasoning and interaction. The project enables more intuitive and context-aware AI applications by integrating audio and visual perception with language processing.

How It Works

BuboGPT builds upon existing multi-modal architectures, integrating vision and audio processing capabilities with a large language model. It leverages pre-trained models for vision (e.g., BLIP-2, GroundingDINO, SAM) and audio understanding, and fine-tunes them for joint multi-modal instruction following. This approach allows for a unified model that can process and reason across different data modalities, enabling tasks like audio-visual grounding and cross-modal question answering.
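The scheme described above can be sketched as a toy example: frozen modality encoders emit features of different widths, learned linear projections map them into the language model's embedding space, and the projected tokens are concatenated with the text tokens to form one unified input sequence. All shapes and names below are illustrative placeholders, not BuboGPT's actual layers (its real pipeline uses BLIP-2 and an audio encoder feeding Vicuna).

```python
import numpy as np

rng = np.random.default_rng(0)

D_LLM = 8  # hidden size of the language model (toy value)

def project(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map modality-specific features into the LLM embedding space."""
    return features @ proj

# Frozen "encoders" emit modality features of different widths.
vision_feats = rng.normal(size=(4, 6))       # 4 visual tokens, width 6
audio_feats = rng.normal(size=(3, 5))        # 3 audio tokens, width 5
text_embeds = rng.normal(size=(10, D_LLM))   # tokenized instruction

# Learned linear projections align each modality with the LLM.
W_vision = rng.normal(size=(6, D_LLM))
W_audio = rng.normal(size=(5, D_LLM))

# The unified input is simply the concatenated token sequence, which the
# LLM then processes as ordinary context for cross-modal reasoning.
llm_input = np.concatenate(
    [project(vision_feats, W_vision), project(audio_feats, W_audio), text_embeds]
)
print(llm_input.shape)  # (17, 8): 4 visual + 3 audio + 10 text tokens
```

Because only the projection layers need training, the encoders and LLM can stay frozen during the multi-modal instruction-tuning stage.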

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip3 install -r pre-requirements.txt and pip3 install -r requirements.txt.
  • Prerequisites: Python 3.9, CUDA 11.7, PyTorch 2.0.1. Requires downloading pre-trained checkpoints for Vicuna, BLIP-2, RAM, GroundingDINO, SAM, and BuboGPT itself.
  • Setup: Requires preparing specific datasets for different training stages.
  • Demo: Run python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0.
  • Links: Project Page, arXiv, Demo Video, Gradio, Data, Model
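The steps above can be collected into a single shell session. The repository URL is assumed from the project name (magic-research/bubogpt); the config path and GPU id are the README defaults and may differ on your machine.

```shell
# Clone the repository and install dependencies (two-stage requirements
# files, as listed in the quick start).
git clone https://github.com/magic-research/bubogpt.git
cd bubogpt
pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt

# Pre-trained checkpoints (Vicuna, BLIP-2, RAM, GroundingDINO, SAM, and
# the BuboGPT weights) must be downloaded separately before the demo runs.

# Launch the demo on GPU 0.
python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0
```

Note that the demo requires a CUDA 11.7-capable GPU and the checkpoints above; it will not run on a bare clone.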

Highlighted Details

  • Supports joint understanding of text, vision, and audio.
  • Enables visual grounding of knowledge into specific objects.
  • Demonstrates capabilities in image understanding, audio understanding, and aligned audio-image understanding.
  • Codebase is based on MiniGPT-4, GroundingDINO, Recognize Anything, and Segment Anything.

Maintenance & Community

  • Project lead: Bingyi Kang.
  • Hugging Face demo released July 21, 2023.
  • External resources (arXiv paper, Hugging Face demo) are linked from the README.

Licensing & Compatibility

  • The README does not state a license. Because the codebase builds on MiniGPT-4, GroundingDINO, Recognize Anything, and Segment Anything, the licenses of those upstream projects may also apply. Suitability for commercial use is not specified.

Limitations & Caveats

The setup process requires downloading multiple large pre-trained models and preparing specific datasets, which can be resource-intensive. The project is based on several other repositories, implying potential dependencies and maintenance considerations inherited from them.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
