Multi-modal LLM for joint text, vision, and audio understanding
BuboGPT is a multi-modal large language model designed for joint understanding of text, vision, and audio, with a particular focus on grounding language in specific visual objects. It targets researchers and developers building AI systems that require cross-modal reasoning and interaction. By integrating audio and visual perception with language processing, the project enables more intuitive and context-aware AI applications.
How It Works
BuboGPT builds upon existing multi-modal architectures, integrating vision and audio processing capabilities with a large language model. It leverages pre-trained models for vision (e.g., BLIP-2, GroundingDINO, SAM) and audio understanding, and fine-tunes them for joint multi-modal instruction following. This approach allows for a unified model that can process and reason across different data modalities, enabling tasks like audio-visual grounding and cross-modal question answering.
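The sketch below illustrates, at a high level, the kind of modality fusion described above: frozen vision and audio encoders produce feature tokens, small projection layers map them into the language model's embedding space, and the projected tokens are concatenated with the embedded text prompt before being passed to the LLM. This is a minimal sketch, not BuboGPT's actual code; the class name, feature dimensions, and placeholder tensors are assumptions for illustration only.

    # Minimal sketch (hypothetical, not BuboGPT's implementation) of projecting
    # per-modality features into a shared LLM token space for joint reasoning.
    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Maps frozen encoder features (vision or audio) to the LLM hidden size."""
        def __init__(self, feat_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(feat_dim, llm_dim)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)

    # Assumed dimensions: 1408 (vision features), 768 (audio features), 4096 (LLM hidden size).
    vision_proj = ModalityProjector(1408, 4096)
    audio_proj = ModalityProjector(768, 4096)

    vision_feats = torch.randn(1, 32, 1408)  # placeholder visual tokens from a frozen encoder
    audio_feats = torch.randn(1, 8, 768)     # placeholder audio tokens from a frozen encoder
    text_embeds = torch.randn(1, 16, 4096)   # placeholder embedded prompt tokens

    # Concatenate projected modality tokens with text embeddings; the resulting
    # sequence would then be fed to the language model (omitted here).
    llm_inputs = torch.cat(
        [vision_proj(vision_feats), audio_proj(audio_feats), text_embeds], dim=1
    )
    print(llm_inputs.shape)  # torch.Size([1, 56, 4096])

In practice only the projection layers (and any instruction-tuning adapters) are trained, while the pre-trained encoders and the LLM stay frozen, which keeps fine-tuning comparatively cheap.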
Quick Start & Requirements
Install the Python dependencies in two passes, then launch the demo:

pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt
python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0
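Because the demo is launched with --gpu-id 0, it can help to confirm that PyTorch actually sees a CUDA device first. The snippet below is a generic PyTorch check, not part of BuboGPT:

    # Generic sanity check (standard PyTorch, not a BuboGPT utility): confirm a
    # CUDA device is visible before passing --gpu-id 0 to the demo script.
    import torch

    if torch.cuda.is_available():
        print(f"Using GPU 0: {torch.cuda.get_device_name(0)}")
    else:
        print("No CUDA device found; the demo expects a GPU for inference.")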
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The setup process requires downloading multiple large pre-trained checkpoints and preparing specific datasets, which can be resource-intensive. Because the project builds on several upstream repositories, it also inherits their dependencies and any maintenance issues they carry.