Multimodal model for instruction following across six modalities
PandaGPT is a multimodal instruction-following foundation model designed for researchers and power users. A single model can process and respond to instructions spanning six modalities (image/video, text, audio, depth, thermal, and IMU readings), enabling cross-modal understanding and complex reasoning.
How It Works
PandaGPT connects the ImageBind multimodal encoder to the Vicuna language model to obtain a unified instruction-following capability: ImageBind maps every supported modality into one shared embedding space, and those embeddings condition Vicuna's generation. The fine-tuned weights are distributed as deltas to be applied on top of the base models. Because the embedding space is shared, visual and auditory inputs compose their semantics naturally, supporting tasks such as detailed image description or audio-grounded story generation.
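At a high level, the integration can be pictured as a frozen multimodal encoder feeding a small trainable projection that maps its embeddings into the language model's input space. The sketch below is illustrative only: the dimensions, the averaging fusion rule, and the `MultimodalPrefix` name are assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of PandaGPT-style wiring: a frozen multimodal
# encoder (ImageBind in the real project) produces one embedding per
# input; a small trainable linear layer projects it into the language
# model's token-embedding space so it can be prepended as a soft prompt.
# Dimensions are illustrative, not the project's real configuration.

ENCODER_DIM = 1024   # assumed ImageBind embedding size
LLM_DIM = 4096       # assumed Vicuna hidden size

class MultimodalPrefix(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # The only trainable piece in this sketch: encoder space -> LLM space.
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, image_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Because ImageBind aligns modalities in one space, embeddings
        # from different modalities can be combined (here: averaged)
        # before projection.
        fused = (image_emb + audio_emb) / 2
        return self.proj(fused).unsqueeze(1)  # (batch, 1, llm_dim) prefix token

prefix = MultimodalPrefix(ENCODER_DIM, LLM_DIM)
img = torch.randn(2, ENCODER_DIM)   # stand-ins for ImageBind outputs
aud = torch.randn(2, ENCODER_DIM)
print(prefix(img, aud).shape)       # torch.Size([2, 1, 4096])
```

A design like this keeps both large models frozen, so only the thin adapter between them has to be trained, which is what makes releasing the result as compact delta weights practical.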
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt` and PyTorch with CUDA support (e.g., `pip install torch==1.13.1+cu117`).
- Download the PandaGPT delta weights (e.g., `openllmplayground/pandagpt_7b_max_len_1024`); see the delta-merge sketch after this list.
- Launch the demo: `cd ./code/ && CUDA_VISIBLE_DEVICES=0 python web_demo.py`.
- If you hit `sample_rate` issues, install `pytorchvideo` from source.
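The delta weights in the second step follow a common release pattern: the published checkpoint stores differences from a base model rather than the full fine-tuned weights. A minimal sketch of the merge idea, with toy tensors standing in for real checkpoints (not the project's actual checkpoint format or parameter names):

```python
import torch

# Illustrative delta-weight merge: the released checkpoint holds
# (finetuned - base) differences, so the usable model is recovered by
# adding each delta onto the matching base parameter.
base = {"proj.weight": torch.zeros(4, 4)}        # stand-in for a base parameter
delta = {"proj.weight": 0.5 * torch.ones(4, 4)}  # stand-in for a released delta

merged = {name: base[name] + delta[name] for name in base}
assert torch.allclose(merged["proj.weight"], 0.5 * torch.ones(4, 4))
```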
Maintenance & Community
Last commit: 2 years ago.
Licensing & Compatibility
The project's dataset and delta weights are licensed under CC BY-NC 4.0.
Limitations & Caveats
Under the CC BY-NC 4.0 terms, usage is strictly limited to non-commercial, research purposes.