Multimodal model for instruction following across six modalities
PandaGPT is a multimodal instruction-following foundation model designed for researchers and power users. A single model can process and respond to instructions spanning six modalities (image/video, text, audio, depth, thermal, and IMU readings), enabling cross-modal understanding and complex reasoning.
How It Works
PandaGPT connects the ImageBind multimodal encoder to the Vicuna language model to obtain a unified instruction-following capability: ImageBind maps every supported modality into one shared embedding space, and those embeddings condition Vicuna's generation. The fine-tuned weights are distributed as deltas to be applied on top of the base models. Because the embedding space is shared, visual and auditory inputs compose their semantics naturally, supporting tasks such as detailed image description or audio-grounded story generation.
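At a high level, the integration can be pictured as a frozen multimodal encoder feeding a small trainable projection that maps its embeddings into the language model's input space. The sketch below is illustrative only: the dimensions, the averaging fusion rule, and the `MultimodalPrefix` name are assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of PandaGPT-style wiring: a frozen multimodal
# encoder (ImageBind in the real project) produces one embedding per
# input; a small trainable linear layer projects it into the language
# model's token-embedding space so it can be prepended as a soft prompt.
# Dimensions are illustrative, not the project's real configuration.

ENCODER_DIM = 1024   # assumed ImageBind embedding size
LLM_DIM = 4096       # assumed Vicuna hidden size

class MultimodalPrefix(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # The only trainable piece in this sketch: encoder space -> LLM space.
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, image_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Because ImageBind aligns modalities in one space, embeddings
        # from different modalities can be combined (here: averaged)
        # before projection.
        fused = (image_emb + audio_emb) / 2
        return self.proj(fused).unsqueeze(1)  # (batch, 1, llm_dim) prefix token

prefix = MultimodalPrefix(ENCODER_DIM, LLM_DIM)
img = torch.randn(2, ENCODER_DIM)   # stand-ins for ImageBind outputs
aud = torch.randn(2, ENCODER_DIM)
print(prefix(img, aud).shape)       # torch.Size([2, 1, 4096])
```

A design like this keeps both large models frozen, so only the thin adapter between them has to be trained, which is what makes releasing the result as compact delta weights practical.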
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt` and PyTorch with CUDA support (e.g., `pip install torch==1.13.1+cu117`).
- Download the PandaGPT delta weights (e.g., `openllmplayground/pandagpt_7b_max_len_1024`); see the delta-merge sketch after this list.
- Launch the demo: `cd ./code/ && CUDA_VISIBLE_DEVICES=0 python web_demo.py`.
- If you hit `sample_rate` issues, install `pytorchvideo` from source.
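The delta weights in the second step follow a common release pattern: the published checkpoint stores differences from a base model rather than the full fine-tuned weights. A minimal sketch of the merge idea, with toy tensors standing in for real checkpoints (not the project's actual checkpoint format or parameter names):

```python
import torch

# Illustrative delta-weight merge: the released checkpoint holds
# (finetuned - base) differences, so the usable model is recovered by
# adding each delta onto the matching base parameter.
base = {"proj.weight": torch.zeros(4, 4)}        # stand-in for a base parameter
delta = {"proj.weight": 0.5 * torch.ones(4, 4)}  # stand-in for a released delta

merged = {name: base[name] + delta[name] for name in base}
assert torch.allclose(merged["proj.weight"], 0.5 * torch.ones(4, 4))
```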
Maintenance & Community
Last commit: 2 years ago.
Licensing & Compatibility
The project's dataset and delta weights are licensed under CC BY-NC 4.0.
Limitations & Caveats
Under the CC BY-NC 4.0 terms, usage is strictly limited to non-commercial, research purposes.