Show-o: a unified transformer for multimodal understanding and generation tasks
Show-o is a unified multimodal model designed for both understanding and generation tasks, targeting researchers and developers in AI. It aims to simplify multimodal AI by using a single Transformer architecture to handle diverse inputs and outputs, including image captioning, visual question answering, and text-to-image generation.
How It Works
Show-o tokenizes all input data, regardless of modality, into a single unified sequence. Text tokens are modeled autoregressively with causal attention, while image tokens are modeled by discrete denoising diffusion with full (bidirectional) attention. This lets one model switch seamlessly between understanding and generation tasks, combining the strengths of autoregressive transformers and diffusion models.
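As a rough sketch of this mixed attention pattern (an illustration under assumptions, not code from the repository: it assumes text tokens precede image tokens, and n_text/n_image are made-up lengths):

    import torch

    def mixed_attention_mask(n_text: int, n_image: int) -> torch.Tensor:
        """Attention mask for a unified sequence laid out as
        [text tokens | image tokens]; True means attention is allowed."""
        n = n_text + n_image
        # Causal lower-triangular mask: each token sees itself and the past.
        mask = torch.tril(torch.ones(n, n)).bool()
        # Image tokens get full attention: they also see later image tokens,
        # as bidirectional denoising over the whole image block requires.
        mask[n_text:, :] = True
        return mask

    # Example: 4 text tokens followed by 6 image tokens.
    print(mixed_attention_mask(4, 6).int())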
Quick Start & Requirements
Install dependencies:

    pip3 install -r requirements.txt

A wandb account is required for logging; run wandb login before launching the demos.

Multimodal understanding (MMU) demo:

    python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml ...

Text-to-image generation at 512x512:

    python3 inference_t2i.py config=configs/showo_demo_512x512.yaml ...

Text-to-image generation with the default demo config:

    python3 inference_t2i.py config=configs/showo_demo.yaml ...
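The config=... arguments suggest an OmegaConf-style key=value command line; the following is an assumed sketch of that pattern, not code from the repository, and guidance_scale is a hypothetical override key:

    from omegaconf import OmegaConf

    # Pattern behind commands like:
    #   python3 inference_t2i.py config=configs/showo_demo.yaml guidance_scale=5
    cli = OmegaConf.from_cli()               # collect key=value command-line args
    cfg = OmegaConf.load(cli.pop("config"))  # load the YAML named by `config`
    cfg = OmegaConf.merge(cfg, cli)          # remaining CLI keys override the YAML
    print(OmegaConf.to_yaml(cfg))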
Specific hardware requirements for training are not detailed, but the setup implies distributed training capabilities.

Highlighted Details

Training is built on accelerate.
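For reference, the usual accelerate training pattern looks like this; a generic sketch with a toy model and loss, not Show-o's actual training loop:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Toy stand-ins for the real model, optimizer, and dataset.
    model = nn.Linear(8, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loader = DataLoader(TensorDataset(torch.randn(32, 8), torch.randn(32, 1)),
                        batch_size=4)

    accelerator = Accelerator()  # picks up the setup created by `accelerate config`
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    loss_fn = nn.MSELoss()
    for x, y in loader:
        loss = loss_fn(model(x), y)
        accelerator.backward(loss)  # scales and syncs gradients across processes
        optimizer.step()
        optimizer.zero_grad()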
Maintenance & Community
Licensing & Compatibility
Show-o builds on open-muse, Phi-1.5, transformers, and diffusers, each of which carries its own license. Users should verify compatibility.

Limitations & Caveats
Training requires configuring accelerate and dataset paths, with specific notes for handling language modeling components.