Show-o by showlab

Unified transformer for multimodal understanding and generation

created 11 months ago
1,625 stars

Top 26.4% on sourcepulse

View on GitHub
Project Summary

Show-o is a unified multimodal model designed for both understanding and generation tasks, targeting researchers and developers in AI. It aims to simplify multimodal AI by using a single Transformer architecture to handle diverse inputs and outputs, including image captioning, visual question answering, and text-to-image generation.

How It Works

Show-o tokenizes all inputs, regardless of modality, into a single unified token sequence. Text tokens are modeled autoregressively with causal attention, while image tokens are modeled with discrete denoising diffusion under full attention. This lets one model switch seamlessly between understanding and generation tasks, combining the strengths of autoregressive transformers and diffusion models.
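
The sketch below illustrates this mixed attention pattern with a toy PyTorch mask in which text positions attend causally and image positions attend bidirectionally; it is a simplified illustration of the idea, not Show-o's actual implementation.

```python
import torch

def omni_attention_mask(num_text: int, num_image: int) -> torch.Tensor:
    """Toy unified attention mask: causal over text tokens, full
    (bidirectional) attention over the trailing block of image tokens.
    Simplified illustration only, not the repository's implementation."""
    n = num_text + num_image
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Image tokens attend to all preceding text tokens and to every
    # other image token, regardless of position.
    mask[num_text:, :] = True
    return mask  # True means attention is allowed

# Example: a prompt of 5 text tokens followed by 4 image tokens.
print(omni_attention_mask(5, 4).int())
```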

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt
  • Prerequisites: wandb account for logging.
  • Inference (a config-loading sketch follows this list):
    • Multimodal Understanding: python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml ...
    • Text-to-Image Generation: python3 inference_t2i.py config=configs/showo_demo_512x512.yaml ...
    • Text-guided Inpainting/Extrapolation: python3 inference_t2i.py config=configs/showo_demo.yaml ...
  • Resources: Requires wandb login. Hardware requirements for training are not documented, but the training setup implies multi-GPU distributed runs via accelerate.
  • Links: Hugging Face Demo, ArXiv Paper, Webpage
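
The inference commands above pass settings as key=value pairs (e.g. config=configs/showo_demo_512x512.yaml), which is the usual OmegaConf CLI pattern. A minimal sketch of loading and overriding such a config follows; the merge logic is an assumption about how the scripts behave, not code taken from the repository.

```python
from omegaconf import OmegaConf

# Assumption: the inference scripts accept OmegaConf-style "key=value" overrides,
# as suggested by the "config=..." arguments in the commands above.
cli = OmegaConf.from_cli()  # parses sys.argv entries of the form key=value
config_path = cli.pop("config", "configs/showo_demo_512x512.yaml")
cfg = OmegaConf.merge(OmegaConf.load(config_path), cli)  # CLI values override the YAML
print(OmegaConf.to_yaml(cfg))
```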

Highlighted Details

  • Supports multimodal understanding and generation tasks including image captioning, VQA, text-to-image, inpainting, and extrapolation.
  • Offers pre-trained checkpoints on Hugging Face for various configurations, including 512x512 image generation; a checkpoint-download sketch follows this list.
  • Includes training code for pre-training and instruction tuning, with support for distributed training via accelerate.
  • Integrates FlexAttention for acceleration.
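
To run the demos with locally cached weights, the released checkpoints can be pre-downloaded with huggingface_hub; the repo id below is an assumption based on the project's organization name and should be verified on the Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Assumption: checkpoint repo id; verify the exact name on the project's
# Hugging Face page before downloading.
local_dir = snapshot_download(repo_id="showlab/show-o-512x512")
print(f"Checkpoint files are cached under: {local_dir}")
```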

Maintenance & Community

  • Accepted to ICLR 2025.
  • Active development with recent updates to the paper and feature additions.
  • Community platforms available via Discord and WeChat.
  • Maintains a list of "Awesome Unified Multimodal Models".

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README. However, it heavily relies on and acknowledges projects like open-muse, Phi-1.5, transformers, and diffusers, which have their own licenses. Users should verify compatibility.

Limitations & Caveats

  • The training pipeline requires manual configuration of accelerate and dataset paths, with specific notes for handling language modeling components.
  • Some features like scaling up model size and training on more datasets are listed as TODOs.
  • Mixed-modal generation and visual tokenizer training are pending features.
Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 267 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

0.1% · 4k stars · created 2 years ago · updated 11 months ago