Show-o by showlab

Unified transformer for multimodal understanding and generation

created 11 months ago
1,625 stars

Top 26.4% on sourcepulse

View on GitHub
Project Summary

Show-o is a unified multimodal model designed for both understanding and generation tasks, targeting researchers and developers in AI. It aims to simplify multimodal AI by using a single Transformer architecture to handle diverse inputs and outputs, including image captioning, visual question answering, and text-to-image generation.

How It Works

Show-o tokenizes all inputs, regardless of modality, into a single unified token sequence. Text tokens are modeled autoregressively with causal attention, while image tokens are modeled with discrete denoising diffusion under full attention. This lets one model switch seamlessly between understanding and generation tasks, combining the strengths of autoregressive transformers and diffusion models.
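
The sketch below illustrates this mixed attention pattern with a toy PyTorch mask in which text positions attend causally and image positions attend bidirectionally; it is a simplified illustration of the idea, not Show-o's actual implementation.

```python
import torch

def omni_attention_mask(num_text: int, num_image: int) -> torch.Tensor:
    """Toy unified attention mask: causal over text tokens, full
    (bidirectional) attention over the trailing block of image tokens.
    Simplified illustration only, not the repository's implementation."""
    n = num_text + num_image
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Image tokens attend to all preceding text tokens and to every
    # other image token, regardless of position.
    mask[num_text:, :] = True
    return mask  # True means attention is allowed

# Example: a prompt of 5 text tokens followed by 4 image tokens.
print(omni_attention_mask(5, 4).int())
```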

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt
  • Prerequisites: wandb account for logging.
  • Inference (a config-loading sketch follows this list):
    • Multimodal Understanding: python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml ...
    • Text-to-Image Generation: python3 inference_t2i.py config=configs/showo_demo_512x512.yaml ...
    • Text-guided Inpainting/Extrapolation: python3 inference_t2i.py config=configs/showo_demo.yaml ...
  • Resources: Requires wandb login. Hardware requirements for training are not documented, but the training setup implies multi-GPU distributed runs via accelerate.
  • Links: Hugging Face Demo, ArXiv Paper, Webpage
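
The inference commands above pass settings as key=value pairs (e.g. config=configs/showo_demo_512x512.yaml), which is the usual OmegaConf CLI pattern. A minimal sketch of loading and overriding such a config follows; the merge logic is an assumption about how the scripts behave, not code taken from the repository.

```python
from omegaconf import OmegaConf

# Assumption: the inference scripts accept OmegaConf-style "key=value" overrides,
# as suggested by the "config=..." arguments in the commands above.
cli = OmegaConf.from_cli()  # parses sys.argv entries of the form key=value
config_path = cli.pop("config", "configs/showo_demo_512x512.yaml")
cfg = OmegaConf.merge(OmegaConf.load(config_path), cli)  # CLI values override the YAML
print(OmegaConf.to_yaml(cfg))
```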

Highlighted Details

  • Supports multimodal understanding and generation tasks including image captioning, VQA, text-to-image, inpainting, and extrapolation.
  • Offers pre-trained checkpoints on Hugging Face for various configurations, including 512x512 image generation; a checkpoint-download sketch follows this list.
  • Includes training code for pre-training and instruction tuning, with support for distributed training via accelerate.
  • Integrates FlexAttention for acceleration.
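
To run the demos with locally cached weights, the released checkpoints can be pre-downloaded with huggingface_hub; the repo id below is an assumption based on the project's organization name and should be verified on the Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Assumption: checkpoint repo id; verify the exact name on the project's
# Hugging Face page before downloading.
local_dir = snapshot_download(repo_id="showlab/show-o-512x512")
print(f"Checkpoint files are cached under: {local_dir}")
```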

Maintenance & Community

  • Accepted to ICLR 2025.
  • Active development with recent updates to the paper and feature additions.
  • Community platforms available via Discord and WeChat.
  • Maintains a list of "Awesome Unified Multimodal Models".

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README. However, it heavily relies on and acknowledges projects like open-muse, Phi-1.5, transformers, and diffusers, which have their own licenses. Users should verify compatibility.

Limitations & Caveats

  • The training pipeline requires manual configuration of accelerate and dataset paths, with specific notes for handling language modeling components.
  • Some features like scaling up model size and training on more datasets are listed as TODOs.
  • Mixed-modal generation and visual tokenizer training are pending features.
Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 267 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

0.1% · 4k stars · created 2 years ago · updated 11 months ago