Show-o by showlab

Unified transformer research paper for multimodal tasks

Created 1 year ago
1,700 stars

Top 25.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Show-o is a unified multimodal model designed for both understanding and generation tasks, targeting researchers and developers in AI. It aims to simplify multimodal AI by using a single Transformer architecture to handle diverse inputs and outputs, including image captioning, visual question answering, and text-to-image generation.

How It Works

Show-o tokenizes all input data, regardless of modality, into a single unified sequence. Text tokens are modeled autoregressively with causal attention, while discrete image tokens are modeled with denoising diffusion under full (bidirectional) attention. This lets one model switch seamlessly between understanding and generation tasks, combining the strengths of autoregressive and diffusion modeling within a single transformer.
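
Below is a minimal sketch of that mixed attention pattern, assuming a text-to-image layout in which a short text prefix is followed by a block of discrete image tokens; the token counts and the helper name build_omni_attention_mask are illustrative, not the repository's exact implementation.

```python
import torch

def build_omni_attention_mask(num_text: int, num_image: int) -> torch.Tensor:
    """Boolean attention mask: True means the query token may attend to the key token."""
    seq_len = num_text + num_image
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Text tokens: causal (autoregressive) attention over the text prefix.
    mask[:num_text, :num_text] = torch.ones(num_text, num_text).tril().bool()

    # Image tokens: full bidirectional attention over the text prefix and over
    # all image tokens, matching the discrete denoising diffusion branch.
    mask[num_text:, :] = True
    return mask

# e.g. a short prompt followed by a 32x32 grid of discrete image tokens
mask = build_omni_attention_mask(num_text=16, num_image=1024)
```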

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt
  • Prerequisites: wandb account for logging.
  • Inference:
    • Multimodal Understanding: python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml ...
    • Text-to-Image Generation: python3 inference_t2i.py config=configs/showo_demo_512x512.yaml ...
    • Text-guided Inpainting/Extrapolation: python3 inference_t2i.py config=configs/showo_demo.yaml ...
  • Resources: Requires wandb login. Hardware requirements for training are not specified, but the training scripts assume a distributed, multi-GPU setup (e.g., via accelerate).
  • Links: Hugging Face Demo, ArXiv Paper, Webpage

Highlighted Details

  • Supports multimodal understanding and generation tasks including image captioning, VQA, text-to-image, inpainting, and extrapolation.
  • Offers pre-trained checkpoints on Hugging Face for various configurations, including 512x512 image generation.
  • Includes training code for pre-training and instruction tuning, with support for distributed training via accelerate.
  • Integrates FlexAttention for acceleration.
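
As a hedged sketch of what a FlexAttention integration could look like (assuming PyTorch 2.5+ and the same illustrative text-prefix-then-image-tokens layout as above; this is not the repository's exact code), the text-causal / image-full pattern can be expressed as a single mask_mod:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

num_text, num_image = 128, 1024          # illustrative sizes
seq_len = num_text + num_image

def omni_mask_mod(b, h, q_idx, kv_idx):
    # True = this (query, key) pair may attend.
    text_causal = (q_idx < num_text) & (kv_idx <= q_idx)   # causal over the text prefix
    image_full = q_idx >= num_text                          # image tokens attend to everything
    return text_causal | image_full

block_mask = create_block_mask(omni_mask_mod, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len)

q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.float16)
out = flex_attention(q, k, v, block_mask=block_mask)
# For the advertised speedup, flex_attention is typically wrapped in torch.compile.
```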

Maintenance & Community

  • Accepted to ICLR 2025.
  • Active development with recent updates to the paper and feature additions.
  • Community platforms available via Discord and WeChat.
  • Maintains a list of "Awesome Unified Multimodal Models".

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README. However, it heavily relies on and acknowledges projects like open-muse, Phi-1.5, transformers, and diffusers, which have their own licenses. Users should verify compatibility.

Limitations & Caveats

  • The training pipeline requires manual configuration of accelerate and dataset paths, with specific notes for handling language modeling components.
  • Some features like scaling up model size and training on more datasets are listed as TODOs.
  • Mixed-modal generation and visual tokenizer training are pending features.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 55 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

0%
373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1%
4k stars
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

0.2%
6k stars
Transformer library with extensive experimental features
Created 4 years ago
Updated 5 days ago