Emu3 by baaivision

Multimodal model for vision-language understanding and generation

created 10 months ago
2,174 stars

Top 21.2% on sourcepulse

Project Summary

Emu3 is a suite of multimodal models that use next-token prediction for image and video generation and understanding. It aims to provide a unified, transformer-based approach to multimodal AI, and its authors report that it outperforms specialized models without relying on diffusion or separate vision-language models.

How It Works

Emu3 tokenizes images, text, and videos into a discrete space, enabling a single transformer to process and generate multimodal sequences. This approach simplifies the architecture by eliminating the need for separate components like diffusion models or CLIP encoders, allowing for direct next-token prediction for tasks ranging from image generation to video understanding and extension.
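
The idea of a single discrete token space can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the vocabulary, the special tokens `<boi>` and `<eoi>`, the one-dimensional codebook, and all function names are hypothetical and are not taken from the Emu3 codebase. The sketch quantizes "image patches" to codebook indices, offsets them past the text ids so both modalities share one id range, and lists the (context, target) pairs a next-token objective would train on.

```python
# Toy illustration only: names, sizes, and special tokens are hypothetical,
# not taken from the Emu3 codebase.

TEXT_VOCAB = {"<bos>": 0, "<boi>": 1, "<eoi>": 2, "a": 3, "red": 4, "square": 5}
VISION_OFFSET = len(TEXT_VOCAB)  # vision token ids start after the text ids

# A tiny 1-D "codebook": each entry is a prototype patch intensity.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]


def quantize_patch(value):
    """Return the index of the nearest codebook entry."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - value))


def encode_text(words):
    """Map words to text token ids."""
    return [TEXT_VOCAB[w] for w in words]


def encode_image(patches):
    """Map patch values to vision token ids in the shared id range."""
    return [VISION_OFFSET + quantize_patch(p) for p in patches]


def build_sequence(words, patches):
    """Concatenate text and image tokens into one sequence."""
    return (
        [TEXT_VOCAB["<bos>"]]
        + encode_text(words)
        + [TEXT_VOCAB["<boi>"]]
        + encode_image(patches)
        + [TEXT_VOCAB["<eoi>"]]
    )


def next_token_pairs(sequence):
    """List the (context, target) pairs of a next-token prediction objective."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


sequence = build_sequence(["a", "red", "square"], [0.1, 0.9, 0.52, 0.3])
pairs = next_token_pairs(sequence)
print(sequence)
print(len(pairs))
```

Because text and vision ids live in one range, the same prediction target covers both "next word" and "next image token", which is what lets a single transformer handle generation and understanding in this kind of design.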

Quick Start & Requirements

  • Clone the repository, then install dependencies with pip install -r requirements.txt.
  • Requires PyTorch, Hugging Face Transformers, and Flash Attention 2.
  • GPU with CUDA support is necessary for inference.
  • Official Hugging Face models and Modelscope links are provided for various Emu3 components.
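
Before running inference, it can help to confirm that the listed dependencies are present. The following is a minimal sketch, not part of the repository: the helper name `check_dependencies` is hypothetical, and the module names `torch`, `transformers`, and `flash_attn` are assumed to be the import names for PyTorch, Hugging Face Transformers, and Flash Attention 2. It only checks whether each module can be found and does not verify versions or CUDA availability.

```python
import importlib.util

# Assumed import names for the requirements listed above.
REQUIRED_MODULES = ["torch", "transformers", "flash_attn"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A missing entry usually means the requirements install did not complete or ran in a different environment.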

Highlighted Details

  • The authors report state-of-the-art performance in image and video generation and understanding.
  • Reported to outperform task-specific models such as SDXL, LLaVA-1.6, and OpenSora-1.2.
  • Supports flexible image resolutions and styles.
  • Capable of video generation and extension through causal next-token prediction.
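
Extension through causal next-token prediction amounts to feeding the existing tokens as context and appending one predicted token at a time. The loop below is a minimal sketch under stated assumptions: `extend_tokens` and `toy_predictor` are hypothetical names, and the predictor is a trivial stand-in for a trained transformer, not Emu3's model or API.

```python
from typing import Callable, List


def extend_tokens(
    context: List[int],
    predict_next: Callable[[List[int]], int],
    num_new: int,
) -> List[int]:
    """Autoregressively append num_new tokens to a copy of context."""
    tokens = list(context)
    for _ in range(num_new):
        # Each prediction sees only the tokens produced so far (causal).
        tokens.append(predict_next(tokens))
    return tokens


def toy_predictor(tokens: List[int]) -> int:
    """Stand-in for a trained model: repeat the token seen three steps ago."""
    return tokens[-3]


video_tokens = [7, 8, 9, 7, 8, 9]  # pretend these encode existing frames
extended = extend_tokens(video_tokens, toy_predictor, num_new=3)
print(extended)
```

In a real system the appended tokens would be decoded back to frames by the vision tokenizer; that step is omitted here.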

Maintenance & Community

  • Developed by the Emu3 Team at BAAI.
  • Model weights and inference code are released. Training scripts for SFT are available.
  • Links to Hugging Face, Modelscope, and a project page are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license.
  • Model weights are hosted on Hugging Face and Modelscope; any usage terms stated on those model pages may apply.

Limitations & Caveats

  • Training scripts for pre-training and DPO are not yet released.
  • Evaluation code is also pending release.
  • The specific license for commercial use or closed-source linking is not detailed in the README.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 78 stars in the last 90 days

