Multimodal model for vision-language understanding and generation
Emu3 is a suite of multimodal models that rely on next-token prediction for image and video generation and understanding. It aims to provide a unified, transformer-based approach to multimodal AI that outperforms specialized models without depending on diffusion pipelines or compositional vision-language architectures.
How It Works
Emu3 tokenizes images, text, and videos into a discrete space, enabling a single transformer to process and generate multimodal sequences. This approach simplifies the architecture by eliminating the need for separate components like diffusion models or CLIP encoders, allowing for direct next-token prediction for tasks ranging from image generation to video understanding and extension.
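As a rough illustration of that idea (not Emu3's actual code), the sketch below quantizes image patches into discrete codebook indices, appends them to text token ids in one shared vocabulary, and trains a single causal transformer with plain next-token prediction over the interleaved sequence. All class names, vocabulary sizes, and dimensions here are invented for the example.

```python
# Conceptual sketch only: one discrete vocabulary, one causal transformer,
# one next-token-prediction loss for both text and image tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, VISION_VOCAB = 32000, 8192      # assumed sizes, for illustration
VOCAB = TEXT_VOCAB + VISION_VOCAB           # shared vocabulary: text ids first, then vision codes

class VectorQuantizer(nn.Module):
    """Maps continuous patch features to the nearest codebook entry (a discrete id)."""
    def __init__(self, codes=VISION_VOCAB, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codes, dim))

    def forward(self, patch_feats):                      # (num_patches, dim)
        dists = torch.cdist(patch_feats, self.codebook)  # distance to every code
        return dists.argmin(dim=-1) + TEXT_VOCAB         # offset into the shared vocab

class TinyCausalLM(nn.Module):
    """A single decoder-only transformer over the shared multimodal vocabulary."""
    def __init__(self, vocab=VOCAB, dim=64, layers=2, heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                               # (batch, seq)
        seq = ids.shape[1]
        x = self.embed(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        return self.head(self.blocks(x, mask=mask))       # next-token logits

# Build one interleaved sequence: a text prompt followed by discretized image patches.
quantizer, lm = VectorQuantizer(), TinyCausalLM()
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))           # stand-in for a tokenized prompt
image_ids = quantizer(torch.randn(64, 64)).unsqueeze(0)    # 64 patches -> 64 discrete codes
sequence = torch.cat([text_ids, image_ids], dim=1)

logits = lm(sequence)                                       # predict every next token
loss = nn.functional.cross_entropy(logits[:, :-1].flatten(0, 1), sequence[:, 1:].flatten())
```

Generation works the same way in reverse: the model samples vision tokens one at a time, and a decoder maps the discrete codes back to pixels.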
Quick Start & Requirements
After cloning the repository, install the dependencies:
pip install -r requirements.txt
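From there, inference typically goes through Hugging Face transformers. The snippet below is a hedged sketch: the checkpoint name, dtype, and the need for trust_remote_code are assumptions, and each checkpoint expects its own preprocessing (vision tokenizer, chat template), so follow the repository's inference scripts for the exact pipeline.

```python
# Hedged loading sketch; checkpoint id and settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/Emu3-Chat"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed; pick a dtype your hardware supports
    device_map="auto",            # requires the accelerate package
    trust_remote_code=True,
)
```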
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats