LLaDA2.0-Uni by inclusionAI

Unified multimodal understanding and generation model

Created 1 week ago

543 stars

Top 58.4% on SourcePulse

View on GitHub

Project Summary

LLaDA2.0-Uni unifies multimodal understanding and generation tasks within a single Diffusion Large Language Model (dLLM) architecture. It targets researchers and developers seeking a versatile model for complex visual-linguistic applications, offering integrated capabilities for image comprehension, generation, and editing with a focus on efficiency and high fidelity.

How It Works

The project introduces a unified dLLM-based Mixture-of-Experts (MoE) backbone, built on LLaDA 2.0, that uses a Mask Token Prediction paradigm for seamless multimodal integration. Visual inputs are converted into discrete semantic tokens by SigLIP-VQ to support understanding, while high-fidelity generation is handled by a specialized diffusion decoder, distilled for rapid 8-step inference.
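The Mask Token Prediction loop described above can be sketched with a toy stand-in model: decoding starts from a fully masked sequence and, over a fixed number of steps, commits the highest-confidence predictions while leaving the rest masked for the next pass. The lookup "model", the 8-token sequence, and the linear unmasking schedule below are assumptions for illustration, not the repository's actual code:

```python
# Minimal sketch of mask-token-prediction decoding as used by diffusion LLMs.
import random

MASK = "<mask>"
TARGET = ["a", "cat", "sits", "on", "the", "mat", "quietly", "today"]

def toy_model(seq):
    """Stand-in predictor returning (token, confidence) per position.
    A real dLLM would run a transformer over the whole sequence."""
    return [(TARGET[i], random.uniform(0.5, 1.0)) if tok == MASK else (tok, 1.0)
            for i, tok in enumerate(seq)]

def mask_predict_decode(steps=8, seed=0):
    random.seed(seed)
    seq = [MASK] * len(TARGET)
    for step in range(steps):
        preds = toy_model(seq)
        # How many positions should still be masked after this step
        # (linear schedule that reaches zero at the final step).
        keep_masked = len(seq) - (len(seq) * (step + 1)) // steps
        # Rank masked positions by confidence and commit the top ones.
        masked = sorted(((conf, i, tok) for i, (tok, conf) in enumerate(preds)
                         if seq[i] == MASK), reverse=True)
        for _conf, i, tok in masked[:len(masked) - keep_masked]:
            seq[i] = tok
    return seq

print(mask_predict_decode())  # all eight positions unmasked after 8 steps
```

With an 8-step linear schedule over 8 tokens, exactly one token is committed per step; a real model trades steps for quality, which is why the distilled 8-step decoder matters.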

Quick Start & Requirements

Highlighted Details

  • Achieves top-tier performance in visual question answering and document understanding, comparable to dedicated VLMs.
  • Generates highly detailed images and supports flexible single or multi-reference image editing while preserving original details.
  • Enables complex interleaved generation and advanced reasoning tasks through unified discrete representations.
  • Features SPRINT acceleration, combining KV cache reuse, adaptive unmasking, and batch acceptance for significantly faster inference.
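The adaptive-unmasking and batch-acceptance ideas behind SPRINT can be illustrated with a small sketch: rather than committing a fixed number of tokens per step, every masked position whose confidence clears a threshold is accepted in one batch, so easy regions finish in fewer model calls. The threshold, synthetic confidence scores, and stall guard below are assumptions, not SPRINT's actual implementation:

```python
# Illustrative adaptive unmasking with batch acceptance (not SPRINT itself).
MASK = None

def adaptive_unmask(confidences, threshold=0.9, max_steps=10):
    """confidences[i][j] is the model's confidence for position j at step i."""
    length = len(confidences[0])
    seq = [MASK] * length
    steps_used = 0
    for step_conf in confidences[:max_steps]:
        steps_used += 1
        # Batch acceptance: commit every masked position above the threshold.
        accepted = [j for j in range(length)
                    if seq[j] is MASK and step_conf[j] >= threshold]
        if not accepted:
            # Stall guard: always commit at least the single most confident
            # masked position so decoding is guaranteed to make progress.
            masked = [j for j in range(length) if seq[j] is MASK]
            accepted = [max(masked, key=lambda j: step_conf[j])]
        for j in accepted:
            seq[j] = f"tok{j}"
        if MASK not in seq:
            break
    return seq, steps_used
```

For example, a 4-token sequence whose confidences clear the threshold in clusters finishes in 3 steps instead of 4; KV-cache reuse across steps (the other SPRINT component) compounds this saving.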

Maintenance & Community

The initial version was released on April 23, 2026. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README. SGLang support is listed as "Coming Soon."

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

SPRINT acceleration automatically falls back to the baseline method when using Editing CFG (three-way guidance). The provided inference code examples rely on local asset files (e.g., ./assets/understanding_example.png).
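For context on the fallback: three-way editing guidance typically combines three forward passes per step (unconditional, image-conditioned, and image-plus-instruction). The InstructPix2Pix-style linear combination below is a generic convention, assumed for illustration, and not necessarily the repository's exact scheme:

```python
# Generic three-way classifier-free guidance combination (an assumption,
# following the InstructPix2Pix convention, not the repo's verified formula).
def three_way_cfg(uncond, cond_img, cond_full, w_img=1.5, w_txt=7.5):
    """Combine per-token scores from three passes with two guidance scales."""
    return [u + w_img * (i - u) + w_txt * (f - i)
            for u, i, f in zip(uncond, cond_img, cond_full)]
```

Because all three passes must be recomputed each step, per-pass caching assumptions no longer hold, which is consistent with SPRINT reverting to the baseline method in this mode.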

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 566 stars in the last 11 days
