LLaDA2.0-Uni by inclusionAI

Unified multimodal understanding and generation model

Created 1 week ago

543 stars

Top 58.4% on SourcePulse

View on GitHub

Project Summary

LLaDA2.0-Uni unifies multimodal understanding and generation tasks within a single Diffusion Large Language Model (dLLM) architecture. It targets researchers and developers seeking a versatile model for complex visual-linguistic applications, offering integrated capabilities for image comprehension, generation, and editing with a focus on efficiency and high fidelity.

How It Works

The project introduces a unified dLLM-based Mixture-of-Experts (MoE) backbone, built on LLaDA 2.0, that uses a Mask Token Prediction paradigm for seamless multimodal integration. Visual inputs are converted into discrete semantic tokens by SigLIP-VQ to support understanding, while high-fidelity generation is handled by a specialized diffusion decoder, distilled for rapid 8-step inference.
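The Mask Token Prediction loop described above can be sketched with a toy stand-in model: decoding starts from a fully masked sequence and, over a fixed number of steps, commits the highest-confidence predictions while leaving the rest masked for the next pass. The lookup "model", the 8-token sequence, and the linear unmasking schedule below are assumptions for illustration, not the repository's actual code:

```python
# Minimal sketch of mask-token-prediction decoding as used by diffusion LLMs.
import random

MASK = "<mask>"
TARGET = ["a", "cat", "sits", "on", "the", "mat", "quietly", "today"]

def toy_model(seq):
    """Stand-in predictor returning (token, confidence) per position.
    A real dLLM would run a transformer over the whole sequence."""
    return [(TARGET[i], random.uniform(0.5, 1.0)) if tok == MASK else (tok, 1.0)
            for i, tok in enumerate(seq)]

def mask_predict_decode(steps=8, seed=0):
    random.seed(seed)
    seq = [MASK] * len(TARGET)
    for step in range(steps):
        preds = toy_model(seq)
        # How many positions should still be masked after this step
        # (linear schedule that reaches zero at the final step).
        keep_masked = len(seq) - (len(seq) * (step + 1)) // steps
        # Rank masked positions by confidence and commit the top ones.
        masked = sorted(((conf, i, tok) for i, (tok, conf) in enumerate(preds)
                         if seq[i] == MASK), reverse=True)
        for _conf, i, tok in masked[:len(masked) - keep_masked]:
            seq[i] = tok
    return seq

print(mask_predict_decode())  # all eight positions unmasked after 8 steps
```

With an 8-step linear schedule over 8 tokens, exactly one token is committed per step; a real model trades steps for quality, which is why the distilled 8-step decoder matters.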

Quick Start & Requirements

Highlighted Details

  • Achieves top-tier performance in visual question answering and document understanding, comparable to dedicated VLMs.
  • Generates highly detailed images and supports flexible single or multi-reference image editing while preserving original details.
  • Enables complex interleaved generation and advanced reasoning tasks through unified discrete representations.
  • Features SPRINT acceleration, combining KV cache reuse, adaptive unmasking, and batch acceptance for significantly faster inference.
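The adaptive-unmasking and batch-acceptance ideas behind SPRINT can be illustrated with a small sketch: rather than committing a fixed number of tokens per step, every masked position whose confidence clears a threshold is accepted in one batch, so easy regions finish in fewer model calls. The threshold, synthetic confidence scores, and stall guard below are assumptions, not SPRINT's actual implementation:

```python
# Illustrative adaptive unmasking with batch acceptance (not SPRINT itself).
MASK = None

def adaptive_unmask(confidences, threshold=0.9, max_steps=10):
    """confidences[i][j] is the model's confidence for position j at step i."""
    length = len(confidences[0])
    seq = [MASK] * length
    steps_used = 0
    for step_conf in confidences[:max_steps]:
        steps_used += 1
        # Batch acceptance: commit every masked position above the threshold.
        accepted = [j for j in range(length)
                    if seq[j] is MASK and step_conf[j] >= threshold]
        if not accepted:
            # Stall guard: always commit at least the single most confident
            # masked position so decoding is guaranteed to make progress.
            masked = [j for j in range(length) if seq[j] is MASK]
            accepted = [max(masked, key=lambda j: step_conf[j])]
        for j in accepted:
            seq[j] = f"tok{j}"
        if MASK not in seq:
            break
    return seq, steps_used
```

For example, a 4-token sequence whose confidences clear the threshold in clusters finishes in 3 steps instead of 4; KV-cache reuse across steps (the other SPRINT component) compounds this saving.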

Maintenance & Community

The initial version was released on April 23, 2026. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README. SGLang support is listed as "Coming Soon."

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

SPRINT acceleration automatically falls back to the baseline method when using Editing CFG (three-way guidance). The provided inference code examples rely on local asset files (e.g., ./assets/understanding_example.png).
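For context on the fallback: three-way editing guidance typically combines three forward passes per step (unconditional, image-conditioned, and image-plus-instruction). The InstructPix2Pix-style linear combination below is a generic convention, assumed for illustration, and not necessarily the repository's exact scheme:

```python
# Generic three-way classifier-free guidance combination (an assumption,
# following the InstructPix2Pix convention, not the repo's verified formula).
def three_way_cfg(uncond, cond_img, cond_full, w_img=1.5, w_txt=7.5):
    """Combine per-token scores from three passes with two guidance scales."""
    return [u + w_img * (i - u) + w_txt * (f - i)
            for u, i, f in zip(uncond, cond_img, cond_full)]
```

Because all three passes must be recomputed each step, per-pass caching assumptions no longer hold, which is consistent with SPRINT reverting to the baseline method in this mode.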

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 566 stars in the last 11 days
