JoyAI-Image by jd-opensource

Unified multimodal model for vision and generation

Created 1 week ago

915 stars

Top 39.7% on SourcePulse

Project Summary

JoyAI-Image is a unified multimodal foundation model addressing image understanding, text-to-image generation, and instruction-guided editing. It targets researchers and developers seeking advanced spatial reasoning and controllable image manipulation, providing a single, integrated solution for diverse visual AI tasks.

How It Works

The architecture combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). Its core innovation lies in a closed-loop collaboration: enhanced spatial understanding from the MLLM improves the MMDiT's generation and editing capabilities, while generative transformations provide complementary data for spatial reasoning. This bidirectional feedback loop aims to awaken and strengthen spatial intelligence within the model.
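The closed-loop collaboration described above can be sketched as a minimal Python stub. This is a conceptual illustration only, not the project's actual API: the class names, method names, and string-based "images" are all assumptions made for the sketch.

```python
# Conceptual sketch of the bidirectional MLLM–MMDiT loop.
# All names here are illustrative assumptions, not JoyAI-Image's API.

class MLLM:
    """Stand-in for the 8B understanding model: derives spatial hints
    from an image and a prompt."""
    def understand(self, image, prompt):
        # In the real model this would produce spatial layout/relation
        # conditioning; here it is just a tagged string.
        return {"prompt": prompt, "spatial_hints": f"layout({image})"}

class MMDiT:
    """Stand-in for the 16B diffusion transformer: generates or edits
    an image from the MLLM's conditioning."""
    def generate(self, conditioning):
        return f"image<{conditioning['spatial_hints']}>"

def closed_loop(image, prompt, steps=2):
    """Understanding improves generation; generated views feed back
    into understanding on the next pass."""
    mllm, mmdit = MLLM(), MMDiT()
    for _ in range(steps):
        cond = mllm.understand(image, prompt)  # understanding -> generation
        image = mmdit.generate(cond)           # generation -> new views
    return image
```

The loop structure (not the internals) is the point: each pass routes the MLLM's spatial reasoning into the MMDiT, and the MMDiT's output back into the MLLM.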

Quick Start & Requirements

  • Installation: Requires Python >= 3.10 and a CUDA-capable GPU. Set up a virtual environment (conda create -n joyai python=3.10 -y, then conda activate joyai) and install with pip install -e . from the repository root.
  • Dependencies: Core packages include PyTorch (>= 2.8), Transformers (>= 4.57.0, < 4.58.0), Diffusers (>= 0.34.0), and Flash Attention (>= 2.8.0) for optimal performance.
  • Resources: Checkpoints are available on Hugging Face. Beyond the technical report PDF, the README does not link separate documentation or a demo.
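As a quick sanity check, the version constraints listed above can be verified against an installed environment. The helper below is an illustration written for this summary, not part of JoyAI-Image; the constraint values are taken from the dependency list above.

```python
# Hedged sketch: check installed versions against the README's stated
# constraints (PyTorch >= 2.8; Transformers >= 4.57.0, < 4.58.0;
# Diffusers >= 0.34.0; Flash Attention >= 2.8.0).

def vtuple(version):
    """Parse a dotted version string into a comparable tuple of ints.
    Non-numeric suffixes (e.g. 'dev', 'rc1') are truncated per segment."""
    parts = []
    for segment in version.split("."):
        digits = ""
        for ch in segment:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits or 0))
    return tuple(parts)

# (lower bound, optional exclusive upper bound) per package
REQUIREMENTS = {
    "torch": (vtuple("2.8"), None),
    "transformers": (vtuple("4.57.0"), vtuple("4.58.0")),
    "diffusers": (vtuple("0.34.0"), None),
    "flash_attn": (vtuple("2.8.0"), None),
}

def satisfies(installed, lower, upper=None):
    """True if an installed version string meets lower <= v (< upper)."""
    v = vtuple(installed)
    return v >= lower and (upper is None or v < upper)
```

In practice the installed version strings would come from `importlib.metadata.version(name)` for each package in `REQUIREMENTS`.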

Highlighted Details

  • Unified Multimodal Foundation: A single model family supports understanding, generation, and editing via a shared MLLM-MMDiT interface.
  • Awakened Spatial Intelligence: Demonstrates strong spatial understanding, controllable spatial editing, and novel-view-assisted reasoning through its bidirectional loop.
  • Advanced Visual Generation: Excels in long-text typography, layout fidelity, multi-view generation, and precise, structure-preserving editing.
  • Specialized Models: Offers distinct models like JoyAI-Image-Und for understanding and JoyAI-Image-Edit for instruction-guided editing.

Maintenance & Community

The project is actively hiring Research Scientists, Engineers, and Interns for next-generation generative models. Interested candidates can send resumes to huanghaoyang.ocean@jd.com. No community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

Licensed under the Apache 2.0 license, which permits commercial use and modification.

Limitations & Caveats

Several advanced models, including JoyAI-Image-Edit-Distilled, JoyAI-Image-Edit-Plus (multi-image editing), and the core JoyAI-Image text-to-image model, are marked as "To be released," indicating they are not yet available. The README also references a speculative gpt-5 model for prompt rewriting.

Health Check
Last Commit

9 hours ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
10
Star History
923 stars in the last 12 days
