Unified image understanding, generation, and editing model
Top 97.1% on SourcePulse
Nexus-Gen is a unified multimodal model designed for image understanding, generation, and editing, leveraging a shared embedding space between LLMs and diffusion models. It targets researchers and developers working on integrated visual AI systems, offering a single framework for diverse image-centric tasks.
How It Works
Nexus-Gen unifies these tasks by mapping images into an embedding space shared with the LLM. Training proceeds in stages: first, an autoregressive model initialized from Qwen2.5-VL-7B-Instruct is trained on a large dataset for multimodal understanding; then specialized decoders based on FLUX.1-Dev are adapted for generation and editing, with the editing decoder additionally conditioned on the original image's embeddings for faithful reconstruction. This design lets language reasoning and image synthesis operate within a single framework.
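The toy sketch below illustrates this shared-embedding design, not the project's actual API: a stand-in autoregressive module emits continuous image embeddings, and a stand-in decoder consumes them, optionally alongside the source image's embeddings for the editing path. All module names, dimensions, and the token count are illustrative assumptions.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512  # illustrative; the real shared space matches the LLM's hidden size

class AutoregressiveEmbedder(nn.Module):
    """Toy stand-in for the Qwen2.5-VL-based LLM: autoregressively emits
    continuous image embeddings conditioned on a prompt encoding."""
    def __init__(self, num_tokens: int = 64):
        super().__init__()
        self.num_tokens = num_tokens
        self.cell = nn.GRUCell(EMBED_DIM, EMBED_DIM)  # toy stand-in for the transformer

    def forward(self, prompt_state: torch.Tensor) -> torch.Tensor:
        hidden = prompt_state
        inp = torch.zeros_like(prompt_state)
        embeds = []
        for _ in range(self.num_tokens):
            hidden = self.cell(inp, hidden)
            embeds.append(hidden)
            inp = hidden  # feed each predicted embedding back in (autoregression)
        return torch.stack(embeds, dim=1)  # (batch, num_tokens, EMBED_DIM)

class VisionDecoder(nn.Module):
    """Toy stand-in for the FLUX.1-Dev-based decoder: maps image embeddings
    (plus, for editing, the source image's embeddings) to pixels."""
    def __init__(self, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.proj = nn.Linear(EMBED_DIM, 3 * image_size * image_size)

    def forward(self, image_embeds, source_embeds=None):
        if source_embeds is not None:
            # Editing path: condition on the original image's embeddings
            # so unedited regions are reconstructed faithfully.
            image_embeds = image_embeds + source_embeds
        pooled = image_embeds.mean(dim=1)
        return self.proj(pooled).view(-1, 3, self.image_size, self.image_size)

# Generation: the LLM reasons over the prompt and emits embeddings, which the
# decoder renders; editing would also pass the source image's embeddings.
prompt_state = torch.randn(1, EMBED_DIM)
embeds = AutoregressiveEmbedder()(prompt_state)
image = VisionDecoder()(embeds)
print(image.shape)  # torch.Size([1, 3, 64, 64])
```

The key point is that both decoders consume the same embedding space the LLM writes into, which is what allows one model to serve understanding, generation, and editing.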
Quick Start & Requirements
Setup involves cloning the DiffSynth-Studio repository and running `pip install -e .`, followed by `pip install -r requirements.txt`. Model weights are then downloaded with `python download_models.py`, and the demo app is launched with `python app.py`; the steps are collected into one sequence below.
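A minimal sketch of those steps in order; the repository URLs are assumptions inferred from the project names and should be verified against the official pages:

```bash
# Install DiffSynth-Studio from source (URL assumed)
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
cd ..

# Set up Nexus-Gen itself (URL assumed)
git clone https://github.com/modelscope/Nexus-Gen.git
cd Nexus-Gen
pip install -r requirements.txt

# Download model weights, then launch the demo
python download_models.py
python app.py
```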
Maintenance & Community
The project has seen recent updates (July 2025) with the release of Nexus-Gen V2 and quantized models. The primary contributors are listed in the technical report. Links to ModelScope and Hugging Face model pages are provided.
Licensing & Compatibility
The repository itself does not explicitly state a license. However, it depends on Qwen2.5-VL-7B-Instruct and FLUX.1-Dev, whose licenses should be checked for compatibility with commercial or closed-source use.
Limitations & Caveats
Inference requires substantial VRAM (17–24 GB), which may limit accessibility on consumer GPUs. The project is under active development, with V2 only recently released, so APIs and checkpoints may continue to evolve.