Nexus-Gen by modelscope

Unified image understanding, generation, and editing model

created 3 months ago
262 stars

Top 97.1% on SourcePulse

View on GitHub
Project Summary

Nexus-Gen is a unified multimodal model designed for image understanding, generation, and editing, leveraging a shared embedding space between LLMs and diffusion models. It targets researchers and developers working on integrated visual AI systems, offering a single framework for diverse image-centric tasks.

How It Works

Nexus-Gen unifies these tasks by mapping image representations into an embedding space shared with the LLM. It uses a multi-stage training strategy: first, an autoregressive model (Qwen2.5-VL-7B-Instruct) is pre-trained on a large dataset for multimodal understanding; then specialized diffusion decoders (based on FLUX.1-Dev) are adapted for generation and editing, with the editing decoder additionally conditioning on the original image embeddings for better reconstruction. This lets language reasoning and image synthesis operate within a single framework.
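The flow can be illustrated with a small runnable toy sketch. Everything below (class names, dimensions, the linear stand-ins) is hypothetical and exists only to show how image embeddings move from the autoregressive model into a diffusion decoder; it is not the DiffSynth-Studio or Nexus-Gen API.

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # hypothetical shared embedding width, for illustration only

class ToyAutoregressiveModel(nn.Module):
    """Stand-in for the Qwen2.5-VL-style LLM that predicts image embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, EMB_DIM)

    def predict_image_embeddings(self, prompt_emb):
        # In the real model this is autoregressive prediction in the shared space.
        return self.proj(prompt_emb)

class ToyDiffusionDecoder(nn.Module):
    """Stand-in for the FLUX.1-Dev-based decoder that renders embeddings to pixels."""
    def __init__(self):
        super().__init__()
        self.to_pixels = nn.Linear(EMB_DIM, 3 * 8 * 8)

    def decode(self, img_emb, reference=None):
        # The editing decoder additionally conditions on the source-image embeddings.
        cond = img_emb if reference is None else img_emb + reference
        return self.to_pixels(cond).reshape(-1, 3, 8, 8)

llm, decoder = ToyAutoregressiveModel(), ToyDiffusionDecoder()
prompt_emb = torch.randn(16, EMB_DIM)                                  # pretend-encoded text prompt
generated = decoder.decode(llm.predict_image_embeddings(prompt_emb))   # generation path
source_emb = torch.randn(16, EMB_DIM)                                  # pretend-encoded source image
edited = decoder.decode(llm.predict_image_embeddings(prompt_emb), reference=source_emb)  # editing path
print(generated.shape, edited.shape)  # torch.Size([16, 3, 8, 8]) twice
```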

Quick Start & Requirements

  • Install by cloning the DiffSynth-Studio repository, then running pip install -e . followed by pip install -r requirements.txt (the full command sequence is sketched after this list).
  • Requires Python and PyTorch. VRAM requirements are substantial: roughly 17 GB for understanding and 24 GB for generation/editing; FP8 quantization is available to reduce memory use.
  • Download model checkpoints using python download_models.py.
  • Official demo available via python app.py.
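Collected as a single sketch, assuming the standard github.com/modelscope/DiffSynth-Studio URL and that the download and demo scripts sit at the root of the checkout (check the README for the exact repository layout):

```bash
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
pip install -r requirements.txt
python download_models.py   # download model checkpoints
python app.py               # launch the official demo
```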

Highlighted Details

  • Nexus-Gen V2 scores 45.7 on MMMU for image understanding and 0.81 on GenEval for text-to-image generation.
  • Supports both English and Chinese prompts for generation and editing.
  • Offers two editing decoders: a standard one and a generation-focused one for large edits.
  • Quantized versions (NF4, float8_e4m3fn) are available for a reduced memory footprint; see the FP8 storage sketch after this list.
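As background on what the float8_e4m3fn variant buys, here is a generic, hypothetical PyTorch sketch of weight-only FP8 storage; it is not how the repository actually loads its quantized checkpoints.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP8WeightLinear(nn.Module):
    """Hypothetical illustration: store weights in float8_e4m3fn (1 byte/param)
    and upcast to the activation dtype at matmul time. Requires PyTorch >= 2.1."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.register_buffer("weight_fp8", linear.weight.detach().to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x):
        w = self.weight_fp8.to(x.dtype)  # transient upcast; resident weight memory stays ~1 byte/param
        return F.linear(x, w, self.bias)

# Usage: wrap an existing layer and call it as usual.
layer = FP8WeightLinear(nn.Linear(4096, 4096, bias=False))
y = layer(torch.randn(2, 4096, dtype=torch.bfloat16))
print(y.shape, y.dtype)  # torch.Size([2, 4096]) torch.bfloat16
```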

Maintenance & Community

The project has seen recent updates (July 2025) with the release of Nexus-Gen V2 and quantized models. The primary contributors are listed in the technical report. Links to ModelScope and Hugging Face model pages are provided.

Licensing & Compatibility

The repository itself does not explicitly state a license. However, it depends on Qwen2.5-VL-7B-Instruct and FLUX.1-Dev, whose licenses should be checked for compatibility with commercial or closed-source use.

Limitations & Caveats

Inference requires substantial VRAM (17-24 GB), potentially limiting accessibility. The project is actively developed, with V2 being a recent release, suggesting potential for ongoing changes and API evolution.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 29 stars in the last 30 days
