deepgen by deepgenteam

Unified multimodal model for advanced image generation and editing

Created 1 month ago
420 stars

Top 70.0% on SourcePulse

View on GitHub
Project Summary

DeepGen 1.0 is a lightweight, unified multimodal model designed for advanced image generation and editing. Targeting researchers and practitioners, it offers comprehensive capabilities within a compact 5B parameter architecture, demonstrating that high performance in multimodal tasks can be achieved without massive scaling.

How It Works

This 5B parameter model pairs a 3B vision-language model (VLM) with a 2B diffusion transformer (DiT) and relies on data-centric training. Its core innovation is the Stacked Channel Bridging (SCB) framework, which extracts hierarchical VLM features and fuses them with learnable "think tokens" to provide structured guidance to the diffusion backbone. Training progresses through three stages: Alignment Pre-training on image-text and editing data, Joint Supervised Fine-tuning across diverse tasks, and Reinforcement Learning with MR-GRPO for enhanced quality and human preference alignment. This approach enables performance competitive with models 3x-16x larger.
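The SCB idea described above can be illustrated with a minimal NumPy sketch. This is not the official implementation: the tensor shapes, the single linear bridge, and the token-prepending scheme are all assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch (not DeepGen's actual code): Stacked Channel Bridging
# fuses hierarchical VLM features with learnable "think tokens" to build a
# conditioning sequence for the diffusion (DiT) backbone.

rng = np.random.default_rng(0)

seq_len, vlm_dim = 77, 64    # hypothetical VLM sequence length and width
num_layers = 4               # hypothetical number of VLM layers tapped
num_think, dit_dim = 8, 64   # hypothetical think-token count and DiT width

# 1) Features from several VLM layers, stacked along the channel dimension.
layer_feats = [rng.standard_normal((seq_len, vlm_dim)) for _ in range(num_layers)]
stacked = np.concatenate(layer_feats, axis=-1)   # (seq_len, num_layers * vlm_dim)

# 2) A learned projection bridges the stacked channels to the DiT width.
W_bridge = rng.standard_normal((num_layers * vlm_dim, dit_dim)) * 0.02
bridged = stacked @ W_bridge                     # (seq_len, dit_dim)

# 3) Learnable think tokens are prepended as extra conditioning slots.
think_tokens = rng.standard_normal((num_think, dit_dim)) * 0.02
condition = np.concatenate([think_tokens, bridged], axis=0)

print(condition.shape)  # (85, 64): num_think + seq_len conditioning tokens
```

In a real model the bridge and think tokens would be trained end to end; here random values stand in purely to show the shapes involved.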

Quick Start & Requirements

Setup involves cloning the repository, creating a Python 3.12 Conda environment, and installing dependencies including flash_attn==2.8.3, xtuner==0.2.0, transformers==4.56.1, triton==2.3.0, and opencv-python-headless. A CUDA-capable GPU is required for acceleration. Pre-trained checkpoints and a diffusers-compatible model are available on Hugging Face.
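Based on the requirements listed above, setup might look like the following. The repository URL and environment name are placeholders, and the exact install order may differ from the project's own instructions.

```shell
# Sketch of the setup described above; <repo-url> is a placeholder.
git clone <repo-url>
cd deepgen

# Python 3.12 Conda environment
conda create -n deepgen python=3.12 -y
conda activate deepgen

# Pinned dependencies listed in the summary
pip install flash_attn==2.8.3 xtuner==0.2.0 transformers==4.56.1 \
    triton==2.3.0 opencv-python-headless
```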

Highlighted Details

  • Compact Architecture: Features a 5B parameter count (3B VLM + 2B DiT), significantly smaller than many state-of-the-art unified models.
  • Unified Capabilities: Integrates general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering.
  • Competitive Performance: Achieves state-of-the-art or near state-of-the-art results across multiple benchmarks (e.g., Geneval, DPGBench, UniGenBench, GEdit-EN, UniREditBench), often surpassing much larger models.
  • Novel Techniques: Employs Stacked Channel Bridging (SCB) for VLM-DiT alignment and a three-stage training strategy culminating in MR-GRPO reinforcement learning.

Maintenance & Community

Developed by the DeepGen Team at Shanghai Innovation Institute. Contact emails provided are dywang24@m.fudan.edu.cn and wjqdev@gmail.com. No community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, which presents a significant hurdle for adoption decisions, particularly regarding commercial use or integration into closed-source projects.

Limitations & Caveats

No explicit limitations are detailed. However, the project builds heavily on external models and datasets, as its acknowledgements note. The absence of a clear license is a primary adoption blocker.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 3
Star History: 191 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

RPG-DiffusionMaster by YangLing0818

2k stars
Training-free paradigm for text-to-image generation/editing
Created 2 years ago
Updated 1 year ago