Discover and explore top open-source AI tools and projects—updated daily.
DeepGen Team · Unified multimodal model for advanced image generation and editing
Top 70.0% on SourcePulse
Summary
DeepGen 1.0 is a lightweight, unified multimodal model designed for advanced image generation and editing. Targeting researchers and practitioners, it offers comprehensive capabilities within a compact 5B parameter architecture, demonstrating that high performance in multimodal tasks can be achieved without massive scaling.
How It Works
This 5B-parameter model pairs a 3B vision-language model (VLM) with a 2B diffusion transformer (DiT) and relies on data-centric training. Its core innovation is the Stacked Channel Bridging (SCB) framework, which extracts hierarchical VLM features and fuses them with learnable "think tokens" to provide structured guidance to the diffusion backbone. Training proceeds in three stages: alignment pre-training on image-text and editing data, joint supervised fine-tuning across diverse tasks, and reinforcement learning with MR-GRPO for improved quality and human-preference alignment. This approach yields competitive performance against models 3x-16x larger.
Quick Start & Requirements
Setup involves cloning the repository, creating a Python 3.12 Conda environment, and installing dependencies including flash_attn==2.8.3, xtuner==0.2.0, transformers==4.56.1, triton==2.3.0, and opencv-python-headless. A CUDA-capable GPU is required for acceleration. Pre-trained checkpoints and a diffusers-compatible model are available on Hugging Face.
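The steps above can be sketched as the following shell session. The repository URL and environment name are assumptions for illustration; the pinned package versions come from the list above, so check the project's README for the exact instructions.

```shell
# Hypothetical quick-start sketch; the repo URL and env name are assumed.
git clone https://github.com/deepgen-team/DeepGen.git
cd DeepGen

# Python 3.12 Conda environment, as described above.
conda create -n deepgen python=3.12 -y
conda activate deepgen

# Pinned dependencies listed in the README summary.
pip install flash_attn==2.8.3 xtuner==0.2.0 transformers==4.56.1 \
    triton==2.3.0 opencv-python-headless
```

Pinning exact versions matters here: flash_attn and triton builds are sensitive to the installed CUDA toolkit and transformers version, so deviating from the pins is a common source of install failures.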
Maintenance & Community
Developed by the DeepGen Team at Shanghai Innovation Institute. Contact emails provided are dywang24@m.fudan.edu.cn and wjqdev@gmail.com. No community channels (e.g., Discord, Slack) are listed in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README, which presents a significant hurdle for adoption decisions, particularly regarding commercial use or integration into closed-source projects.
Limitations & Caveats
No explicit limitations are documented. The project's results depend heavily on numerous external works and datasets acknowledged in the README, and the absence of a clear license remains the primary adoption blocker.
Last updated 1 month ago · Inactive · Maintainer: YangLing0818