Multimodal generation for text and images
OmniGen2 is a multimodal generative model designed for advanced text-to-image generation, instruction-guided image editing, and in-context visual generation. It targets researchers and power users in AI and computer vision, offering competitive performance and flexibility through its decoupled architecture and dual decoding pathways.
How It Works
OmniGen2 builds on a Qwen2.5-VL foundation and features distinct decoding pathways for text and images, with unshared parameters and a decoupled image tokenizer. This design allows specialized processing of each modality, improving fidelity and control in generation tasks. It also supports inference optimizations such as CPU offload and caching mechanisms for better efficiency.
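As a rough illustration of how such a pipeline might be loaded with CPU offload enabled, the sketch below assumes a diffusers-style interface; the class name, import path, model identifier, and method names are assumptions for illustration, not taken from this summary.

```python
# Hypothetical loading sketch. The class name, import path, model ID,
# and offload call are assumptions in the style of diffusers pipelines;
# they are not taken from this summary.
import torch
from omnigen2.pipelines import OmniGen2Pipeline  # assumed import path

pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2",         # assumed model identifier
    torch_dtype=torch.bfloat16,  # reduced precision to lower VRAM use
)
pipe.enable_model_cpu_offload()  # optional: trade speed for lower GPU memory
```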
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. PyTorch with CUDA 12.4 is recommended (pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124), and flash-attn is recommended for optimal performance.
Highlighted Details
Key inference parameters include text_guidance_scale, image_guidance_scale, and max_pixels.
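As a rough illustration of how these parameters might be passed at inference time (continuing the hypothetical loading sketch above), the snippet below uses the parameter names from this summary; the call signature and output handling are assumptions.

```python
# Hypothetical generation call. The parameter names text_guidance_scale,
# image_guidance_scale, and max_pixels come from this summary; the call
# signature and output handling are assumptions.
result = pipe(
    prompt="A red bicycle leaning against a brick wall",
    text_guidance_scale=5.0,   # how strongly the text prompt is followed
    image_guidance_scale=2.0,  # how strongly input images are preserved (editing / in-context)
    max_pixels=1024 * 1024,    # upper bound on output resolution
)
result.images[0].save("output.png")
```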
Maintenance & Community
The project has recent updates and community contributions, including official ComfyUI support. Links to demos and a technical report are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README.
Limitations & Caveats
The model may not always follow instructions precisely, which can require prompt iteration or a higher image guidance scale. The output image size defaults to 1024x1024 and may need manual adjustment for best results, especially when editing a specific image within a batch. In-context generation can sometimes produce objects that differ from the originals, and a quality gap compared to GPT-4o remains.
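As a rough illustration of the remedies mentioned above (raising the image guidance and setting the output size explicitly), the sketch below continues the hypothetical pipeline interface; the input_images, height, and width argument names are assumptions.

```python
# Hypothetical editing call illustrating the remedies described above:
# raising image_guidance_scale and pinning the output size. The argument
# names input_images, height, and width are assumptions.
from PIL import Image

source = Image.open("input.jpg")
edited = pipe(
    prompt="Replace the sky with a sunset",
    input_images=[source],
    image_guidance_scale=3.0,  # higher value keeps the result closer to the source image
    height=768,
    width=1024,                # match the source aspect ratio instead of the 1024x1024 default
)
edited.images[0].save("edited.png")
```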