OmniGen2 by VectorSpaceLab

Multimodal generation for text and images

created 2 months ago
3,742 stars

Top 13.0% on SourcePulse

Project Summary

OmniGen2 is a multimodal generative model designed for advanced text-to-image generation, instruction-guided image editing, and in-context visual generation. It targets researchers and power users in AI and computer vision, offering competitive performance and flexibility through its decoupled architecture and dual decoding pathways.

How It Works

OmniGen2 builds on a Qwen2.5-VL foundation, with distinct decoding pathways for text and images that use unshared parameters, plus a decoupled image tokenizer. This separation allows specialized processing of each modality, improving fidelity and control in generation tasks. It also supports inference optimizations such as CPU offload and feature caching for better efficiency.
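The decoupled design can be illustrated with a minimal toy sketch (plain Python, not the actual OmniGen2 code): a shared multimodal backbone feeds two decoder heads whose parameters are not shared, so each modality gets its own specialized pathway. All class and variable names here are illustrative.

```python
import random

random.seed(0)

class Linear:
    """Minimal dense layer that owns its (unshared) parameters."""
    def __init__(self, dim_in, dim_out):
        self.w = [[random.gauss(0, 0.1) for _ in range(dim_in)]
                  for _ in range(dim_out)]

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

class ToyDecoupledModel:
    """Shared backbone, separate text/image decoding pathways."""
    def __init__(self, dim=8):
        self.backbone = Linear(dim, dim)    # shared multimodal encoder
        self.text_head = Linear(dim, dim)   # text pathway, own parameters
        self.image_head = Linear(dim, dim)  # image pathway, own parameters

    def forward(self, x, modality):
        h = self.backbone(x)                # one shared representation
        head = self.text_head if modality == "text" else self.image_head
        return head(h)                      # modality-specific decoding

model = ToyDecoupledModel()
x = [1.0] * 8
text_out = model.forward(x, "text")
image_out = model.forward(x, "image")
# Unshared head parameters mean the two pathways transform the same
# shared features differently.
```

Because the heads never share weights, tuning one pathway (say, image fidelity) cannot degrade the other, which is the motivation behind the decoupled architecture.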

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. PyTorch with CUDA 12.4 is recommended (pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124). flash-attn is recommended for optimal performance.
  • Prerequisites: NVIDIA GPU (RTX 3090 or equivalent with ~17GB VRAM recommended), Python 3.11.
  • Resources: CPU offload is available for devices with limited VRAM.
  • Demos: Online Gradio demos and a web application are available.
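The installation steps above can be collected into one setup sequence. The repository URL is inferred from the project name, and the unpinned `flash-attn` install is an assumption (the README may pin a specific build):

```shell
# Clone the repository (URL inferred from the project name).
git clone https://github.com/VectorSpaceLab/OmniGen2.git
cd OmniGen2

# PyTorch with CUDA 12.4 wheels, as recommended above.
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124

# Remaining dependencies.
pip install -r requirements.txt

# Optional but recommended for best performance; exact pinning may differ.
pip install flash-attn
```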

Highlighted Details

  • Supports visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation.
  • Offers CPU offload for reduced VRAM usage and optimizations like TeaCache and TaylorSeer for faster inference.
  • Provides detailed usage tips for hyperparameters like text_guidance_scale, image_guidance_scale, and max_pixels.
  • Includes a benchmark for in-context generation called OmniContext.
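Caching schemes such as TeaCache speed up diffusion inference by reusing a block's output when its input has barely changed between denoising steps. The following is a self-contained sketch of that idea only; the real TeaCache keys on timestep-embedding differences with calibrated thresholds, and all names here are illustrative:

```python
class CachedBlock:
    """Skip recomputation when the input barely changed since the last step."""
    def __init__(self, tol=1e-2):
        self.tol = tol
        self.last_in = None
        self.last_out = None
        self.evals = 0                     # actual (non-cached) evaluations

    def compute(self, x):
        self.evals += 1
        return [v * 2.0 for v in x]        # stand-in for a transformer block

    def __call__(self, x):
        if self.last_in is not None:
            # Relative L1 change between this input and the cached one.
            delta = sum(abs(a - b) for a, b in zip(x, self.last_in))
            scale = sum(abs(a) for a in self.last_in) or 1.0
            if delta / scale < self.tol:   # input nearly unchanged
                return self.last_out       # cache hit: reuse old output
        self.last_in = list(x)
        self.last_out = self.compute(x)
        return self.last_out

block = CachedBlock(tol=1e-2)
out1 = block([1.0, 2.0, 3.0])
out2 = block([1.0, 2.0, 3.0001])   # near-identical input: served from cache
out3 = block([5.0, 2.0, 3.0])      # large change: recomputed
```

Across tens of denoising steps, consecutive inputs often differ only slightly, so a tolerance-based cache like this can skip a large fraction of the expensive block evaluations.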

Maintenance & Community

The project receives regular updates and community contributions, including official ComfyUI support. Links to online demos and a technical report are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The model may not always follow instructions precisely, requiring prompt iteration or increased image guidance. Output image size defaults to 1024x1024 and may need manual adjustment for optimal results, especially when editing specific images within a batch. In-context generation can sometimes produce objects that differ from the originals, and the authors note a remaining quality gap relative to GPT-4o.

Health Check
Last commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
16
Star History
336 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Zhiqiang Xie (author of SGLang), and 1 more.

Sana by NVlabs

0.2%
4k
Image synthesis research paper using a linear diffusion transformer
created 10 months ago
updated 1 month ago