OmniGen2 by VectorSpaceLab

Multimodal generation for text and images

created 2 months ago
3,742 stars

Top 13.0% on SourcePulse

Project Summary

OmniGen2 is a multimodal generative model designed for advanced text-to-image generation, instruction-guided image editing, and in-context visual generation. It targets researchers and power users in AI and computer vision, offering competitive performance and flexibility through its decoupled architecture and dual decoding pathways.

How It Works

OmniGen2 builds on a Qwen2.5-VL foundation, with distinct decoding pathways for text and images that use unshared parameters, plus a decoupled image tokenizer. This separation allows specialized processing of each modality, improving fidelity and control in generation tasks. It also supports inference optimizations such as CPU offload and feature caching for better efficiency.
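The decoupled design can be illustrated with a minimal toy sketch (plain Python, not the actual OmniGen2 code): a shared multimodal backbone feeds two decoder heads whose parameters are not shared, so each modality gets its own specialized pathway. All class and variable names here are illustrative.

```python
import random

random.seed(0)

class Linear:
    """Minimal dense layer that owns its (unshared) parameters."""
    def __init__(self, dim_in, dim_out):
        self.w = [[random.gauss(0, 0.1) for _ in range(dim_in)]
                  for _ in range(dim_out)]

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

class ToyDecoupledModel:
    """Shared backbone, separate text/image decoding pathways."""
    def __init__(self, dim=8):
        self.backbone = Linear(dim, dim)    # shared multimodal encoder
        self.text_head = Linear(dim, dim)   # text pathway, own parameters
        self.image_head = Linear(dim, dim)  # image pathway, own parameters

    def forward(self, x, modality):
        h = self.backbone(x)                # one shared representation
        head = self.text_head if modality == "text" else self.image_head
        return head(h)                      # modality-specific decoding

model = ToyDecoupledModel()
x = [1.0] * 8
text_out = model.forward(x, "text")
image_out = model.forward(x, "image")
# Unshared head parameters mean the two pathways transform the same
# shared features differently.
```

Because the heads never share weights, tuning one pathway (say, image fidelity) cannot degrade the other, which is the motivation behind the decoupled architecture.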

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. PyTorch with CUDA 12.4 is recommended (pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124). flash-attn is recommended for optimal performance.
  • Prerequisites: NVIDIA GPU (RTX 3090 or equivalent with ~17GB VRAM recommended), Python 3.11.
  • Resources: CPU offload is available for devices with limited VRAM.
  • Demos: Online Gradio demos and a web application are available.
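The installation steps above can be collected into one setup sequence. The repository URL is inferred from the project name, and the unpinned `flash-attn` install is an assumption (the README may pin a specific build):

```shell
# Clone the repository (URL inferred from the project name).
git clone https://github.com/VectorSpaceLab/OmniGen2.git
cd OmniGen2

# PyTorch with CUDA 12.4 wheels, as recommended above.
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124

# Remaining dependencies.
pip install -r requirements.txt

# Optional but recommended for best performance; exact pinning may differ.
pip install flash-attn
```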

Highlighted Details

  • Supports visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation.
  • Offers CPU offload for reduced VRAM usage and optimizations like TeaCache and TaylorSeer for faster inference.
  • Provides detailed usage tips for hyperparameters like text_guidance_scale, image_guidance_scale, and max_pixels.
  • Includes a benchmark for in-context generation called OmniContext.
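Caching schemes such as TeaCache speed up diffusion inference by reusing a block's output when its input has barely changed between denoising steps. The following is a self-contained sketch of that idea only; the real TeaCache keys on timestep-embedding differences with calibrated thresholds, and all names here are illustrative:

```python
class CachedBlock:
    """Skip recomputation when the input barely changed since the last step."""
    def __init__(self, tol=1e-2):
        self.tol = tol
        self.last_in = None
        self.last_out = None
        self.evals = 0                     # actual (non-cached) evaluations

    def compute(self, x):
        self.evals += 1
        return [v * 2.0 for v in x]        # stand-in for a transformer block

    def __call__(self, x):
        if self.last_in is not None:
            # Relative L1 change between this input and the cached one.
            delta = sum(abs(a - b) for a, b in zip(x, self.last_in))
            scale = sum(abs(a) for a in self.last_in) or 1.0
            if delta / scale < self.tol:   # input nearly unchanged
                return self.last_out       # cache hit: reuse old output
        self.last_in = list(x)
        self.last_out = self.compute(x)
        return self.last_out

block = CachedBlock(tol=1e-2)
out1 = block([1.0, 2.0, 3.0])
out2 = block([1.0, 2.0, 3.0001])   # near-identical input: served from cache
out3 = block([5.0, 2.0, 3.0])      # large change: recomputed
```

Across tens of denoising steps, consecutive inputs often differ only slightly, so a tolerance-based cache like this can skip a large fraction of the expensive block evaluations.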

Maintenance & Community

The project receives regular updates and community contributions, including official ComfyUI support. Links to online demos and a technical report are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The model may not always follow instructions precisely, requiring prompt iteration or increased image guidance. Output image size defaults to 1024x1024 and may need manual adjustment for optimal results, especially when editing specific images within a batch. In-context generation can sometimes produce objects that differ from the originals, and the authors note a remaining quality gap relative to GPT-4o.

Health Check
Last commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
16
Star History
336 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Zhiqiang Xie (author of SGLang), and 1 more.

Sana by NVlabs

0.2%
4k
Image synthesis research paper using a linear diffusion transformer
created 10 months ago
updated 1 month ago