qwen2vl-flux by erwold

Image generation model for multimodal control

Created 9 months ago
565 stars

Top 56.9% on SourcePulse

Project Summary

This repository provides Qwen2VL-Flux, a controllable image generation model that unifies text and image guidance by integrating Qwen2VL's multimodal understanding with the Flux architecture and Stable Diffusion. It targets researchers and power users seeking advanced image manipulation capabilities, offering enhanced control through ControlNet features like depth and line detection.

How It Works

The model enhances Stable Diffusion by replacing its traditional text encoder with Qwen2VL, a vision-language model, for superior multimodal comprehension. It leverages the Flux architecture and integrates ControlNet for precise structural guidance, enabling various generation modes like variation, img2img, inpainting, and ControlNet-guided generation. This approach allows for more nuanced control over image output using both textual prompts and visual references.
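The conditioning swap described above can be sketched in a few lines. This is an illustrative mock, not the repository's API: the function names, embedding width (`EMBED_DIM`), and conditioning width (`COND_DIM`) are assumptions chosen only to show how a vision-language model's fused text-and-image hidden states could stand in for the text-encoder output that a stock diffusion pipeline consumes.

```python
# Illustrative sketch (hypothetical names and dimensions, not the repo's code):
# a vision-language encoder fuses text and image tokens into one sequence,
# which is then projected into the conditioning space of the diffusion model,
# taking the role CLIP/T5 text embeddings play in a stock pipeline.
import numpy as np

EMBED_DIM = 3584  # assumed VLM hidden size
COND_DIM = 4096   # assumed diffusion-transformer conditioning width

rng = np.random.default_rng(0)
projection = rng.standard_normal((EMBED_DIM, COND_DIM)) * 0.02

def encode_multimodal(text_tokens: np.ndarray, image_patches: np.ndarray) -> np.ndarray:
    """Stand-in for the VLM: fuse text and image tokens into one sequence."""
    return np.concatenate([text_tokens, image_patches], axis=0)  # (seq, EMBED_DIM)

def to_diffusion_conditioning(hidden_states: np.ndarray) -> np.ndarray:
    """Project VLM hidden states into the space the diffusion model expects."""
    return hidden_states @ projection  # (seq, COND_DIM)

text = rng.standard_normal((16, EMBED_DIM))    # mock text-token embeddings
image = rng.standard_normal((64, EMBED_DIM))   # mock image-patch embeddings
cond = to_diffusion_conditioning(encode_multimodal(text, image))
print(cond.shape)  # (80, 4096)
```

Because the fused sequence carries both modalities, the same conditioning path serves text prompts, reference images, or both at once, which is what enables the variation and img2img modes alongside plain prompting.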

Quick Start & Requirements

  • Install: pip install -r requirements.txt after cloning the repository.
  • Prerequisites: Python 3.8+, PyTorch >= 2.4.1, Transformers 4.45.0, Diffusers 0.30.0, Accelerate 0.33.0. A CUDA-compatible GPU with 48GB+ memory is recommended.
  • Model Checkpoints: Requires downloading Qwen2VL-Flux, Qwen2VL, and optionally ControlNet, Depth Anything V2, Mistoline, and SAM2 models into a checkpoints directory.
  • Configuration: Model paths can be set in model.py or via the CHECKPOINT_DIR environment variable.
  • Usage: python main.py --mode <mode> --input_image <path> [additional options]
  • Docs: Technical Report available.
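The checkpoint layout above can be resolved programmatically. The sketch below is a hypothetical helper, not code from `model.py`: the directory names and the mode-to-component mapping are assumptions, but it shows the `CHECKPOINT_DIR` environment-variable override the README describes, with optional components pulled in only for the modes that need them.

```python
# Hypothetical sketch of checkpoint-path resolution (names are illustrative):
# CHECKPOINT_DIR overrides the default "checkpoints" directory, and optional
# models (ControlNet, depth) are only resolved for ControlNet-guided modes.
import os
from pathlib import Path

DEFAULT_CHECKPOINT_DIR = Path("checkpoints")

def resolve_checkpoints(mode: str) -> dict:
    root = Path(os.environ.get("CHECKPOINT_DIR", DEFAULT_CHECKPOINT_DIR))
    paths = {
        "flux": root / "qwen2vl-flux",   # assumed subdirectory names
        "qwen2vl": root / "qwen2vl",
    }
    if mode in ("controlnet", "controlnet-inpaint"):
        paths["controlnet"] = root / "controlnet"
        paths["depth"] = root / "depth-anything-v2"
    return paths

print(sorted(resolve_checkpoints("controlnet")))
```

A resolver like this keeps the CLI entry point (`python main.py --mode <mode> ...`) free of hard-coded paths while letting users relocate the multi-gigabyte checkpoints with a single environment variable.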

Highlighted Details

  • Supports multiple generation modes: variation, img2img, inpainting, ControlNet, and ControlNet-inpaint.
  • Integrates ControlNet for line detection and depth-aware generation with adjustable strengths.
  • Features advanced options like attention control for focused generation and Turbo mode for faster inference.
  • Implements smart model loading, initializing only the components a given mode requires, to reduce memory usage.
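The "smart model loading" bullet can be illustrated with a small lazy-loading pattern. This is a sketch of the general technique, not the repository's implementation; the mode-to-component table and component names are assumptions.

```python
# Illustrative lazy-loading sketch (not the repo's code): each mode declares
# the components it needs, and a cached loader ensures each component is
# materialized at most once and only when some requested mode uses it.
from functools import lru_cache

MODE_COMPONENTS = {
    "variation": ["qwen2vl", "flux"],
    "img2img": ["qwen2vl", "flux"],
    "inpaint": ["qwen2vl", "flux", "vae"],
    "controlnet": ["qwen2vl", "flux", "controlnet"],
    "controlnet-inpaint": ["qwen2vl", "flux", "vae", "controlnet"],
}

@lru_cache(maxsize=None)
def load_component(name: str):
    # In the real project this would read checkpoint weights onto the GPU;
    # here it just returns a placeholder so the pattern is runnable.
    return f"<{name} weights>"

def load_for_mode(mode: str) -> dict:
    """Load exactly the components the requested mode needs."""
    return {name: load_component(name) for name in MODE_COMPONENTS[mode]}

models = load_for_mode("variation")
print(sorted(models))  # ['flux', 'qwen2vl']
```

On a 48GB-class GPU this kind of gating matters: a plain `variation` run can skip the ControlNet and depth models entirely rather than paying their memory cost up front.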

Maintenance & Community

  • The project is maintained by Pengqi Lu.
  • Contributions are welcome via Pull Requests.
  • Citation details for the technical report are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial use or closed-source integration.
  • High GPU memory requirements (48GB+) may be a barrier for users without specialized hardware.
Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

IP-Adapter by tencent-ailab

  • 0.3% · 6k stars
  • Adapter for image prompt in text-to-image diffusion models
  • Created 2 years ago · Updated 1 year ago