qwen2vl-flux  by erwold

Image generation model for multimodal control

created 8 months ago
553 stars

Top 58.8% on sourcepulse

Project Summary

This repository provides Qwen2VL-Flux, a controllable image generation model that unifies text and image guidance by combining Qwen2VL's multimodal understanding with the Flux architecture. It targets researchers and power users who need fine-grained image manipulation, and it adds structural control through ControlNet features such as depth and line detection.

How It Works

The model replaces the traditional text encoder in Flux with Qwen2VL, a vision-language model, so that conditioning can come from both textual prompts and visual references. ControlNet is integrated on top of this for structural guidance. The supported generation modes are variation, img2img, inpainting, and ControlNet-guided generation.

Quick Start & Requirements

  • Install: pip install -r requirements.txt after cloning the repository.
  • Prerequisites: Python 3.8+, PyTorch >= 2.4.1, Transformers 4.45.0, Diffusers 0.30.0, Accelerate 0.33.0. A CUDA-compatible GPU with 48GB+ memory is recommended.
  • Model Checkpoints: Requires downloading Qwen2VL-Flux, Qwen2VL, and optionally ControlNet, Depth Anything V2, Mistoline, and SAM2 models into a checkpoints directory.
  • Configuration: Model paths can be set in model.py or via the CHECKPOINT_DIR environment variable.
  • Usage: python main.py --mode <mode> --input_image <path> [additional options]
  • Docs: Technical Report available.
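
The configuration and usage bullets above can be sketched as a small Python helper. This is a minimal sketch, not the repository's own code: the function names `resolve_checkpoint_dir` and `build_command` are hypothetical, the fallback directory name `checkpoints` is assumed from the checkpoint bullet, and only the two flags shown in the usage line (`--mode`, `--input_image`) are used.

```python
import os
import sys
from pathlib import Path


def resolve_checkpoint_dir(env=None):
    """Return the checkpoint directory, preferring CHECKPOINT_DIR when it is set."""
    env = os.environ if env is None else env
    # Falling back to "checkpoints" is an assumption, not documented behavior.
    return Path(env.get("CHECKPOINT_DIR") or "checkpoints")


def build_command(mode, input_image, extra_args=()):
    """Build an argv list for main.py using only the two documented flags."""
    return [
        sys.executable, "main.py",
        "--mode", mode,
        "--input_image", str(input_image),
        *extra_args,  # any additional options are passed through unchanged
    ]
```

The resulting list could be handed to `subprocess.run(build_command("variation", "input.png"), check=True)` from the repository root. The mode string `"variation"` follows the mode list in this summary; the exact spelling accepted by the CLI should be checked against the README.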

Highlighted Details

  • Supports multiple generation modes: variation, img2img, inpainting, ControlNet, and ControlNet-inpaint.
  • Integrates ControlNet for line detection and depth-aware generation with adjustable strengths.
  • Features advanced options like attention control for focused generation and Turbo mode for faster inference.
  • Implements smart model loading, only loading necessary components for specific tasks to optimize memory usage.
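
The "smart model loading" idea in the last bullet can be illustrated with a lazy, per-mode loader. The sketch below is an assumption about how such a scheme might look, not the project's implementation: the component names, the mode-to-component mapping, and the `LazyComponents` class are all hypothetical.

```python
# Hypothetical mapping from generation mode to the components it needs.
REQUIRED_COMPONENTS = {
    "variation": {"qwen2vl", "flux"},
    "img2img": {"qwen2vl", "flux"},
    "inpainting": {"qwen2vl", "flux"},
    "controlnet": {"qwen2vl", "flux", "controlnet", "depth", "line"},
}


class LazyComponents:
    """Load each component at most once, and only when a mode asks for it."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument callable
        self._loaded = {}        # name -> already constructed component

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def load_for(self, mode):
        """Return only the components required by the given mode."""
        return {name: self.get(name) for name in sorted(REQUIRED_COMPONENTS[mode])}

    @property
    def loaded_names(self):
        return set(self._loaded)
```

With this design, running a variation job never constructs the ControlNet, depth, or line-detection models, which is where the memory saving would come from.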

Maintenance & Community

  • The project is maintained by Pengqi Lu.
  • Contributions are welcome via Pull Requests.
  • Citation details for the technical report are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial use or closed-source integration.
  • High GPU memory requirements (48GB+) may be a barrier for users without specialized hardware.
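
A quick pre-flight check against the 48GB recommendation might look like the following. This is a hedged sketch: both function names are hypothetical, the threshold is interpreted as decimal gigabytes (an assumption, chosen because reported totals can fall slightly below a card's nominal size), and the PyTorch query is skipped when PyTorch or CUDA is unavailable.

```python
def meets_memory_recommendation(total_bytes, recommended_gb=48):
    """Compare a device's total memory with the recommended size in decimal GB."""
    return total_bytes >= recommended_gb * 1000**3


def query_gpu_total_bytes(device_index=0):
    """Return total memory of a CUDA device in bytes, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory
```

A caller could warn rather than abort when `query_gpu_total_bytes()` returns a value that fails the check, since the 48GB figure is given as a recommendation rather than a hard limit.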

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 68 stars in the last 90 days
