qwen2vl-flux by erwold

Image generation model for multimodal control

Created 9 months ago
565 stars

Top 56.9% on SourcePulse

Project Summary

This repository provides Qwen2VL-Flux, a controllable image generation model that unifies text and image guidance by integrating Qwen2VL's multimodal understanding with the Flux architecture and Stable Diffusion. It targets researchers and power users seeking advanced image manipulation capabilities, offering enhanced control through ControlNet features like depth and line detection.

How It Works

The model enhances Stable Diffusion by replacing its traditional text encoder with Qwen2VL, a vision-language model, for superior multimodal comprehension. It leverages the Flux architecture and integrates ControlNet for precise structural guidance, enabling various generation modes like variation, img2img, inpainting, and ControlNet-guided generation. This approach allows for more nuanced control over image output using both textual prompts and visual references.
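The conditioning swap described above can be sketched in a few lines. This is an illustrative mock, not the repository's API: the function names, embedding width (`EMBED_DIM`), and conditioning width (`COND_DIM`) are assumptions chosen only to show how a vision-language model's fused text-and-image hidden states could stand in for the text-encoder output that a stock diffusion pipeline consumes.

```python
# Illustrative sketch (hypothetical names and dimensions, not the repo's code):
# a vision-language encoder fuses text and image tokens into one sequence,
# which is then projected into the conditioning space of the diffusion model,
# taking the role CLIP/T5 text embeddings play in a stock pipeline.
import numpy as np

EMBED_DIM = 3584  # assumed VLM hidden size
COND_DIM = 4096   # assumed diffusion-transformer conditioning width

rng = np.random.default_rng(0)
projection = rng.standard_normal((EMBED_DIM, COND_DIM)) * 0.02

def encode_multimodal(text_tokens: np.ndarray, image_patches: np.ndarray) -> np.ndarray:
    """Stand-in for the VLM: fuse text and image tokens into one sequence."""
    return np.concatenate([text_tokens, image_patches], axis=0)  # (seq, EMBED_DIM)

def to_diffusion_conditioning(hidden_states: np.ndarray) -> np.ndarray:
    """Project VLM hidden states into the space the diffusion model expects."""
    return hidden_states @ projection  # (seq, COND_DIM)

text = rng.standard_normal((16, EMBED_DIM))    # mock text-token embeddings
image = rng.standard_normal((64, EMBED_DIM))   # mock image-patch embeddings
cond = to_diffusion_conditioning(encode_multimodal(text, image))
print(cond.shape)  # (80, 4096)
```

Because the fused sequence carries both modalities, the same conditioning path serves text prompts, reference images, or both at once, which is what enables the variation and img2img modes alongside plain prompting.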

Quick Start & Requirements

  • Install: pip install -r requirements.txt after cloning the repository.
  • Prerequisites: Python 3.8+, PyTorch >= 2.4.1, Transformers 4.45.0, Diffusers 0.30.0, Accelerate 0.33.0. A CUDA-compatible GPU with 48GB+ memory is recommended.
  • Model Checkpoints: Requires downloading Qwen2VL-Flux, Qwen2VL, and optionally ControlNet, Depth Anything V2, Mistoline, and SAM2 models into a checkpoints directory.
  • Configuration: Model paths can be set in model.py or via the CHECKPOINT_DIR environment variable.
  • Usage: python main.py --mode <mode> --input_image <path> [additional options]
  • Docs: Technical Report available.
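The checkpoint layout above can be resolved programmatically. The sketch below is a hypothetical helper, not code from `model.py`: the directory names and the mode-to-component mapping are assumptions, but it shows the `CHECKPOINT_DIR` environment-variable override the README describes, with optional components pulled in only for the modes that need them.

```python
# Hypothetical sketch of checkpoint-path resolution (names are illustrative):
# CHECKPOINT_DIR overrides the default "checkpoints" directory, and optional
# models (ControlNet, depth) are only resolved for ControlNet-guided modes.
import os
from pathlib import Path

DEFAULT_CHECKPOINT_DIR = Path("checkpoints")

def resolve_checkpoints(mode: str) -> dict:
    root = Path(os.environ.get("CHECKPOINT_DIR", DEFAULT_CHECKPOINT_DIR))
    paths = {
        "flux": root / "qwen2vl-flux",   # assumed subdirectory names
        "qwen2vl": root / "qwen2vl",
    }
    if mode in ("controlnet", "controlnet-inpaint"):
        paths["controlnet"] = root / "controlnet"
        paths["depth"] = root / "depth-anything-v2"
    return paths

print(sorted(resolve_checkpoints("controlnet")))
```

A resolver like this keeps the CLI entry point (`python main.py --mode <mode> ...`) free of hard-coded paths while letting users relocate the multi-gigabyte checkpoints with a single environment variable.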

Highlighted Details

  • Supports multiple generation modes: variation, img2img, inpainting, ControlNet, and ControlNet-inpaint.
  • Integrates ControlNet for line detection and depth-aware generation with adjustable strengths.
  • Features advanced options like attention control for focused generation and Turbo mode for faster inference.
  • Implements smart model loading, initializing only the components a given mode requires, to reduce memory usage.
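The "smart model loading" bullet can be illustrated with a small lazy-loading pattern. This is a sketch of the general technique, not the repository's implementation; the mode-to-component table and component names are assumptions.

```python
# Illustrative lazy-loading sketch (not the repo's code): each mode declares
# the components it needs, and a cached loader ensures each component is
# materialized at most once and only when some requested mode uses it.
from functools import lru_cache

MODE_COMPONENTS = {
    "variation": ["qwen2vl", "flux"],
    "img2img": ["qwen2vl", "flux"],
    "inpaint": ["qwen2vl", "flux", "vae"],
    "controlnet": ["qwen2vl", "flux", "controlnet"],
    "controlnet-inpaint": ["qwen2vl", "flux", "vae", "controlnet"],
}

@lru_cache(maxsize=None)
def load_component(name: str):
    # In the real project this would read checkpoint weights onto the GPU;
    # here it just returns a placeholder so the pattern is runnable.
    return f"<{name} weights>"

def load_for_mode(mode: str) -> dict:
    """Load exactly the components the requested mode needs."""
    return {name: load_component(name) for name in MODE_COMPONENTS[mode]}

models = load_for_mode("variation")
print(sorted(models))  # ['flux', 'qwen2vl']
```

On a 48GB-class GPU this kind of gating matters: a plain `variation` run can skip the ControlNet and depth models entirely rather than paying their memory cost up front.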

Maintenance & Community

  • The project is maintained by Pengqi Lu.
  • Contributions are welcome via Pull Requests.
  • Citation details for the technical report are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial use or closed-source integration.
  • High GPU memory requirements (48GB+) may be a barrier for users without specialized hardware.
Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

IP-Adapter by tencent-ailab

  • 0.3% · 6k stars
  • Adapter for image prompt in text-to-image diffusion models
  • Created 2 years ago · Updated 1 year ago