IuvenisSapiens: Multimodal AI for ComfyUI
Top 93.8% on SourcePulse
This ComfyUI custom node integrates the Qwen3-VL-Instruct multimodal model, enabling users to generate captions and responses from diverse inputs including text, single images, multiple images, and video. It targets ComfyUI users seeking to leverage advanced visual-language understanding within their existing node-based workflows for tasks like image description, video analysis, and multi-image storytelling.
How It Works
The node acts as an interface to the Qwen3-VL-Instruct model, processing user-provided text prompts, single or multiple images, and video files. It analyzes these inputs to generate relevant textual outputs, such as detailed captions for images or videos, or narrative summaries that connect a series of images. The core advantage lies in bringing sophisticated multimodal AI capabilities directly into the ComfyUI ecosystem.
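To make the node-based integration concrete, here is a minimal sketch of how a ComfyUI custom node wrapping a vision-language model is typically structured. The class name, socket names, and the placeholder `generate` body are illustrative assumptions, not the actual IuvenisSapiens implementation; a real node would forward the prompt and image tensors to Qwen3-VL-Instruct and decode the generated tokens.

```python
# Sketch of a ComfyUI custom node exposing a caption/response generator.
# All names here are hypothetical; the real node's inputs and outputs may differ.

class Qwen3VLCaptionNode:
    """Generates a textual response from a prompt plus optional image batch."""

    @classmethod
    def INPUT_TYPES(cls):
        # ComfyUI reads this dict to build the node's input sockets.
        return {
            "required": {
                "prompt": ("STRING", {"multiline": True,
                                      "default": "Describe the image."}),
            },
            "optional": {
                "image": ("IMAGE",),  # single image or a batch of images
            },
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "generate"      # method ComfyUI calls when the node executes
    CATEGORY = "multimodal"

    def generate(self, prompt, image=None):
        # A real implementation would run Qwen3-VL-Instruct here.
        # This placeholder keeps the sketch self-contained and runnable.
        source = "image batch" if image is not None else "text only"
        return (f"[caption from {source} for prompt: {prompt!r}]",)


# ComfyUI discovers custom nodes through these module-level mappings.
NODE_CLASS_MAPPINGS = {"Qwen3VLCaption": Qwen3VLCaptionNode}
NODE_DISPLAY_NAME_MAPPINGS = {"Qwen3VLCaption": "Qwen3-VL Caption"}
```

Because the node returns a plain `STRING`, its output can be wired into any downstream ComfyUI node that accepts text, such as a prompt concatenator or a save-text node.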
Quick Start & Requirements
- Install the node into ComfyUI/custom_nodes/ and execute pip install -r requirements.txt.
- Related: the ComfyUI_MiniCPM-V-4_5 repository.
- Model weights are downloaded to ComfyUI/models/prompt_generator/ upon first use if not present.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats