ComfyUI_VLM_nodes by gokayfem

ComfyUI nodes for multimodal generation and prompt engineering

Created 1 year ago

549 stars

Top 58.2% on SourcePulse

Project Summary

This repository provides custom ComfyUI nodes for integrating Vision-Language Models (VLMs), Large Language Models (LLMs), and audio generation capabilities. It targets users of ComfyUI, particularly those interested in multimodal AI applications, enabling tasks like image-to-music generation, structured data extraction from images, and advanced prompt engineering.

How It Works

The nodes leverage llama-cpp-python for efficient loading and inference of LLaVA models in GGUF format, supporting various VLM architectures. For audio generation, it integrates AudioLDM-2 and ChatMusician, an LLM with intrinsic musical abilities. The project also includes nodes for structured output extraction using llama-cpp-agents, automatic prompt generation, and direct API integration with services like ChatGPT and DeepSeek.

Quick Start & Requirements

Install via git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git into ComfyUI's custom_nodes directory.
Requires Python >= 3.9.
LLaVA models (GGUF format) and their corresponding clip projectors (e.g., mmproj-model-f16.gguf) must be manually downloaded and placed in models/LLavacheckpoints.
GPU acceleration is recommended for VLM inference.
Official examples and detailed usage guides are available within the repository.

Highlighted Details

Supports a wide range of VLMs including LLaVA (1.5, 1.6), InternLM-XComposer2-VL, UForm-Gen2, Kosmos-2, Moondream, and Qwen2-VL.
Features nodes for structured output generation, keyword extraction, and multi-variant prompt suggestion.
Enables image-to-music and LLM-to-music generation pipelines.
Integrates with popular LLM APIs (ChatGPT, DeepSeek) for prompt generation and chat.

Maintenance & Community

Developed by gokayfem.
Links to examples and related repositories are provided for further exploration.

Licensing & Compatibility

The repository itself does not explicitly state a license.
Model usage is subject to the licenses of the individual VLMs and LLMs integrated (e.g., Moondream models are for research purposes only, not commercial use).

Limitations & Caveats

Some models, like InternLM-XComposer2-VL, are noted as "heavy" and require significant VRAM.
The ChatMusician integration comes with a warning that it "does NOT work perfectly" and may require re-queueing prompts.
Commercial use may be restricted depending on the specific models utilized.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

1

Star History

10 stars in the last 30 days

Explore Similar Projects

MAGIC by yxuansu

Framework for image-guided text generation using language models

Created 3 years ago

Updated 3 years ago

ComfyUI-Miaoshouai-Tagger by miaoshouai

ComfyUI extension for enhanced image captioning via fine-tuned Florence-2 model

Created 1 year ago

Updated 8 months ago

Comfyui_image2prompt by zhongpei

ComfyUI nodes for image-to-prompt workflows

Created 1 year ago

Updated 7 months ago

MILS by facebookresearch

Research paper implementation for multimodal LLM understanding

Created 1 year ago

Updated 8 months ago

z-tipo-extension by KohakuBlueleaf

SD WebUI extension for prompt upsampling using TIPO or DanTagGen

Created 1 year ago

Updated 6 days ago

awesome-prompts by songtianlun

Prompt library for multimodal AI generation

Created 4 months ago

Updated 1 month ago

org-ai by rksm

Emacs minor mode for generative AI in org-mode

Created 2 years ago

Updated 4 days ago

Starred by

Pawel Garbacki

Pawel Garbacki(Cofounder of Fireworks AI),

Andreas Jansson

Andreas Jansson(Cofounder of Replicate), and

1 more.

Emu3 by baaivision

Multimodal model for vision-language understanding and generation

Created 1 year ago

Updated 1 month ago

Starred by

Elvis Saravia

Elvis Saravia(Founder of DAIR.AI) and

Lyumin Zhang

Lyumin Zhang(Author of ControlNet).

docker-prompt-generator by soulteary

Docker image for prompt generation

Created 2 years ago

Updated 2 years ago

Starred by

Pawel Garbacki

Pawel Garbacki(Cofounder of Fireworks AI),

Thomas Wolf

Thomas Wolf(Cofounder of Hugging Face), and

3 more.

InternLM-XComposer by InternLM

Multimodal model for long-context video/audio interactions, image understanding, and composition

Created 2 years ago

Updated 7 months ago

text2video by bravekingzhang

CLI tool for text-to-video generation

Created 2 years ago

Updated 1 year ago

Starred by

Jiaming Song

Jiaming Song(Chief Scientist at Luma AI).

OmniGen by VectorSpaceLab

Image generation model for multimodal prompts

Created 1 year ago

Updated 1 month ago

Feedback? Help us improve.