ComfyUI nodes for multimodal generation and prompt engineering
This repository provides custom ComfyUI nodes that integrate Vision-Language Models (VLMs), Large Language Models (LLMs), and audio generation. It targets ComfyUI users interested in multimodal AI applications, enabling tasks such as image-to-music generation, structured data extraction from images, and advanced prompt engineering.
How It Works
The nodes use llama-cpp-python for efficient loading and inference of LLaVA models in GGUF format, supporting a range of VLM architectures. For audio generation, the project integrates AudioLDM-2 and ChatMusician, an LLM with intrinsic musical abilities. It also includes nodes for structured output extraction built on llama-cpp-agent, automatic prompt generation, and direct API integration with services such as ChatGPT and DeepSeek; sketches of the two local inference paths follow below.
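For the VLM path, here is a minimal sketch of how a LLaVA GGUF pair can be loaded and queried with llama-cpp-python. The model filename, image name, and context size are illustrative assumptions, not values pinned by this repository:

```python
# Minimal llama-cpp-python LLaVA sketch; paths and settings are illustrative.
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a data URI the chat handler can consume."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

chat_handler = Llava15ChatHandler(
    clip_model_path="models/LLavacheckpoints/mmproj-model-f16.gguf"
)
llm = Llama(
    model_path="models/LLavacheckpoints/llava-v1.5-7b.Q4_K_M.gguf",  # assumed filename
    chat_handler=chat_handler,
    n_ctx=4096,  # a larger context leaves room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You describe images precisely."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri("input.png")}},
            {"type": "text", "text": "Describe this image as a music prompt."},
        ]},
    ],
)
print(response["choices"][0]["message"]["content"])
```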
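The structured-extraction nodes are built on llama-cpp-agent; as an illustration of the same grammar-constrained idea, this sketch uses llama-cpp-python's built-in JSON-schema mode instead, reusing `llm` and `image_to_data_uri` from the sketch above. The schema fields are invented for the example:

```python
# Constrain generation to a JSON schema so the output parses reliably.
import json

schema = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "mood": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["subject", "mood", "tags"],
}

result = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri("input.png")}},
            {"type": "text", "text": "Extract subject, mood, and tags as JSON."},
        ]},
    ],
    response_format={"type": "json_object", "schema": schema},
)
data = json.loads(result["choices"][0]["message"]["content"])
```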
Quick Start & Requirements
Clone the repository into ComfyUI's custom_nodes directory:
git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git
The multimodal projector file (mmproj-model-f16.gguf) must be downloaded manually and placed in models/LLavacheckpoints; a download sketch follows below.
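One way to fetch the projector is with huggingface_hub. A minimal sketch, assuming the file is pulled from a public LLaVA-1.5 GGUF repository; the repo id below is illustrative, not a requirement of these nodes:

```python
# Hedged sketch: download the multimodal projector into ComfyUI's model folder.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="mys/ggml_llava-v1.5-7b",   # assumption: one public source of this file
    filename="mmproj-model-f16.gguf",
    local_dir="ComfyUI/models/LLavacheckpoints",  # adjust to your ComfyUI root
)
```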
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats