ComfyUI_VLM_nodes  by gokayfem

ComfyUI nodes for multimodal generation and prompt engineering

created 1 year ago
506 stars

Top 62.4% on sourcepulse

GitHubView on GitHub
Project Summary

This repository provides custom ComfyUI nodes for integrating Vision-Language Models (VLMs), Large Language Models (LLMs), and audio generation capabilities. It targets users of ComfyUI, particularly those interested in multimodal AI applications, enabling tasks like image-to-music generation, structured data extraction from images, and advanced prompt engineering.

How It Works

The nodes leverage llama-cpp-python for efficient loading and inference of LLaVA models in GGUF format, supporting various VLM architectures. For audio generation, it integrates AudioLDM-2 and ChatMusician, an LLM with intrinsic musical abilities. The project also includes nodes for structured output extraction using llama-cpp-agents, automatic prompt generation, and direct API integration with services like ChatGPT and DeepSeek.

Quick Start & Requirements

  • Install via git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git into ComfyUI's custom_nodes directory.
  • Requires Python >= 3.9.
  • LLaVA models (GGUF format) and their corresponding clip projectors (e.g., mmproj-model-f16.gguf) must be manually downloaded and placed in models/LLavacheckpoints.
  • GPU acceleration is recommended for VLM inference.
  • Official examples and detailed usage guides are available within the repository.

Highlighted Details

  • Supports a wide range of VLMs including LLaVA (1.5, 1.6), InternLM-XComposer2-VL, UForm-Gen2, Kosmos-2, Moondream, and Qwen2-VL.
  • Features nodes for structured output generation, keyword extraction, and multi-variant prompt suggestion.
  • Enables image-to-music and LLM-to-music generation pipelines.
  • Integrates with popular LLM APIs (ChatGPT, DeepSeek) for prompt generation and chat.

Maintenance & Community

  • Developed by gokayfem.
  • Links to examples and related repositories are provided for further exploration.

Licensing & Compatibility

  • The repository itself does not explicitly state a license.
  • Model usage is subject to the licenses of the individual VLMs and LLMs integrated (e.g., Moondream models are for research purposes only, not commercial use).

Limitations & Caveats

  • Some models, like InternLM-XComposer2-VL, are noted as "heavy" and require significant VRAM.
  • The ChatMusician integration comes with a warning that it "does NOT work perfectly" and may require re-queueing prompts.
  • Commercial use may be restricted depending on the specific models utilized.
Health Check
Last commit

5 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
1
Star History
21 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.