ComfyUI-JoyCaption  by 1038lab

LLaVA-powered ComfyUI node for stylized image captioning

Created 10 months ago
253 stars

Top 99.3% on SourcePulse

Summary

ComfyUI-JoyCaption provides a custom ComfyUI node for generating stylized image captions using the LLaVA model. It targets AI art creators and researchers needing automated, context-aware image descriptions. The key benefit is enabling flexible, high-quality captioning directly within ComfyUI workflows, enhancing productivity for tasks like dataset generation or content analysis.

How It Works

This node integrates LLaVA's multimodal capabilities into ComfyUI. It offers robust support for quantized GGUF models via llama-cpp-python, enabling efficient inference with reduced memory requirements. This approach allows users to leverage powerful captioning models on diverse hardware. The system includes dedicated nodes for batch image processing and caption saving, alongside configurable parameters for caption style, length, and advanced generation controls.
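
As a rough illustration of the batch processing and caption saving described above, the sketch below walks a folder, captions each image, and writes the result to a sidecar .txt file. It is not the project's implementation: caption_folder, fake_captioner, the IMAGE_SUFFIXES set, and the style and max_words parameters are all hypothetical names chosen for this example, and the captioner is a stand-in for the real model call.

```python
from pathlib import Path
from typing import Callable

# Assumed set of image types; the real nodes may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def caption_folder(
    folder: Path,
    captioner: Callable[[Path, str, int], str],
    style: str = "Descriptive",
    max_words: int = 50,
) -> list[Path]:
    """Caption every image in `folder` and save each caption next to it as .txt."""
    written = []
    # sorted() builds the file list up front, so new .txt files are not revisited.
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption = captioner(image_path, style, max_words)
        out_path = image_path.with_suffix(".txt")
        out_path.write_text(caption.strip() + "\n", encoding="utf-8")
        written.append(out_path)
    return written


def fake_captioner(image_path: Path, style: str, max_words: int) -> str:
    """Hypothetical stand-in for the model; returns a placeholder caption."""
    return f"[{style}] placeholder caption for {image_path.name}"
```

Passing the captioner in as a callable keeps the file handling independent of whichever model backend produces the text.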

Quick Start & Requirements

  • Installation: Clone the repository to ComfyUI/custom_nodes and run pip install -r requirements.txt.
  • GGUF Models: Automated installation of llama-cpp-python with CUDA support is recommended (python llama_cpp_install/llama_cpp_install.py).
  • Dependencies: ComfyUI, Python. GGUF models range from ~4GB to ~18GB in size, with varying VRAM requirements. Standard models need ~8-16GB+ VRAM.
  • Links: GitHub Repository
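
After running the installation steps above, a quick way to confirm that llama-cpp-python landed in the same Python environment ComfyUI uses is to check whether its import name, llama_cpp, can be found. The helper below is a minimal sketch using only the standard library; missing_modules is a hypothetical name and is not part of the project.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# `llama_cpp` is the import name of the llama-cpp-python package.
missing = missing_modules(["llama_cpp"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("llama-cpp-python is importable")
```

This only confirms that the package can be located; it does not verify that the build has CUDA support.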

Highlighted Details

  • Comprehensive GGUF model support across 12 quantization levels (Q2_K to F16), optimizing for memory and speed.
  • Automatic model downloading and renaming on first use.
  • Batch processing via "Caption Tools" nodes for efficient multi-image workflows.
  • Diverse caption styles (Descriptive, Tags, Artistic) and length controls.
  • Flexible memory management options (Global Cache, Keep in Memory, Clear After Run) adaptable to GPU VRAM.
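
The three memory management options can be pictured with the sketch below. The behavior assigned to each mode is an assumption made for illustration (Global Cache shares a loaded model across all node instances, Keep in Memory holds it per instance, Clear After Run releases it after every call); the ModelRunner class and its loader argument are hypothetical and do not reflect the project's code.

```python
from typing import Any, Callable

MODES = ("Global Cache", "Keep in Memory", "Clear After Run")

# Shared by every ModelRunner instance (assumed meaning of "Global Cache").
_GLOBAL_CACHE: dict[str, Any] = {}


class ModelRunner:
    """Hypothetical wrapper that decides how long a loaded model stays resident."""

    def __init__(self, loader: Callable[[str], Any], mode: str = "Keep in Memory"):
        if mode not in MODES:
            raise ValueError(f"unknown memory mode: {mode!r}")
        self.loader = loader
        self.mode = mode
        self._local: dict[str, Any] = {}  # per-instance storage

    def run(self, model_name: str, task: Callable[[Any], Any]) -> Any:
        store = _GLOBAL_CACHE if self.mode == "Global Cache" else self._local
        if model_name not in store:
            store[model_name] = self.loader(model_name)  # load only when absent
        try:
            return task(store[model_name])
        finally:
            if self.mode == "Clear After Run":
                # Drop the reference so the memory can be reclaimed.
                store.pop(model_name, None)
```

Under these assumptions, Clear After Run trades repeated load time for the lowest resident memory, while the two caching modes pay the load cost once.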

Maintenance & Community

The project's update log records frequent updates throughout 2025, which suggests active maintenance. No community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

The code is licensed under GPL-3.0. This copyleft license requires derivative works to be distributed under the same terms, potentially impacting integration with proprietary software.

Limitations & Caveats

Lower GGUF quantization levels may slightly reduce caption quality. For best performance, especially in batch processing, 12GB or more of VRAM is recommended, and input images are best processed at 512x512 resolution or higher.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days
