ComfyUI-JoyCaption by 1038lab

LLaVA-powered ComfyUI node for stylized image captioning

Created 1 year ago

288 stars

Top 91.0% on SourcePulse

Project Summary

ComfyUI-JoyCaption provides a custom ComfyUI node that generates stylized image captions using the LLaVA model. It targets AI art creators and researchers who need automated, context-aware image descriptions. Its main benefit is captioning directly inside ComfyUI workflows, which helps with tasks such as dataset generation and content analysis.

How It Works

This node integrates LLaVA's multimodal capabilities into ComfyUI. It supports quantized GGUF models via llama-cpp-python, which lowers the memory needed for inference and lets users run captioning models on a wider range of hardware. The package also includes dedicated nodes for batch image processing and caption saving, along with configurable parameters for caption style, length, and advanced generation controls.
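As a rough illustration of the batch-processing and caption-saving flow described above, the following Python sketch walks a folder of images and writes each caption to a text file with the same base name. It is not the node's actual code: `find_images`, `caption_folder`, and `placeholder_captioner` are hypothetical names, the sidecar `.txt` layout is an assumption, and the captioner is a stand-in that performs no model inference.

```python
from pathlib import Path
from typing import Callable

# Assumed set of image extensions; the real nodes may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(folder: Path) -> list[Path]:
    """Return image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def caption_folder(folder: Path, captioner: Callable[[Path], str]) -> list[Path]:
    """Caption every image and save the text next to it as <name>.txt."""
    written = []
    for image in find_images(folder):
        out = image.with_suffix(".txt")
        out.write_text(captioner(image).strip() + "\n", encoding="utf-8")
        written.append(out)
    return written


def placeholder_captioner(image: Path) -> str:
    # Hypothetical stand-in for model inference; returns a dummy caption.
    return f"caption for {image.name}"
```

Passing the captioner as a callable keeps the folder-walking logic independent of whichever model backend produces the text.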

Quick Start & Requirements

Installation: Clone the repository to ComfyUI/custom_nodes and run pip install -r requirements.txt.
GGUF Models: GGUF support requires llama-cpp-python. Installing it with CUDA support through the bundled script is recommended: python llama_cpp_install/llama_cpp_install.py.
Dependencies: ComfyUI, Python. GGUF models range from ~4GB to ~18GB in size, with varying VRAM requirements. Standard models need ~8-16GB+ VRAM.
Links: GitHub Repository
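To give a feel for why GGUF file sizes vary so much with quantization level, here is a small Python sketch that estimates weight storage from bits per weight. The three formats shown are llama.cpp block formats whose bits-per-weight values follow from their block layout; the K-quant levels use other layouts that are not modeled here. The parameter count is a hypothetical placeholder, not a figure from the README, and `estimate_weight_gb` is an illustrative name.

```python
# Bits per weight, derived from each format's block layout
# (32 weights per block, plus one 16-bit scale for the quantized formats).
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,   # 16 bytes of 4-bit values + 2-byte scale per 32 weights
    "Q8_0": 8.5,   # 32 bytes of 8-bit values + 2-byte scale per 32 weights
    "F16": 16.0,   # unquantized half precision
}


def estimate_weight_gb(n_params: float, quant: str) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring file metadata."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    n_params = 8e9  # hypothetical parameter count, for illustration only
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_weight_gb(n_params, quant):.1f} GB")
```

The estimate covers weights only; runtime VRAM use is higher because of the context cache, the vision components, and framework overhead.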

Highlighted Details

Comprehensive GGUF model support across 12 quantization levels (Q2_K to F16), optimizing for memory and speed.
Automatic model downloading and renaming on first use.
Batch processing via "Caption Tools" nodes for efficient multi-image workflows.
Diverse caption styles (Descriptive, Tags, Artistic) and length controls.
Flexible memory management options (Global Cache, Keep in Memory, Clear After Run) adaptable to GPU VRAM.

Maintenance & Community

The README logs frequent updates throughout 2025. No community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

The code is licensed under GPL-3.0. This copyleft license requires derivative works to be distributed under the same terms, potentially impacting integration with proprietary software.

Limitations & Caveats

Lower GGUF quantization levels may slightly reduce caption quality. For best performance, especially in batch processing, 12GB or more of VRAM is recommended, and input images should be at least 512x512.

Health Check

Last Commit

6 months ago

Responsiveness

Inactive

Star History

9 stars in the last 30 days