Image captioning VLM for diffusion model training, aiming for uncensored, open use
JoyCaption is an open-source Visual Language Model (VLM) designed for generating uncensored image captions, primarily aimed at users training diffusion models. It offers broad content and style coverage, including NSFW concepts, and provides detailed training scripts for community use.
How It Works
JoyCaption is built on the Llama 3.1 architecture and fine-tuned for image captioning. It takes a multimodal approach, processing both image and text inputs to generate descriptive captions (see the quick-start sketch below). The model is uncensored by design and aims to match or exceed proprietary models such as GPT-4o in captioning quality, particularly outside the SFW domain.
Quick Start & Requirements
Requires the Hugging Face transformers library; a GPU is recommended for inference.
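A minimal end-to-end sketch, assuming the Alpha Two checkpoint published on the project's Hugging Face page (the model ID below is an assumption and may change between releases) and a recent transformers version with LLaVA support; the sampling settings are illustrative, not the project's recommended defaults:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "fancyfeast/llama-joycaption-alpha-two-hf-llava"  # assumed checkpoint ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Chat-style conversation: the instruction is plain text, the image is passed
# separately; the processor interleaves image features into the token stream.
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a long descriptive caption for this image in a formal tone."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("image.jpg")], return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match model dtype

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.6, top_p=0.9)

# Strip the prompt tokens and decode only the newly generated caption.
caption = processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(caption.strip())
```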
Maintenance & Community
The project is actively developed, currently at "Alpha Two." Feedback and contributions are encouraged. Release history and announcements are linked via Reddit and Civitai.
Licensing & Compatibility
The model weights are released under an open, free license with no restrictions. Compatibility for commercial use or closed-source linking is implied by the "no restrictions" claim.
Limitations & Caveats
JoyCaption is an experimental alpha release and not production-ready. Known limitations include potential issues with character interactions, OCR, and left/right confusion. The model is heavily optimized for specific prompt formats, and results may vary with general instructions.
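As an illustration of that prompt sensitivity, requests phrased in the structured caption-mode styles the model was trained on tend to behave more predictably than free-form instructions. The wordings below are approximations; the project's documentation lists the exact templates:

```python
# Approximate examples of JoyCaption's structured caption modes; exact
# wordings live in the project's documentation and may differ.
PROMPTS = [
    "Write a long descriptive caption for this image in a formal tone.",  # Descriptive
    "Write a MidJourney prompt for this image.",                          # MidJourney style
    "Write a list of Booru tags for this image.",                         # Booru tag list
]
```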