Image captioning VLM for diffusion model training, aiming for uncensored, open use
JoyCaption is an open-source Visual Language Model (VLM) designed for generating uncensored image captions, primarily aimed at users training diffusion models. It offers broad content and style coverage, including NSFW concepts, and provides detailed training scripts for community use.
How It Works
JoyCaption is built on the Llama 3.1 architecture and fine-tuned for image captioning. It takes a multimodal approach, processing both image and text inputs to generate descriptive captions (see the quick-start sketch below). The model is uncensored by design and aims to match or exceed proprietary models such as GPT-4o in captioning quality, particularly outside the SFW domain.
Quick Start & Requirements
Requires the Hugging Face transformers library; a GPU is recommended for inference.
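A minimal end-to-end sketch, assuming the Alpha Two checkpoint published on the project's Hugging Face page (the model ID below is an assumption and may change between releases) and a recent transformers version with LLaVA support; the sampling settings are illustrative, not the project's recommended defaults:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "fancyfeast/llama-joycaption-alpha-two-hf-llava"  # assumed checkpoint ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Chat-style conversation: the instruction is plain text, the image is passed
# separately; the processor interleaves image features into the token stream.
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a long descriptive caption for this image in a formal tone."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("image.jpg")], return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match model dtype

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.6, top_p=0.9)

# Strip the prompt tokens and decode only the newly generated caption.
caption = processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(caption.strip())
```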
Maintenance & Community
The project is actively developed, currently at "Alpha Two." Feedback and contributions are encouraged. Release history and announcements are linked via Reddit and Civitai.
Licensing & Compatibility
The model weights are released under an open, free license with no restrictions. Compatibility for commercial use or closed-source linking is implied by the "no restrictions" claim.
Limitations & Caveats
JoyCaption is an experimental alpha release and not production-ready. Known limitations include potential issues with character interactions, OCR, and left/right confusion. The model is heavily optimized for specific prompt formats, and results may vary with general instructions.
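As an illustration of that prompt sensitivity, requests phrased in the structured caption-mode styles the model was trained on tend to behave more predictably than free-form instructions. The wordings below are approximations; the project's documentation lists the exact templates:

```python
# Approximate examples of JoyCaption's structured caption modes; exact
# wordings live in the project's documentation and may differ.
PROMPTS = [
    "Write a long descriptive caption for this image in a formal tone.",  # Descriptive
    "Write a MidJourney prompt for this image.",                          # MidJourney style
    "Write a list of Booru tags for this image.",                         # Booru tag list
]
```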