Multimodal model for Korean visual instruction following
Top 91.2% on sourcepulse
KoLLaVA is a multimodal assistant designed for Korean language users, enabling image-based conversations. It extends the LLaVA (Large Language and Vision Assistant) framework, offering Korean-specific datasets and fine-tuned models for enhanced performance in Korean visual question answering and instruction following.
How It Works
KoLLaVA builds upon the LLaVA architecture, which connects a vision encoder (CLIP ViT-L/14) with a large language model (LLM) via a projection layer. The project involves two main training stages: feature alignment (pretraining) using filtered CC3M datasets and visual instruction tuning using a curated Korean multimodal instruction dataset. This approach allows the model to understand and respond to queries involving both images and Korean text.
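The wiring described above can be sketched in PyTorch. The class below is a simplified illustration under stated assumptions (the class name, the exact projector shape, and the requirement to pass the Korean base LLM's Hub id are all placeholders), not the repository's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM


class KoLLaVASketch(nn.Module):
    """Illustrative wiring only; not the repository's module layout."""

    def __init__(self, llm_name: str,
                 vision_name: str = "openai/clip-vit-large-patch14"):
        super().__init__()
        # CLIP ViT-L/14 vision encoder.
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        # Korean base LLM (e.g. the Synatra-7B backbone); pass its Hub id.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        vis_dim = self.vision_tower.config.hidden_size
        llm_dim = self.llm.config.hidden_size
        # Projection layer trained during the feature-alignment stage;
        # LLaVA-1.5 uses a small MLP, earlier versions a single linear layer.
        self.mm_projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Patch features (CLS token dropped), mapped into the LLM's
        # token-embedding space so they can be spliced into the prompt.
        feats = self.vision_tower(pixel_values).last_hidden_state[:, 1:]
        return self.mm_projector(feats)
```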
Quick Start & Requirements
- Install with pip install -e . (add .[train] for training).
- Run the CLI demo: python -m llava.serve.cli --model-path tabtoyou/KoLLaVA-v1.5-Synatra-7b --image-file <image_url>
- On macOS with Apple Silicon, add --device mps (a programmatic sketch follows below).
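For use from Python rather than the CLI, the upstream LLaVA codebase exposes a small evaluation API. Assuming this fork keeps that API intact (not verified here), inference would look roughly like this:

```python
# Hedged sketch: assumes KoLLaVA preserves upstream LLaVA's Python API
# (llava.mm_utils / llava.eval.run_llava); check the repository to confirm.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "tabtoyou/KoLLaVA-v1.5-Synatra-7b"

# eval_model expects an argparse-style namespace; build one ad hoc.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "이 이미지를 설명해 주세요.",  # "Please describe this image."
    "conv_mode": None,
    "image_file": "<image_url>",        # placeholder, as in the CLI example
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```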
Highlighted Details
- Models: KoLLaVA-v1.5-Synatra-7b and KoLLaVA-LLaMA-v2-7b-qlora-4bit (a 4-bit QLoRA variant).
- Datasets: KoLLaVA-Instruct-150k (visual instruction tuning) and KoLLaVA-CC3M-Pretrain-595K (feature-alignment pretraining); see the download sketch below.
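If the datasets are published on the Hugging Face Hub under the tabtoyou namespace (an assumption; check the repository's README for the exact locations), they can be fetched with huggingface_hub:

```python
# Hedged sketch: the repo ids below are assumptions about where the
# KoLLaVA datasets are hosted on the Hugging Face Hub.
from huggingface_hub import snapshot_download

for repo_id in ("tabtoyou/KoLLaVA-Instruct-150k",
                "tabtoyou/KoLLaVA-CC3M-Pretrain-595K"):
    # Instruction-tuning (stage 2) and feature-alignment (stage 1) data.
    local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")
    print(repo_id, "->", local_dir)
```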
Maintenance & Community
Last commit about 10 months ago; the repository is currently inactive.
Licensing & Compatibility
Limitations & Caveats