KoLLaVA by tabtoyou

Multimodal model for Korean visual instruction following

created 2 years ago
293 stars

Top 91.2% on sourcepulse

Project Summary

KoLLaVA is a multimodal assistant designed for Korean language users, enabling image-based conversations. It extends the LLaVA (Large Language and Vision Assistant) framework, offering Korean-specific datasets and fine-tuned models for enhanced performance in Korean visual question answering and instruction following.

How It Works

KoLLaVA builds on the LLaVA architecture, which connects a vision encoder (CLIP ViT-L/14) to a large language model (LLM) through a projection layer. Training has two stages: feature alignment (pretraining) on a filtered CC3M dataset, followed by visual instruction tuning on a curated Korean multimodal instruction dataset. This lets the model understand and respond to queries that combine images and Korean text.
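The data flow described above can be sketched in a few lines. This is a toy illustration only, not the project's code: the dimensions, the random weights, and the helper names (make_linear, project, build_llm_input) are all assumptions made for the example, and a single linear map stands in for whatever projector a given model version actually uses.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random (in_dim x out_dim) matrix standing in for a trained projector."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(patch_features, weights):
    """Map each vision-encoder patch feature into the LLM embedding space."""
    out_dim = len(weights[0])
    return [
        [sum(f * weights[i][j] for i, f in enumerate(patch)) for j in range(out_dim)]
        for patch in patch_features
    ]


def build_llm_input(image_tokens, text_embeddings):
    """Place the projected image tokens in front of the text token embeddings."""
    return image_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16            # toy sizes; real models are far larger
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3    # hypothetical counts for illustration

patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]
weights = make_linear(VISION_DIM, LLM_DIM, rng)

sequence = build_llm_input(project(patches, weights), text)
print(len(sequence), len(sequence[0]))  # 7 tokens, each of width 16
```

The point of the sketch is that image patches and text tokens end up in one sequence of equal-width vectors, which is what allows the language model to attend over both.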

Quick Start & Requirements

  • Install: Clone the repository and install dependencies with pip install -e . (use pip install -e ".[train]" to include the training dependencies).
  • Prerequisites: Python 3.10, Conda environment recommended. macOS users may need to specify --device mps.
  • Inference: Run python -m llava.serve.cli --model-path tabtoyou/KoLLaVA-v1.5-Synatra-7b --image-file <image_url>.
  • Docs: LLaVA Project Page
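As a convenience, the inference command above can be assembled programmatically. The following is a minimal sketch: the helper build_cli_command and the example image URL are hypothetical, while the module name and the --model-path, --image-file, and --device flags are the ones listed above. It only builds and prints the command; running it requires the repository to be installed.

```python
import shlex


def build_cli_command(model_path, image_file, device=None):
    """Return the argument list for the CLI inference command."""
    cmd = [
        "python", "-m", "llava.serve.cli",
        "--model-path", model_path,
        "--image-file", image_file,
    ]
    if device:  # e.g. "mps" on macOS, as noted in the prerequisites
        cmd += ["--device", device]
    return cmd


cmd = build_cli_command(
    "tabtoyou/KoLLaVA-v1.5-Synatra-7b",
    "https://example.com/sample.jpg",  # placeholder URL, not from the project
    device="mps",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the package is installed in the active environment.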

Highlighted Details

  • Offers multiple Korean-tuned models, including KoLLaVA-v1.5-Synatra-7b and KoLLaVA-LLaMA-v2-7b-qlora-4bit.
  • Provides Korean versions of LLaVA datasets: KoLLaVA-Instruct-150k and KoLLaVA-CC3M-Pretrain-595K.
  • Training scripts are available for both pretraining and fine-tuning, supporting DeepSpeed ZeRO-2/3 and QLoRA.
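For readers unfamiliar with DeepSpeed ZeRO, a configuration of the kind such training scripts consume might look like the sketch below. This is not the project's actual config file: the helper zero_config is hypothetical, and the "auto" values assume the config is used through the Hugging Face Trainer integration, which fills them in from its own arguments.

```python
import json


def zero_config(stage):
    """Build a minimal DeepSpeed config dict for ZeRO stage 2 or 3."""
    if stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stage 2 or 3")
    return {
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "train_batch_size": "auto",
        "gradient_accumulation_steps": "auto",
        "zero_optimization": {
            "stage": stage,             # 2 shards optimizer state and gradients;
            "overlap_comm": True,       # 3 additionally shards the parameters
            "contiguous_gradients": True,
        },
    }


print(json.dumps(zero_config(2), indent=2))
```

Stage 3 lowers per-GPU memory further than stage 2 at the cost of more communication, which is the usual reason for offering both.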

Maintenance & Community

  • The project is a collaborative effort by "Team KoLLaVA-v1"; work on v1.5 was supported by "복지이십사".
  • Links to related LLaVA projects and papers are provided.

Licensing & Compatibility

  • The project's datasets and models are intended for research use only.
  • Usage is subject to the license terms of LLaMA and Vicuna and to the terms of use of GPT-4.
  • Datasets are licensed under CC BY-NC 4.0, which prohibits commercial use.

Limitations & Caveats

  • Demo services are temporarily suspended due to cloud GPU costs.
  • Some Korean datasets are DeepL translations, which may contain quality issues.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 90 days
