Multimodal model for Korean visual instruction following
Top 91.2% on sourcepulse
KoLLaVA is a multimodal assistant designed for Korean language users, enabling image-based conversations. It extends the LLaVA (Large Language and Vision Assistant) framework, offering Korean-specific datasets and fine-tuned models for enhanced performance in Korean visual question answering and instruction following.
How It Works
KoLLaVA builds upon the LLaVA architecture, which connects a vision encoder (CLIP ViT-L/14) with a large language model (LLM) via a projection layer. The project involves two main training stages: feature alignment (pretraining) using filtered CC3M datasets and visual instruction tuning using a curated Korean multimodal instruction dataset. This approach allows the model to understand and respond to queries involving both images and Korean text.
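The wiring described above can be sketched in PyTorch. The class below is a simplified illustration under stated assumptions (the class name, the exact projector shape, and the requirement to pass the Korean base LLM's Hub id are all placeholders), not the repository's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM


class KoLLaVASketch(nn.Module):
    """Illustrative wiring only; not the repository's module layout."""

    def __init__(self, llm_name: str,
                 vision_name: str = "openai/clip-vit-large-patch14"):
        super().__init__()
        # CLIP ViT-L/14 vision encoder.
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        # Korean base LLM (e.g. the Synatra-7B backbone); pass its Hub id.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        vis_dim = self.vision_tower.config.hidden_size
        llm_dim = self.llm.config.hidden_size
        # Projection layer trained during the feature-alignment stage;
        # LLaVA-1.5 uses a small MLP, earlier versions a single linear layer.
        self.mm_projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Patch features (CLS token dropped), mapped into the LLM's
        # token-embedding space so they can be spliced into the prompt.
        feats = self.vision_tower(pixel_values).last_hidden_state[:, 1:]
        return self.mm_projector(feats)
```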
Quick Start & Requirements
- Install with pip install -e . (add .[train] for training).
- Run the CLI demo: python -m llava.serve.cli --model-path tabtoyou/KoLLaVA-v1.5-Synatra-7b --image-file <image_url>
- On macOS with Apple Silicon, add --device mps (a programmatic sketch follows below).
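For use from Python rather than the CLI, the upstream LLaVA codebase exposes a small evaluation API. Assuming this fork keeps that API intact (not verified here), inference would look roughly like this:

```python
# Hedged sketch: assumes KoLLaVA preserves upstream LLaVA's Python API
# (llava.mm_utils / llava.eval.run_llava); check the repository to confirm.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "tabtoyou/KoLLaVA-v1.5-Synatra-7b"

# eval_model expects an argparse-style namespace; build one ad hoc.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "이 이미지를 설명해 주세요.",  # "Please describe this image."
    "conv_mode": None,
    "image_file": "<image_url>",        # placeholder, as in the CLI example
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```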
Highlighted Details
- Models: KoLLaVA-v1.5-Synatra-7b and KoLLaVA-LLaMA-v2-7b-qlora-4bit (a 4-bit QLoRA variant).
- Datasets: KoLLaVA-Instruct-150k (visual instruction tuning) and KoLLaVA-CC3M-Pretrain-595K (feature-alignment pretraining); see the download sketch below.
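If the datasets are published on the Hugging Face Hub under the tabtoyou namespace (an assumption; check the repository's README for the exact locations), they can be fetched with huggingface_hub:

```python
# Hedged sketch: the repo ids below are assumptions about where the
# KoLLaVA datasets are hosted on the Hugging Face Hub.
from huggingface_hub import snapshot_download

for repo_id in ("tabtoyou/KoLLaVA-Instruct-150k",
                "tabtoyou/KoLLaVA-CC3M-Pretrain-595K"):
    # Instruction-tuning (stage 2) and feature-alignment (stage 1) data.
    local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")
    print(repo_id, "->", local_dir)
```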
Maintenance & Community
Last commit about 10 months ago; the repository is currently inactive.
Licensing & Compatibility
Limitations & Caveats