KoAlpaca is an open-source language model project focused on understanding and responding to Korean instructions. It offers various models based on LLaMA and Polyglot-ko backbones, catering to researchers and developers working with Korean NLP tasks. The project provides fine-tuned models and datasets, enabling users to build Korean-specific conversational AI and instruction-following systems.
How It Works
KoAlpaca models are fine-tuned using the Stanford Alpaca methodology, adapting instruction-following techniques to Korean. The project utilizes both full fine-tuning and LoRA methods, leveraging large Korean datasets derived from sources like Naver Knowledge iN and translated Alpaca data. This approach aims to improve Korean language understanding and response generation quality compared to models trained solely on English data.
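As a concrete illustration of the Alpaca-style data format, each record pairs an instruction (optionally with input context) with a target output, which is then rendered into a single prompt string for training. The sketch below shows one way this could look in Korean; the field labels and template wording are assumptions for illustration and may differ from the exact template KoAlpaca uses.

```python
# Sketch of turning an Alpaca-style record into a Korean instruction prompt.
# The Korean template labels ("질문"/"답변", i.e. question/answer) are assumed
# for illustration and may not match the project's actual template.
def build_prompt(example: dict) -> str:
    """Render one instruction record as a single training prompt string."""
    if example.get("input"):
        return (
            f"### 질문: {example['instruction']}\n\n"
            f"### 입력: {example['input']}\n\n"
            f"### 답변: {example['output']}"
        )
    return f"### 질문: {example['instruction']}\n\n### 답변: {example['output']}"


record = {
    # "Summarize deep learning in one sentence." / Korean answer
    "instruction": "딥러닝을 한 문장으로 요약해 주세요.",
    "input": "",
    "output": "딥러닝은 다층 신경망으로 데이터의 표현을 학습하는 기계학습 방법입니다.",
}
print(build_prompt(record))
```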
Quick Start & Requirements
- Install core libraries:
pip install -U torch transformers tokenizers accelerate safetensors
- An inference example using the Hugging Face pipeline API is provided; a minimal sketch follows this list.
- Training examples for QLoRA and FSDP are available and require significant GPU resources (e.g., multiple A100s or RTX 3090/4090s); a QLoRA loading sketch also follows this list.
- Official Hugging Face models are available for direct use.
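The following is a minimal sketch of pipeline-based inference, not the project's official example. The model id, prompt template, and generation parameters are assumptions; verify the exact KoAlpaca checkpoint names on Hugging Face before running.

```python
# Minimal inference sketch using the Hugging Face pipeline API.
# Model id and prompt template are assumptions, not the project's official example.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="beomi/KoAlpaca-Polyglot-5.8B",  # assumed checkpoint id; verify on Hugging Face
    device_map="auto",                     # place weights on available GPUs
    torch_dtype="auto",
)

# Alpaca-style Korean prompt: "### Question: What is deep learning?\n\n### Answer:"
prompt = "### 질문: 딥러닝이 무엇인가요?\n\n### 답변:"
output = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```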
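For the QLoRA training path, the general pattern is to load the frozen base model in 4-bit with bitsandbytes and then attach LoRA adapters via PEFT. The sketch below illustrates that pattern under an assumed model id and illustrative hyperparameters; it is not the repository's actual training script.

```python
# QLoRA loading sketch: 4-bit quantized base model + LoRA adapters.
# Model id, target modules, and hyperparameters are assumptions, not the
# repository's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "EleutherAI/polyglot-ko-12.8b"  # example Polyglot-ko backbone

# Quantize the frozen base weights to 4-bit NF4 so large models fit in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapter matrices on top of the quantized model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],  # attention projection in GPT-NeoX-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```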
Highlighted Details
- Offers models based on Polyglot-ko (5.8B, 12.8B) and LLaMA (7B, 13B, 30B, 65B) backbones.
- Provides both full fine-tuned and LoRA-adapted weights.
- Includes a Korean instruction-following dataset (v1.1) generated from Naver Knowledge iN.
- Demonstrates improved performance on the NSMC benchmark compared to the base Polyglot-ko models.
Maintenance & Community
- Project activity and updates are logged in the README.
- Links to Hugging Face repositories for models are provided.
- No explicit community channels (Discord/Slack) are mentioned.
Licensing & Compatibility
- KoAlpaca models are typically released under licenses compatible with their base models (e.g., LLaMA's license).
- The dataset is available for research purposes.
- Commercial use depends on the underlying base model licenses.
Limitations & Caveats
- LLaMA-based models may show weaker Korean performance because their base models were pre-trained on relatively little Korean text.
- Training and inference for larger models require substantial GPU memory and compute.
- The web demo service has been discontinued as of May 2024.