Korean encoder-decoder language model
KoBART is an encoder-decoder language model pre-trained on over 40GB of Korean text, providing a robust foundation for Korean natural language processing. It is aimed at researchers and developers working on Korean NLP tasks such as classification, regression, summarization, and question answering.
How It Works
KoBART is based on the BART architecture and is pre-trained with BART's Text Infilling noise function: spans of the input text are corrupted and the model is trained to reconstruct the original, so it learns both bidirectional encoding and autoregressive generation. It uses a Character BPE tokenizer with a vocabulary of 30,000 tokens, augmented with frequently used emoticons and emojis, plus reserved unused tokens that can be repurposed to define custom downstream subtasks.
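As a minimal sketch of the tokenizer behavior, assuming the kobart package from this repository is installed (see Quick Start below) and that it exposes a get_kobart_tokenizer helper as described in the repository README:

# Sketch: tokenize Korean text with KoBART's Character BPE tokenizer.
# Assumes the `kobart` package is installed and exposes `get_kobart_tokenizer`
# (verify the helper name against the repository README).
from kobart import get_kobart_tokenizer

tokenizer = get_kobart_tokenizer()
# Common emoticons/emojis are part of the 30,000-token vocabulary, so they
# are tokenized directly instead of falling back to an unknown token.
print(tokenizer.tokenize("안녕하세요. 한국어 BART 입니다. :)"))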
Quick Start & Requirements
pip install git+https://github.com/SKT-AI/KoBART#egg=kobart
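After installation, the pretrained weights can be loaded into Hugging Face's BartModel. The sketch below assumes the package exposes get_pytorch_kobart_model and get_kobart_tokenizer helpers and that transformers and torch are installed; it is illustrative, not a definitive API reference.

# Sketch: load pretrained KoBART weights and run an encoder-decoder forward pass.
# Assumes `get_pytorch_kobart_model` / `get_kobart_tokenizer` from the installed
# kobart package; requires the `transformers` and `torch` libraries.
import torch
from transformers import BartModel
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer

tokenizer = get_kobart_tokenizer()
model = BartModel.from_pretrained(get_pytorch_kobart_model())

inputs = tokenizer(["안녕하세요."], return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"])
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)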
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats