KoBART by SKT-AI

Korean encoder-decoder language model

Created 4 years ago · 463 stars · Top 66.4% on sourcepulse

Project Summary

KoBART is an encoder-decoder language model specifically trained on over 40GB of Korean text, addressing the need for a robust Korean natural language processing foundation. It is designed for researchers and developers working with Korean NLP tasks such as classification, regression, summarization, and question answering.

How It Works

KoBART is based on the BART architecture, utilizing a Text Infilling noise function for pre-training. This approach involves corrupting input text and training the model to reconstruct the original, enabling it to learn bidirectional context and generative capabilities. It employs a Character BPE tokenizer with a vocabulary size of 30,000, augmented with common emoticons and unused tokens for custom subtask definitions.
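
To make the tokenizer behavior concrete, here is a minimal sketch based on the usage shown in the project README; it assumes the kobart package is installed (see Quick Start below) and exposes the get_kobart_tokenizer helper:

```python
# Minimal sketch of the Character BPE tokenizer, following the README's
# get_kobart_tokenizer helper (assumed to download and cache the vocab).
from kobart import get_kobart_tokenizer

tokenizer = get_kobart_tokenizer()
# The 30,000-token vocabulary includes common emoticons and emojis,
# so strings like the one below tokenize without falling back to <unk>.
print(tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o"))
```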

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/SKT-AI/KoBART#egg=kobart
  • Requires Python with PyTorch and the Hugging Face transformers library; a loading sketch follows this list.
  • Official documentation and demos are available via links in the README.
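
Once installed, the pretrained weights can be loaded through Hugging Face transformers. A minimal loading sketch, assuming the README's get_pytorch_kobart_model helper (which downloads the checkpoint and returns its local path):

```python
# Sketch: load KoBART into a standard transformers BartModel.
from transformers import BartModel
from kobart import get_kobart_tokenizer, get_pytorch_kobart_model

tokenizer = get_kobart_tokenizer()
model = BartModel.from_pretrained(get_pytorch_kobart_model())

inputs = tokenizer(["안녕하세요. 한국어 BART 입니다."], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"])
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```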

Highlighted Details

  • Achieves 90.24% accuracy on NSMC sentiment classification and a Spearman correlation of 81.66 on KorSTS (a hypothetical fine-tuning sketch follows this list).
  • Uses a Character BPE tokenizer with 30,000 vocabulary size, including emoticons.
  • The KoBART-base model has 124M parameters, with 6 encoder and 6 decoder layers.
  • Supports custom subtask token definitions.
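
For downstream tasks such as the NSMC benchmark above, the checkpoint can be wrapped in a task head. A hypothetical fine-tuning sketch using transformers' generic BartForSequenceClassification (not the repo's own fine-tuning scripts; the label mapping and helper names are assumptions):

```python
# Hypothetical sketch: NSMC-style binary sentiment classification.
# Uses transformers' generic BART classification head, not code from
# this repo; BART pools the final <eos> hidden state, so the tokenizer
# is assumed to append an <eos> token to each sequence.
import torch
from transformers import BartForSequenceClassification
from kobart import get_kobart_tokenizer, get_pytorch_kobart_model

tokenizer = get_kobart_tokenizer()
model = BartForSequenceClassification.from_pretrained(
    get_pytorch_kobart_model(), num_labels=2  # negative / positive
)

batch = tokenizer(["정말 재미있어요!"], return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (assumed label mapping)
loss = model(input_ids=batch["input_ids"], labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```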

Maintenance & Community

  • Issues can be reported via the provided link in the README.
  • Release history indicates ongoing development and bug fixes.

Licensing & Compatibility

  • Released under a modified MIT license.
  • Users must comply with the license terms for model and code usage.

Limitations & Caveats

  • Summarization performance updates are planned.
  • The model is primarily focused on Korean language tasks.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days