KoBART by SKT-AI

Korean encoder-decoder language model

Created 4 years ago · 463 stars · Top 66.4% on sourcepulse

Project Summary

KoBART is an encoder-decoder language model specifically trained on over 40GB of Korean text, addressing the need for a robust Korean natural language processing foundation. It is designed for researchers and developers working with Korean NLP tasks such as classification, regression, summarization, and question answering.

How It Works

KoBART is based on the BART architecture, utilizing a Text Infilling noise function for pre-training. This approach involves corrupting input text and training the model to reconstruct the original, enabling it to learn bidirectional context and generative capabilities. It employs a Character BPE tokenizer with a vocabulary size of 30,000, augmented with common emoticons and unused tokens for custom subtask definitions.
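
To make the tokenizer behavior concrete, here is a minimal sketch based on the usage shown in the project README; it assumes the kobart package is installed (see Quick Start below) and exposes the get_kobart_tokenizer helper:

```python
# Minimal sketch of the Character BPE tokenizer, following the README's
# get_kobart_tokenizer helper (assumed to download and cache the vocab).
from kobart import get_kobart_tokenizer

tokenizer = get_kobart_tokenizer()
# The 30,000-token vocabulary includes common emoticons and emojis,
# so strings like the one below tokenize without falling back to <unk>.
print(tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o"))
```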

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/SKT-AI/KoBART#egg=kobart
  • Requires Python with PyTorch and the Hugging Face transformers library; a loading sketch follows this list.
  • Official documentation and demos are available via links in the README.
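
Once installed, the pretrained weights can be loaded through Hugging Face transformers. A minimal loading sketch, assuming the README's get_pytorch_kobart_model helper (which downloads the checkpoint and returns its local path):

```python
# Sketch: load KoBART into a standard transformers BartModel.
from transformers import BartModel
from kobart import get_kobart_tokenizer, get_pytorch_kobart_model

tokenizer = get_kobart_tokenizer()
model = BartModel.from_pretrained(get_pytorch_kobart_model())

inputs = tokenizer(["안녕하세요. 한국어 BART 입니다."], return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"])
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```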

Highlighted Details

  • Achieves 90.24% accuracy on NSMC sentiment classification and a Spearman correlation of 81.66 on KorSTS (a hypothetical fine-tuning sketch follows this list).
  • Uses a Character BPE tokenizer with 30,000 vocabulary size, including emoticons.
  • The KoBART-base model has 124M parameters, with 6 encoder and 6 decoder layers.
  • Supports custom subtask token definitions.
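
For downstream tasks such as the NSMC benchmark above, the checkpoint can be wrapped in a task head. A hypothetical fine-tuning sketch using transformers' generic BartForSequenceClassification (not the repo's own fine-tuning scripts; the label mapping and helper names are assumptions):

```python
# Hypothetical sketch: NSMC-style binary sentiment classification.
# Uses transformers' generic BART classification head, not code from
# this repo; BART pools the final <eos> hidden state, so the tokenizer
# is assumed to append an <eos> token to each sequence.
import torch
from transformers import BartForSequenceClassification
from kobart import get_kobart_tokenizer, get_pytorch_kobart_model

tokenizer = get_kobart_tokenizer()
model = BartForSequenceClassification.from_pretrained(
    get_pytorch_kobart_model(), num_labels=2  # negative / positive
)

batch = tokenizer(["정말 재미있어요!"], return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (assumed label mapping)
loss = model(input_ids=batch["input_ids"], labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```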

Maintenance & Community

  • Issues can be reported via the provided link in the README.
  • Release history indicates ongoing development and bug fixes.

Licensing & Compatibility

  • Released under a modified MIT license.
  • Users must comply with the license terms for model and code usage.

Limitations & Caveats

  • Summarization performance updates are planned.
  • The model is primarily focused on Korean language tasks.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days