ru_transformers by mgrankin

GPT-2 finetuning notebook for Russian language models

created 5 years ago
771 stars

Top 46.2% on sourcepulse

View on GitHub
Project Summary

This repository provides training scripts and pre-trained GPT-2 models for Russian text generation, targeting researchers and developers interested in fine-tuning or deploying language models for Russian. It offers a comprehensive guide for training, evaluation, and deployment, including performance benchmarks and detailed instructions for dataset preparation and model configuration.

How It Works

The project leverages the GPT-2 architecture and implements progressive layer unfreezing for efficient transfer learning. It uses the YTTM (YouTokenToMe) BPE tokenizer, noted for its speed and smaller file sizes compared to SentencePiece. Training is optimized with mixed-precision (fp16) and supports both GPU and Google TPU acceleration.
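
The unfreezing idea can be pictured with a short sketch. The following is an illustration written against the Hugging Face transformers GPT-2 classes, not the repository's own training script; the level-to-layer mapping, the function name, and the "gpt2" checkpoint are assumptions.

```python
# Illustration only: gradual (progressive) unfreezing on a GPT-2 model.
# The level semantics are an assumption loosely modelled on the
# 0 -> 1 -> 2 -> 7 -> -1 schedule mentioned below (-1 = train everything).
from transformers import GPT2LMHeadModel

def set_unfreeze_level(model: GPT2LMHeadModel, level: int) -> None:
    """Freeze everything, then unfreeze the LM head plus the last `level`
    transformer blocks; level == -1 unfreezes the whole model."""
    for param in model.parameters():
        param.requires_grad = (level == -1)
    if level == -1:
        return
    # Always train the output head (weight-tied to the input embeddings).
    for param in model.lm_head.parameters():
        param.requires_grad = True
    # Unfreeze the top `level` transformer blocks.
    if level > 0:
        for block in model.transformer.h[-level:]:
            for param in block.parameters():
                param.requires_grad = True

model = GPT2LMHeadModel.from_pretrained("gpt2")  # placeholder checkpoint
for level in (0, 1, 2, 7, -1):
    set_unfreeze_level(model, level)
    # ... run one fine-tuning stage at this level before widening further ...
```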

Quick Start & Requirements

Highlighted Details

  • Perplexity benchmarks provided for various model sizes (124M, 355M) and training configurations on different Russian datasets.
  • Supports a gradual unfreezing strategy (levels 0, 1, 2, 7, -1, with -1 unfreezing all layers) for progressive training.
  • Includes scripts for model evaluation, text processing, and token conversion.
  • Offers a REST API deployment example using uvicorn (see the sketch after this list).
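
As a rough picture of the uvicorn deployment mentioned above, here is a hedged sketch of what such a service could look like. FastAPI, the /generate route, the request schema, and the "gpt2" checkpoint and tokenizer are illustrative assumptions, not the repository's actual endpoint (which serves its own Russian checkpoints with the YTTM tokenizer).

```python
# Hypothetical text-generation endpoint served with uvicorn (FastAPI assumed).
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import GPT2LMHeadModel, GPT2Tokenizer

app = FastAPI()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # placeholder; swap in a Russian checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

class Prompt(BaseModel):
    text: str
    max_length: int = 100

@app.post("/generate")
def generate(prompt: Prompt):
    input_ids = tokenizer.encode(prompt.text, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_length=prompt.max_length,
        do_sample=True,
        top_k=50,
        top_p=0.95,
    )
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}

# Launch:  uvicorn app:app --host 0.0.0.0 --port 8000
```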

Maintenance & Community

  • The project appears to be maintained by mgrankin.
  • Links to Telegram bots (@PorfBot, @NeuroPoetBot) for direct model interaction are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README mentions potential issues with Apex and DataParallel (apex/issues/227), which might affect mixed-precision training on certain configurations.
  • Instructions for SentencePiece installation are provided but noted as skippable if using YTTM.
  • The project relies on AWS S3 for model distribution.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai) and Ross Taylor (cofounder of General Reasoning; creator of Papers with Code).

GPT2 by ConnorJL

  GPT-2 training implementation, supporting TPUs and GPUs
  1k stars
  created 6 years ago, updated 2 years ago