LLM pretraining/SFT repo for small Chinese Llama2 models
Top 17.2% on sourcepulse
This repository provides a framework for pre-training and fine-tuning small-parameter Chinese Llama 2 models, and is aimed at LLM beginners. It offers a complete pipeline from data processing to model evaluation, enabling users to train a functional Chinese chatbot with as little as 24 GB of VRAM.
How It Works
The project uses the ChatGLM2-6B tokenizer for its compact 64k vocabulary, which is well suited to Chinese text. It supports pre-training on large Chinese corpora (up to 63.4 billion tokens) and fine-tuning on instruction datasets such as Alpaca-Zh and medical-domain data. The approach emphasizes full fine-tuning, which is feasible at these small parameter counts, with plans to incorporate parameter-efficient methods for larger models.
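To see how the 64k ChatGLM2 vocabulary behaves on Chinese text, the short sketch below tokenizes a sample sentence. It assumes the tokenizer is pulled from the THUDM/chatglm2-6b Hub repository; the project itself may bundle a local copy instead.

```python
# Sketch: inspect the ChatGLM2-6B tokenizer used by this project.
# Assumption: the tokenizer is loaded from the Hugging Face Hub;
# the repo may instead ship a local copy of it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

print(len(tokenizer))                   # vocabulary size, roughly 64k
ids = tokenizer.encode("今天天气不错。")   # a short Chinese sentence
print(ids)                              # only a handful of tokens per sentence
```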
Quick Start & Requirements
1. Place the pretraining corpora under ./data/, modify data_process.py as needed, and run python data_process.py to create pretrain_data.bin (a minimal sketch of this step follows the list).
2. Adjust the configuration in pretrain.py for the available hardware (e.g., 4x 3090), then launch pre-training with torchrun --standalone --nproc_per_node=4 pretrain.py.
3. Prepare the instruction-tuning data with python sft_data_process.py.
4. Run supervised fine-tuning with python sft.py.
5. Evaluate the resulting model with python eval.py.
6. Use screen for background execution of long-running jobs.
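The data-processing step essentially tokenizes the raw corpora and concatenates the token ids into one flat binary file that the training script can memory-map. The following is a hypothetical, minimal sketch of that idea rather than the repository's actual data_process.py; the corpus file name, the "text" field, the EOS separator, and the uint16 dtype are all assumptions.

```python
# Hypothetical sketch of the corpus -> pretrain_data.bin step; the real
# data_process.py may use different file names, fields, and separators.
# Assumptions: a JSONL corpus with a "text" field under ./data/, and token
# ids stored as uint16 (plausible given the ~64k ChatGLM2 vocabulary).
import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
eos_id = tokenizer.eos_token_id  # assumption: used as a document separator

all_ids = []
with open("./data/corpus.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]
        ids = tokenizer.encode(text, add_special_tokens=False)
        if eos_id is not None:
            ids.append(eos_id)  # mark document boundaries
        all_ids.extend(ids)

# Flat array of token ids that the pretraining script can np.memmap.
np.array(all_ids, dtype=np.uint16).tofile("./data/pretrain_data.bin")
```

Packing the ids as uint16 halves the file size compared with int32, which matters at the scale of tens of billions of tokens.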
Highlighted Details
Fine-tuned checkpoints include a medical-domain chat variant (Llama2-Chinese-218M-v3-MedicalChat).
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats