DLLXW: LLM pretraining/SFT repo for small Chinese Llama2 models
Top 16.6% on SourcePulse
This repository provides a framework for pre-training and fine-tuning small-parameter Chinese Llama 2 models, targeting LLM beginners. It offers a complete pipeline from data processing to model evaluation, enabling users to train a functional Chinese chatbot with as little as 24GB of VRAM.
How It Works
The project leverages the ChatGLM2-6B tokenizer for its efficient 64k vocabulary size, which is optimal for Chinese text. It supports pre-training on large Chinese corpora (up to 63.4 billion tokens) and fine-tuning using instruction datasets like Alpaca-Zh and medical domain data. The approach emphasizes full fine-tuning due to the model's smaller parameter count, with plans to incorporate parameter-efficient methods for larger models.
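As a quick illustration (not code from the repo), the sketch below loads the ChatGLM2-6B tokenizer through Hugging Face transformers and checks its vocabulary size and how compactly it encodes Chinese text. The model ID THUDM/chatglm2-6b and the use of trust_remote_code are assumptions about the setup, not details confirmed by this project.

```python
# Minimal sketch: inspect the ChatGLM2-6B tokenizer used for Chinese text.
# Assumes the tokenizer is pulled from the Hugging Face Hub; the repo may ship a local copy.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

print("vocab size:", tokenizer.vocab_size)  # roughly 64k entries, well suited to Chinese
sample = "今天天气不错，适合在家训练一个小模型。"
ids = tokenizer.encode(sample)
print("tokens per character:", len(ids) / len(sample))  # rough tokenization-efficiency check
```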
Quick Start & Requirements
1. Place the pre-training corpora under ./data/, modify data_process.py as needed, and run python data_process.py to create pretrain_data.bin (see the sketch after this list).
2. Adjust the training configuration in pretrain.py to match the available hardware (e.g., 4x 3090).
3. Launch pre-training with torchrun --standalone --nproc_per_node=4 pretrain.py.
4. Prepare the instruction data with python sft_data_process.py.
5. Run supervised fine-tuning with python sft.py.
6. Evaluate the resulting model with python eval.py.
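The repo's own data_process.py is not reproduced here; the sketch below shows one common way such a packing step works, tokenizing each document and concatenating the ids into a single uint16 binary that the training script can read sequentially or memory-map. The corpus path, separator token, and dtype are assumptions for illustration, not confirmed details of this project.

```python
# Hypothetical sketch of a data_process.py-style packing step (not this repo's actual code).
# Each line of a plain-text corpus is tokenized, an end-of-text id is appended as a
# document separator, and all ids are concatenated into a flat uint16 array on disk.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
eos_id = tokenizer.eos_token_id  # document separator; may differ from the repo's choice

all_ids = []
with open("./data/corpus.txt", "r", encoding="utf-8") as f:  # corpus path is illustrative
    for line in f:
        text = line.strip()
        if not text:
            continue
        ids = tokenizer.encode(text, add_special_tokens=False)
        if eos_id is not None:
            ids.append(eos_id)
        all_ids.extend(ids)

# uint16 is sufficient for a ~64k vocabulary and halves disk usage compared with int32
np.array(all_ids, dtype=np.uint16).tofile("./data/pretrain_data.bin")
print(f"wrote {len(all_ids)} tokens to pretrain_data.bin")
```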
Use screen (or a similar tool) to keep long-running training jobs alive in the background.
Highlighted Details
Pre-trained and fine-tuned checkpoints are released (e.g., Llama2-Chinese-218M-v3-MedicalChat).
Maintenance & Community
Last updated about 1 year ago; the project is currently inactive.
Licensing & Compatibility
Limitations & Caveats