LLM pre-training reproduction repo for experimentation
This repository provides a framework for reproducing the pre-training and fine-tuning (SFT, DPO) of a 1.4B parameter Chinese Large Language Model. It's designed for individuals and researchers interested in understanding and experimenting with the end-to-end LLM development pipeline, leveraging the Qwen base model and DeepSpeed for distributed training.
How It Works
The project utilizes the Qwen 1.4B model as a base, benefiting from its established tokenizer and architecture. Pre-training involves processing approximately 8 billion tokens from datasets like Wikipedia-CN, BaiduBaiKe, and SkyPile-150B. Fine-tuning includes Supervised Fine-Tuning (SFT) on instruction datasets such as Alpaca-zh and Belle, followed by Direct Preference Optimization (DPO) to align the model's outputs with desired preferences, using a specific data formatting strategy for chosen and rejected responses.
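To make the chosen/rejected formatting concrete, a DPO training sample typically pairs one prompt with a preferred and a dispreferred completion. This is a minimal sketch only; the field names and any chat template are assumptions, not the repository's exact schema, which is defined by its own data-preparation scripts.

```python
# Illustrative DPO record: one prompt paired with a preferred ("chosen") and a
# dispreferred ("rejected") response. Field names and formatting are assumptions;
# the repository's data preprocessing defines the actual schema.
dpo_sample = {
    "prompt": "介绍一下长城。",          # "Tell me about the Great Wall."
    "chosen": "长城是中国古代修建的防御工事，全长两万多公里……",  # preferred answer
    "rejected": "不知道。",              # dispreferred answer ("I don't know.")
}
```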
Quick Start & Requirements
Use the train.sh script for both pre-training and fine-tuning. Run generate_data.py for data preprocessing, then launch train.sh with the appropriate configuration file (accelerate_one_gpu.yaml, accelerate_multi_gpu.yaml, or accelerate_multi_gpus_on_multi_nodes.yaml). Multi-node training requires passwordless SSH and consistent environments across nodes.
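A minimal sketch of that run order, assuming the scripts are invoked without extra arguments; the exact flags and paths accepted by generate_data.py and train.sh are assumptions, so check the scripts themselves before running.

```bash
# Hedged sketch of the typical workflow; flags and paths are assumptions.
python generate_data.py   # preprocess the pre-training / SFT corpora
bash train.sh             # expected to invoke accelerate launch with one of the
                          # accelerate_*.yaml configs listed above (assumption)
```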
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is a personal reproduction and may not be as robust or optimized as professionally maintained libraries. The README mentions PPO is "to be done," indicating it's not yet implemented. Data preparation steps require manual modification of Python scripts.