MINI_LLM by jiahe7ay

LLM pre-training reproduction repo for experimentation

created 1 year ago
458 stars

Top 67.0% on sourcepulse

Project Summary

This repository provides a framework for reproducing the pre-training and fine-tuning (SFT, DPO) of a 1.4B parameter Chinese Large Language Model. It's designed for individuals and researchers interested in understanding and experimenting with the end-to-end LLM development pipeline, leveraging the Qwen base model and DeepSpeed for distributed training.

How It Works

The project utilizes the Qwen 1.4B model as a base, benefiting from its established tokenizer and architecture. Pre-training involves processing approximately 8 billion tokens from datasets like Wikipedia-CN, BaiduBaiKe, and SkyPile-150B. Fine-tuning includes Supervised Fine-Tuning (SFT) on instruction datasets such as Alpaca-zh and Belle, followed by Direct Preference Optimization (DPO) to align the model's outputs with desired preferences, using a specific data formatting strategy for chosen and rejected responses.
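The chosen/rejected pairing described above can be sketched in a few lines. The field names (`prompt`, `chosen`, `rejected`) follow the convention used by common DPO trainers and are an assumption, not necessarily this repo's exact schema:

```python
# Minimal sketch of building a DPO preference record (hypothetical schema;
# the repo's actual formatting lives in its data-preparation scripts).

def build_dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """Pair one preferred and one dispreferred response to the same prompt.

    Chat-style wrapping (system/user/assistant tags) is omitted here; a real
    pipeline would apply the base model's chat template to each field.
    """
    return {
        "prompt": prompt,
        "chosen": chosen,      # response the model should prefer
        "rejected": rejected,  # response the model should move away from
    }

record = build_dpo_record(
    prompt="用一句话介绍长城。",
    chosen="长城是中国古代修建的大型防御工事，全长两万多公里。",
    rejected="不知道。",
)
print(sorted(record))  # prints ['chosen', 'prompt', 'rejected']
```

During DPO training, the model is optimized so that the chosen response becomes more likely than the rejected one for the same prompt, relative to a frozen reference model.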

Quick Start & Requirements

  • Installation: Clone the repository and follow the train.sh script for pre-training and fine-tuning.
  • Data: Download pre-training corpora (Wikipedia-CN, BaiduBaiKe, SkyPile-150B) and SFT datasets (Alpaca-zh, Belle) from provided links.
  • Execution: Run generate_data.py for data preprocessing, then train.sh with appropriate configuration files (accelerate_one_gpu.yaml, accelerate_multi_gpu.yaml, or accelerate_multi_gpus_on_multi_nodes.yaml) for training. Multi-node training requires passwordless SSH and consistent environments across nodes.
  • Resources: Requires significant computational resources, including GPUs for distributed training. Specific configuration files are provided for single-GPU, multi-GPU, and multi-node setups.
  • Links: Dataset sources include Skywork (for the SkyPile-150B pre-training corpus) and BelleGroup (for the Belle SFT dataset).
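As an illustration of the data-preprocessing step, Alpaca-style records (`instruction` / `input` / `output` fields) are typically flattened into a single prompt–response pair before SFT. The template below is a hedged sketch using the public Alpaca field names; the repo's `generate_data.py` may use a different format:

```python
# Hypothetical Alpaca-zh-style record -> (prompt, response) pair for SFT.
# Field names follow the public Alpaca format; this is an illustration,
# not the repo's exact preprocessing logic.

def format_alpaca(example: dict) -> tuple[str, str]:
    instruction = example["instruction"]
    extra = example.get("input", "")
    # Append the optional context field to the instruction when present.
    prompt = instruction if not extra else f"{instruction}\n{extra}"
    return prompt, example["output"]

prompt, response = format_alpaca({
    "instruction": "把下面的句子翻译成英文。",
    "input": "你好，世界。",
    "output": "Hello, world.",
})
print(response)  # prints Hello, world.
```

In a real pipeline the resulting pair would then be wrapped in the base model's chat template and tokenized, with loss usually computed only on the response tokens.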

Highlighted Details

  • Implements a full LLM pipeline: pre-training, SFT, and DPO.
  • Leverages DeepSpeed for efficient distributed training.
  • Uses Qwen 1.4B as the base model.
  • Supports multiple Chinese instruction datasets for SFT.
  • Provides detailed steps for multi-node, multi-GPU training setup.

Maintenance & Community

  • The project is a personal endeavor by Lil2J, referencing other open-source projects.
  • Community interaction is encouraged via WeChat (wx:ForeverM1LAn).
  • Model weights for pre-trained, SFT, and DPO stages are available on Hugging Face.

Licensing & Compatibility

  • The repository does not explicitly state a license.
  • Model weights are available under unspecified licenses. Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The project is a personal reproduction and may not be as robust or optimized as professionally maintained libraries. The README mentions PPO is "to be done," indicating it's not yet implemented. Data preparation steps require manual modification of Python scripts.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 32 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Alex Cheema (cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

Top 0.1% · 806 stars
created 5 months ago · updated 2 weeks ago