MINI_LLM by jiahe7ay

LLM pre-training reproduction repo for experimentation

Created 1 year ago
471 stars

Top 64.8% on SourcePulse

View on GitHub
Project Summary

This repository provides a framework for reproducing the pre-training and fine-tuning (SFT, DPO) of a 1.4B parameter Chinese Large Language Model. It's designed for individuals and researchers interested in understanding and experimenting with the end-to-end LLM development pipeline, leveraging the Qwen base model and DeepSpeed for distributed training.

How It Works

The project builds a roughly 1.4B-parameter model on the Qwen architecture, reusing Qwen's established tokenizer. Pre-training processes approximately 8 billion tokens drawn from corpora such as Wikipedia-CN, BaiduBaiKe, and SkyPile-150B. Fine-tuning then proceeds in two stages: Supervised Fine-Tuning (SFT) on instruction datasets such as Alpaca-zh and Belle, followed by Direct Preference Optimization (DPO), which aligns the model's outputs with preferred behavior by formatting each training example as a prompt paired with a chosen and a rejected response.
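
For readers unfamiliar with the DPO data layout, the sketch below shows one common way such preference pairs are structured; the field names (prompt, chosen, rejected) and the example content are illustrative assumptions, not the repository's exact format.

```python
# Hypothetical sketch of building DPO preference pairs.
# The field names ("prompt", "chosen", "rejected") follow a common
# convention; the repository's exact formatting may differ.
import json

def build_dpo_example(prompt: str, chosen: str, rejected: str) -> dict:
    """Pair one preferred and one dispreferred completion for the same prompt."""
    return {
        "prompt": prompt,
        "chosen": chosen,      # response the model should be pushed toward
        "rejected": rejected,  # response the model should be pushed away from
    }

examples = [
    build_dpo_example(
        prompt="用一句话介绍长城。",
        chosen="长城是中国古代修建的大型防御工程，总长度超过两万公里。",
        rejected="长城位于美国纽约，是一座现代摩天大楼。",
    )
]

with open("dpo_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```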

Quick Start & Requirements

  • Installation: Clone the repository and follow the train.sh script for pre-training and fine-tuning.
  • Data: Download pre-training corpora (Wikipedia-CN, BaiduBaiKe, SkyPile-150B) and SFT datasets (Alpaca-zh, Belle) from provided links.
  • Execution: Run generate_data.py for data preprocessing (a rough sketch of this step follows this list), then run train.sh with the appropriate configuration file (accelerate_one_gpu.yaml, accelerate_multi_gpu.yaml, or accelerate_multi_gpus_on_multi_nodes.yaml). Multi-node training requires passwordless SSH and consistent environments across all nodes.
  • Resources: Requires significant computational resources, including GPUs for distributed training. Specific configuration files are provided for single-GPU, multi-GPU, and multi-node setups.
  • Links: Dataset sources include Skywork (SkyPile-150B) and BelleGroup (Belle).
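
The repository's actual preprocessing lives in generate_data.py; as a rough illustration of what that step typically involves (tokenizing raw corpus text and packing it into fixed-length blocks for causal-LM pre-training), here is a hedged sketch in which the tokenizer checkpoint, block size, and file names are all assumptions.

```python
# Hypothetical preprocessing sketch: tokenize raw text and pack it into
# fixed-length blocks for causal-LM pre-training. The tokenizer checkpoint,
# block size, and file paths are illustrative assumptions.
import numpy as np
from transformers import AutoTokenizer

BLOCK_SIZE = 1024  # assumed context length used for packing

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")  # assumed checkpoint

def pack_corpus(lines, block_size=BLOCK_SIZE):
    """Concatenate tokenized lines (each ended with EOS) and split into equal blocks."""
    ids = []
    for line in lines:
        ids.extend(tokenizer.encode(line) + [tokenizer.eos_token_id])
    n_blocks = len(ids) // block_size
    return np.array(ids[: n_blocks * block_size], dtype=np.uint32).reshape(n_blocks, block_size)

with open("wiki_cn.txt", encoding="utf-8") as f:  # placeholder corpus file
    blocks = pack_corpus(f)
np.save("pretrain_blocks.npy", blocks)
```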

Highlighted Details

  • Implements a full LLM pipeline: pre-training, SFT, and DPO.
  • Leverages DeepSpeed for efficient distributed training.
  • Builds its 1.4B-parameter model on Qwen's tokenizer and architecture (see the sketch after this list).
  • Supports multiple Chinese instruction datasets for SFT.
  • Provides detailed steps for multi-node, multi-GPU training setup.
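
To make the "reuse Qwen's tokenizer and architecture" idea concrete, the following minimal sketch instantiates a randomly initialized model from a Qwen configuration; the checkpoint name and the adjusted depth are illustrative assumptions rather than the repository's actual settings.

```python
# Hypothetical sketch: reuse a Qwen tokenizer while instantiating the same
# architecture at a different size. The checkpoint name and adjusted depth
# are illustrative assumptions, not the repository's configuration.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen1.5-1.8B"  # assumed Qwen checkpoint supplying tokenizer and config

tokenizer = AutoTokenizer.from_pretrained(base)

config = AutoConfig.from_pretrained(base)
config.num_hidden_layers = 20  # illustrative tweak to hit a different parameter budget

model = AutoModelForCausalLM.from_config(config)  # fresh, randomly initialized weights
n_params = sum(p.numel() for p in model.parameters())
print(f"vocab size: {len(tokenizer)}, parameters: {n_params / 1e9:.2f}B")
```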

Maintenance & Community

  • The project is a personal endeavor by Lil2J, referencing other open-source projects.
  • Community interaction is encouraged via WeChat (wx:ForeverM1LAn).
  • Model weights for pre-trained, SFT, and DPO stages are available on Hugging Face.
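
No specific Hugging Face repository id is quoted here, so the snippet below uses a placeholder; it is a generic transformers loading sketch, not the project's documented usage.

```python
# Generic sketch of loading released weights with transformers.
# "<hf-user>/<model-repo>" is a placeholder for the actual repository id
# published by the author; substitute it before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<hf-user>/<model-repo>"  # placeholder, not a real repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("中国的首都是", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```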

Licensing & Compatibility

  • The repository does not explicitly state a license.
  • Model weights are available under unspecified licenses. Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

The project is a personal reproduction and may not be as robust or optimized as professionally maintained libraries. The README mentions PPO is "to be done," indicating it's not yet implemented. Data preparation steps require manual modification of Python scripts.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

  • lingua by facebookresearch: LLM research codebase for training and inference. Top 0.1% on SourcePulse; 5k stars. Created 11 months ago; updated 2 months ago. Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.