LLM pre-training reproduction repo for experimentation
This repository provides a framework for reproducing the pre-training and fine-tuning (SFT, DPO) of a 1.4B parameter Chinese Large Language Model. It's designed for individuals and researchers interested in understanding and experimenting with the end-to-end LLM development pipeline, leveraging the Qwen base model and DeepSpeed for distributed training.
How It Works
The project utilizes the Qwen 1.4B model as a base, benefiting from its established tokenizer and architecture. Pre-training involves processing approximately 8 billion tokens from datasets like Wikipedia-CN, BaiduBaiKe, and SkyPile-150B. Fine-tuning includes Supervised Fine-Tuning (SFT) on instruction datasets such as Alpaca-zh and Belle, followed by Direct Preference Optimization (DPO) to align the model's outputs with desired preferences, using a specific data formatting strategy for chosen and rejected responses.
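To make the chosen/rejected formatting concrete, a DPO training sample typically pairs one prompt with a preferred and a dispreferred completion. This is a minimal sketch only; the field names and any chat template are assumptions, not the repository's exact schema, which is defined by its own data-preparation scripts.

```python
# Illustrative DPO record: one prompt paired with a preferred ("chosen") and a
# dispreferred ("rejected") response. Field names and formatting are assumptions;
# the repository's data preprocessing defines the actual schema.
dpo_sample = {
    "prompt": "介绍一下长城。",          # "Tell me about the Great Wall."
    "chosen": "长城是中国古代修建的防御工事，全长两万多公里……",  # preferred answer
    "rejected": "不知道。",              # dispreferred answer ("I don't know.")
}
```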
Quick Start & Requirements
Use the train.sh script for both pre-training and fine-tuning. Run generate_data.py for data preprocessing, then launch train.sh with the appropriate configuration file (accelerate_one_gpu.yaml, accelerate_multi_gpu.yaml, or accelerate_multi_gpus_on_multi_nodes.yaml). Multi-node training requires passwordless SSH and consistent environments across nodes.
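A minimal sketch of that run order, assuming the scripts are invoked without extra arguments; the exact flags and paths accepted by generate_data.py and train.sh are assumptions, so check the scripts themselves before running.

```bash
# Hedged sketch of the typical workflow; flags and paths are assumptions.
python generate_data.py   # preprocess the pre-training / SFT corpora
bash train.sh             # expected to invoke accelerate launch with one of the
                          # accelerate_*.yaml configs listed above (assumption)
```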
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is a personal reproduction and may not be as robust or optimized as professionally maintained libraries. The README mentions PPO is "to be done," indicating it's not yet implemented. Data preparation steps require manual modification of Python scripts.