Build MiniLLM from Scratch (Pretrain + SFT + DPO)
This repository provides a practical guide and framework for building a small-scale Large Language Model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and Direct Preference Optimization (DPO). It targets researchers and developers interested in understanding and replicating the LLM training pipeline with manageable resources, offering reproducible results and readily usable checkpoints.
How It Works
The project leverages the bert4torch and torch4keras training frameworks, which are designed for concise and efficient LLM development. It emphasizes seamless integration with the Hugging Face transformers library for inference, optimized data loading for a reduced memory footprint, and comprehensive logging for reproducibility. The approach allows for the creation of chat-capable models with customizable attributes such as the robot's name.
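As an illustration of that transformers integration, below is a minimal sketch of single-turn inference with a converted checkpoint. The checkpoint path, prompt format, and sampling parameters are assumptions for illustration only; the repository's infer.py scripts remain the authoritative reference.

    # Minimal sketch: single-turn inference through Hugging Face transformers.
    # The model directory and prompt format are assumptions; see the repository's
    # infer.py scripts for the exact converted-checkpoint layout and prompt template.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "path/to/converted_minillm_checkpoint"  # hypothetical local path
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True).eval()

    prompt = "你好，请介绍一下你自己。"  # "Hello, please introduce yourself."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(reply)

The sampling settings here are arbitrary defaults; greedy decoding works equally well for a quick smoke test.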
Quick Start & Requirements
pip install git+https://github.com/Tongjilibo/torch4keras.git
pip install git+https://github.com/Tongjilibo/bert4torch.git@dev
Training is launched with torchrun, potentially disabling NCCL's InfiniBand transport for distributed runs (export NCCL_IB_DISABLE=1). For inference, use the infer.py scripts or load converted checkpoints with transformers.
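For multi-turn use, the sketch below keeps a running message history. It assumes the converted tokenizer ships a chat template (if it does not, fall back to the prompt construction in infer.py); the path and parameter values are placeholders.

    # Minimal multi-turn chat sketch via a running message history. Assumes the
    # converted tokenizer defines a chat template; otherwise reuse the prompt
    # format from the repository's infer.py scripts.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "path/to/converted_minillm_checkpoint"  # hypothetical local path
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True).eval()

    history = []  # alternating {"role": ..., "content": ...} messages

    def chat(user_message: str, max_new_tokens: int = 128) -> str:
        history.append({"role": "user", "content": user_message})
        input_ids = tokenizer.apply_chat_template(
            history, add_generation_prompt=True, return_tensors="pt"
        )
        with torch.no_grad():
            output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
        reply = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": reply})
        return reply

    print(chat("你叫什么名字？"))  # "What is your name?" — exercises the customizable robot name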
Maintenance & Community
The project is actively maintained, with recent updates including new SFT models and multi-turn dialogue support. A WeChat group is available for community discussion (invitation required).
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Whether it is compatible with commercial use or closed-source linking would require clarification of the specific license terms.
Limitations & Caveats
The project explicitly states that current models possess only basic chat functionality and are not capable of answering complex questions due to limitations in corpus size, model scale, and SFT data quality. The DPO stage is still under testing.