build_MiniLLM_from_scratch by Tongjilibo

MiniLLM built from scratch (pretrain + SFT + DPO)

Created 1 year ago
475 stars

Top 64.4% on SourcePulse

View on GitHub
Project Summary

This repository provides a practical guide and framework for building a small-scale Large Language Model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and Direct Preference Optimization (DPO). It targets researchers and developers interested in understanding and replicating the LLM training pipeline with manageable resources, offering reproducible results and readily usable checkpoints.

How It Works

The project leverages the bert4torch and torch4keras training frameworks, which are designed for concise and efficient LLM development. It emphasizes seamless integration with the Hugging Face transformers library for inference, optimized data loading to keep the memory footprint low, and comprehensive logging for reproducibility. The resulting models are chat-capable, with customizable attributes such as the bot's name.
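
To make the data-loading point concrete, below is a generic sketch of memory-mapped loading of a pre-tokenized corpus, a common way to keep memory usage low during pre-training; the file name, dtype, and block size are illustrative assumptions, and the repository's own pipeline may differ in its details.

    # Generic sketch of low-memory data loading from a pre-tokenized corpus.
    # The file name, dtype, and block size are illustrative assumptions; they
    # do not describe the repository's exact data format.
    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class PretokenizedDataset(Dataset):
        """Serves fixed-length blocks from a token-id file without loading it into RAM."""
        def __init__(self, path="corpus.bin", block_size=512, dtype=np.uint16):
            # np.memmap keeps the file on disk and pages in only the slices we read.
            self.tokens = np.memmap(path, dtype=dtype, mode="r")
            self.block_size = block_size

        def __len__(self):
            return (len(self.tokens) - 1) // self.block_size

        def __getitem__(self, idx):
            start = idx * self.block_size
            chunk = self.tokens[start:start + self.block_size + 1].astype(np.int64)
            x = torch.from_numpy(chunk[:-1])   # input tokens
            y = torch.from_numpy(chunk[1:])    # next-token targets
            return x, y

    # Usage: stream batches during pre-training with a small, constant memory cost.
    # loader = DataLoader(PretokenizedDataset("corpus.bin"), batch_size=8, shuffle=True)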

Quick Start & Requirements

  • Install dependencies:
    pip install git+https://github.com/Tongjilibo/torch4keras.git
    pip install git+https://github.com/Tongjilibo/bert4torch.git@dev
    
  • Training is launched with torchrun; distributed runs may need NCCL's InfiniBand transport disabled (export NCCL_IB_DISABLE=1).
  • Inference can be performed with the provided infer.py scripts, or with transformers after converting the checkpoints (see the sketch after this list).
  • Official quick-start and model details are available in the README.
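
As a concrete illustration of the transformers route mentioned above, the sketch below loads a converted checkpoint and generates a reply. The checkpoint path, prompt, and generation settings are placeholders; the actual chat format is defined by the repository's infer.py scripts.

    # Minimal inference sketch with Hugging Face transformers, assuming a
    # checkpoint already converted to the transformers format. The local path
    # and prompt below are placeholders, not values from the repository.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    ckpt_dir = "./converted_checkpoint"  # hypothetical path to a converted model
    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir, trust_remote_code=True)
    model.eval()

    prompt = "你好"  # single-turn query; real runs should follow infer.py's prompt format
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=True,
            top_k=40,
            temperature=0.8,
        )
    # Decode only the newly generated tokens, dropping the echoed prompt.
    reply = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(reply)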

Highlighted Details

  • Offers pre-trained and SFT models in 0.2B and 1.1B parameter sizes.
  • Supports multi-turn dialogue capabilities in SFT models.
  • Provides detailed training logs and hardware requirements for reproducibility.
  • Includes a variety of Chinese pre-training and SFT datasets.

Maintenance & Community

The project is actively maintained, with recent updates including new SFT models and multi-turn dialogue support. A WeChat group is available for community discussion (invitation required).

Licensing & Compatibility

The repository's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the specific license terms.

Limitations & Caveats

The project explicitly states that current models possess only basic chat functionality and are not capable of answering complex questions due to limitations in corpus size, model scale, and SFT data quality. The DPO stage is still under testing.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 30 days
