build_MiniLLM_from_scratch by Tongjilibo

MiniLLM built from scratch (pretrain + SFT + DPO)

created 1 year ago
459 stars

Top 66.9% on sourcepulse

View on GitHub
Project Summary

This repository provides a practical guide and framework for building a small-scale Large Language Model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and Direct Preference Optimization (DPO). It targets researchers and developers interested in understanding and replicating the LLM training pipeline with manageable resources, offering reproducible results and readily usable checkpoints.
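
The DPO stage refers to the standard Direct Preference Optimization objective: given a chosen and a rejected response to the same prompt, the policy is trained to widen its log-probability margin between them relative to a frozen reference model. The PyTorch sketch below illustrates that general objective, not the repository's exact implementation; the tensor names are placeholders.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Standard DPO loss. Inputs are per-example sequence log-probabilities
        (shape [batch]) from the trained policy and a frozen reference model."""
        policy_margin = policy_chosen_logps - policy_rejected_logps
        ref_margin = ref_chosen_logps - ref_rejected_logps
        # -log(sigmoid(beta * (policy margin minus reference margin))), batch-averaged
        return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()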

How It Works

The project leverages the bert4torch and torch4keras training frameworks, designed for concise and efficient LLM development. It emphasizes seamless integration with the Hugging Face transformers library for inference, optimized data loading for reduced memory footprint, and comprehensive logging for reproducibility. The approach allows for the creation of chat-capable models with customizable attributes like robot names.
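
As a concrete illustration of the transformers integration, a converted checkpoint can be loaded and queried like any other causal language model. The sketch below assumes a hypothetical local path to a converted SFT checkpoint; the released model names and the exact chat prompt format are documented in the repository's README.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # "path/to/converted_checkpoint" is a placeholder for a locally converted
    # SFT checkpoint; see the repository README for the released model names.
    model_dir = "path/to/converted_checkpoint"
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

    # A simple Chinese prompt: "Hello, please introduce yourself."
    inputs = tokenizer("你好，请介绍一下你自己。", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))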

Quick Start & Requirements

  • Install dependencies:
    pip install git+https://github.com/Tongjilibo/torch4keras.git
    pip install git+https://github.com/Tongjilibo/bert4torch.git@dev
    
  • Training uses torchrun; distributed runs may need NCCL's InfiniBand transport disabled (export NCCL_IB_DISABLE=1).
  • Inference can be performed with the provided infer.py scripts or by loading converted checkpoints with transformers.
  • Official quick-start and model details are available in the README.

Highlighted Details

  • Offers pre-trained and SFT models in 0.2B and 1.1B parameter sizes.
  • Supports multi-turn dialogue capabilities in SFT models.
  • Provides detailed training logs and hardware requirements for reproducibility.
  • Includes a variety of Chinese pre-training and SFT datasets.

Maintenance & Community

The project is actively maintained, with recent updates including new SFT models and multi-turn dialogue support. A WeChat group is available for community discussion (invitation required).

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README, so suitability for commercial use or closed-source integration would need to be clarified with the maintainer.

Limitations & Caveats

The project explicitly states that current models possess only basic chat functionality and are not capable of answering complex questions due to limitations in corpus size, model scale, and SFT data quality. The DPO stage is still under testing.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 38 stars in the last 90 days

Explore Similar Projects

Starred by Lukas Biewald (Cofounder of Weights & Biases), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 1 more.

DialoGPT by microsoft: response generation model via large-scale pretraining (2k stars, created 6 years ago, updated 2 years ago)