mini_qwen by qiufengqijun

A project for training a large language model (LLM) from scratch

created 6 months ago
530 stars

Top 60.5% on sourcepulse

Project Summary

This project provides a comprehensive guide and codebase for training a 1-billion-parameter large language model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO). It targets researchers and developers who want to understand and replicate the LLM training pipeline under resource constraints, demonstrating that training is feasible even on modest GPUs such as the T4.
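As a rough illustration of the DPO stage, the sketch below runs trl's DPOTrainer on a tiny hand-made preference dataset; the checkpoint name and toy data are placeholders rather than the project's actual configuration, and it assumes the trl 0.11.x / transformers 4.45 versions pinned in the Quick Start section.

    # Hypothetical DPO sketch (not mini_qwen's actual script): a toy preference
    # dataset with "prompt"/"chosen"/"rejected" columns and trl's DPOTrainer.
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Toy preference pairs; real DPO data would contain tens of thousands of them.
    pairs = Dataset.from_dict({
        "prompt":   ["What is 2 + 2?"],
        "chosen":   ["2 + 2 equals 4."],
        "rejected": ["2 + 2 equals 5."],
    })

    trainer = DPOTrainer(
        model=model,
        ref_model=None,           # trl clones the model as the frozen reference
        tokenizer=tokenizer,
        train_dataset=pairs,
        args=DPOConfig(
            output_dir="dpo_out",
            beta=0.1,             # controls how far the policy may drift from the reference
            max_length=512,
            max_prompt_length=128,
            per_device_train_batch_size=1,
        ),
    )
    trainer.train()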

How It Works

The project builds upon the Qwen2.5-0.5B-Instruct model, expanding its architecture and initializing the parameters randomly. It uses a curated dataset of 16B tokens for pre-training, 9M examples for SFT, and 60K examples for DPO, sourced from reputable institutions. Training employs flash_attention_2 for attention acceleration and DeepSpeed for distributed training across multiple H800 GPUs. The project also explores concepts like scaling laws, the "repeater phenomenon," and knowledge injection during fine-tuning.
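The sketch below illustrates the general idea of growing a published 0.5B config into a larger, randomly initialized model with flash_attention_2 enabled, assuming the Hugging Face transformers API; the expanded dimensions are hypothetical placeholders, not the values mini_qwen actually uses.

    # Illustrative sketch only: grow the Qwen2.5-0.5B config and build a randomly
    # initialized model from it. The expanded dimensions below are hypothetical,
    # not the exact settings used by mini_qwen.
    import torch
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Hypothetical expansion toward roughly 1B parameters.
    config.hidden_size = 1536
    config.num_attention_heads = 16
    config.num_key_value_heads = 2
    config.intermediate_size = 6144
    config.num_hidden_layers = 24

    # from_config() gives fresh random weights (no pretrained checkpoint is loaded).
    model = AutoModelForCausalLM.from_config(
        config,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # needs a matching flash-attn build
    )
    print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")

Pre-training then proceeds from these random weights on the 16B-token corpus.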

Quick Start & Requirements

  • Install dependencies: pip install flash-attn trl==0.11.4 transformers==4.45.0
  • Run the example scripts: python demo/demo_pt.py, python demo/demo_sft.py, python demo/demo_dpo.py (see the sketch after this list)
  • Requires Python, PyTorch, and CUDA; the flash-attn build must match the installed PyTorch and CUDA versions.
  • Official documentation and examples are available within the repository.
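For orientation, here is a minimal SFT sketch in the spirit of demo/demo_sft.py, assuming the pinned trl 0.11.x and transformers 4.45 releases; the checkpoint and dataset are stand-ins rather than the repository's actual choices.

    # Hypothetical minimal SFT run in the spirit of demo/demo_sft.py; the model
    # and dataset names are placeholders, not mini_qwen's actual data.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTConfig, SFTTrainer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"           # stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Any dataset with a plain-text column works for this sketch.
    dataset = load_dataset("imdb", split="train[:1000]")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(
            output_dir="sft_out",
            dataset_text_field="text",
            max_seq_length=1024,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
        ),
    )
    trainer.train()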

Highlighted Details

  • Pre-training and SFT are achievable with as little as 12GB VRAM.
  • DPO training requires approximately 14GB VRAM.
  • Detailed logs and analysis of training stages, including loss curves and model evaluations.
  • Explores the "repeater phenomenon" and its potential causes and mitigation strategies (see the sketch after this list).
  • Investigates the impact of mixed-language data and scaling laws on model performance.
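The decoding options below are a generic, inference-time way to suppress repetitive output with the standard transformers generate API; they are not necessarily the mitigation the project itself adopts, and the checkpoint path is a placeholder.

    # Sketch of inference-time mitigations for the "repeater phenomenon" using
    # standard Hugging Face generation options; the checkpoint path is a
    # placeholder, not a file shipped by mini_qwen.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "path/to/your/mini_qwen_checkpoint"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

    inputs = tokenizer("Please introduce large language models.", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,             # sampling instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,     # discourage already-generated tokens
        no_repeat_ngram_size=4,     # hard-block repeated 4-grams
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))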

Maintenance & Community

The project is maintained by its author, qiufengqijun; community interaction and discussion are encouraged.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The project notes that DPO did not significantly improve performance and may even degrade it in some configurations, suggesting that careful hyperparameter tuning and data quality are critical for the preference-optimization stage. The "repeater phenomenon" persists even in fine-tuned models, though it is somewhat mitigated. The project also highlights compatibility issues with specific versions of trl and transformers, and the need to match flash-attn with the installed PyTorch and CUDA versions.
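A small environment probe, sketched below, can catch these version mismatches before a training run; the expected versions in the comments simply mirror the install command in the Quick Start section.

    # Quick environment check for the version constraints mentioned above
    # (illustrative; adjust the pinned versions to match your setup).
    import torch
    import transformers
    import trl

    print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
    print("transformers:", transformers.__version__)  # expected 4.45.0
    print("trl:", trl.__version__)                     # expected 0.11.4

    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)
    except ImportError:
        print("flash-attn not installed or incompatible with this torch/CUDA build")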

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 167 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp and comma.ai), and 10 more.

TinyLlama by jzhang38

Top 0.3% on sourcepulse
9k stars
Tiny pretraining project for a 1.1B Llama model
created 1 year ago
updated 1 year ago