mini_qwen by qiufengqijun

A project for training a large language model (LLM) from scratch

created 6 months ago
530 stars

Top 60.5% on sourcepulse

Project Summary

This project provides a comprehensive guide and codebase for training a 1-billion-parameter large language model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO). It targets researchers and developers who want to understand and replicate the LLM training pipeline under resource constraints, demonstrating that training is feasible even on modest GPUs such as the T4.
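As a rough illustration of the DPO stage, the sketch below runs trl's DPOTrainer on a tiny hand-made preference dataset; the checkpoint name and toy data are placeholders rather than the project's actual configuration, and it assumes the trl 0.11.x / transformers 4.45 versions pinned in the Quick Start section.

    # Hypothetical DPO sketch (not mini_qwen's actual script): a toy preference
    # dataset with "prompt"/"chosen"/"rejected" columns and trl's DPOTrainer.
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Toy preference pairs; real DPO data would contain tens of thousands of them.
    pairs = Dataset.from_dict({
        "prompt":   ["What is 2 + 2?"],
        "chosen":   ["2 + 2 equals 4."],
        "rejected": ["2 + 2 equals 5."],
    })

    trainer = DPOTrainer(
        model=model,
        ref_model=None,           # trl clones the model as the frozen reference
        tokenizer=tokenizer,
        train_dataset=pairs,
        args=DPOConfig(
            output_dir="dpo_out",
            beta=0.1,             # controls how far the policy may drift from the reference
            max_length=512,
            max_prompt_length=128,
            per_device_train_batch_size=1,
        ),
    )
    trainer.train()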

How It Works

The project builds upon the Qwen2.5-0.5B-Instruct model, expanding its architecture and initializing the parameters randomly. It uses a curated dataset of 16B tokens for pre-training, 9M examples for SFT, and 60K examples for DPO, sourced from reputable institutions. Training employs flash_attention_2 for attention acceleration and DeepSpeed for distributed training across multiple H800 GPUs. The project also explores concepts like scaling laws, the "repeater phenomenon," and knowledge injection during fine-tuning.
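The sketch below illustrates the general idea of growing a published 0.5B config into a larger, randomly initialized model with flash_attention_2 enabled, assuming the Hugging Face transformers API; the expanded dimensions are hypothetical placeholders, not the values mini_qwen actually uses.

    # Illustrative sketch only: grow the Qwen2.5-0.5B config and build a randomly
    # initialized model from it. The expanded dimensions below are hypothetical,
    # not the exact settings used by mini_qwen.
    import torch
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Hypothetical expansion toward roughly 1B parameters.
    config.hidden_size = 1536
    config.num_attention_heads = 16
    config.num_key_value_heads = 2
    config.intermediate_size = 6144
    config.num_hidden_layers = 24

    # from_config() gives fresh random weights (no pretrained checkpoint is loaded).
    model = AutoModelForCausalLM.from_config(
        config,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # needs a matching flash-attn build
    )
    print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")

Pre-training then proceeds from these random weights on the 16B-token corpus.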

Quick Start & Requirements

  • Install dependencies: pip install flash-attn trl==0.11.4 transformers==4.45.0
  • Run the example scripts: python demo/demo_pt.py, python demo/demo_sft.py, python demo/demo_dpo.py (see the sketch after this list)
  • Requires Python, PyTorch, and CUDA; the flash-attn build must match the installed PyTorch and CUDA versions.
  • Official documentation and examples are available within the repository.
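For orientation, here is a minimal SFT sketch in the spirit of demo/demo_sft.py, assuming the pinned trl 0.11.x and transformers 4.45 releases; the checkpoint and dataset are stand-ins rather than the repository's actual choices.

    # Hypothetical minimal SFT run in the spirit of demo/demo_sft.py; the model
    # and dataset names are placeholders, not mini_qwen's actual data.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTConfig, SFTTrainer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"           # stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Any dataset with a plain-text column works for this sketch.
    dataset = load_dataset("imdb", split="train[:1000]")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(
            output_dir="sft_out",
            dataset_text_field="text",
            max_seq_length=1024,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
        ),
    )
    trainer.train()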

Highlighted Details

  • Pre-training and SFT are achievable with as little as 12GB VRAM.
  • DPO training requires approximately 14GB VRAM.
  • Detailed logs and analysis of training stages, including loss curves and model evaluations.
  • Explores the "repeater phenomenon" and its potential causes and mitigation strategies (see the sketch after this list).
  • Investigates the impact of mixed-language data and scaling laws on model performance.
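The decoding options below are a generic, inference-time way to suppress repetitive output with the standard transformers generate API; they are not necessarily the mitigation the project itself adopts, and the checkpoint path is a placeholder.

    # Sketch of inference-time mitigations for the "repeater phenomenon" using
    # standard Hugging Face generation options; the checkpoint path is a
    # placeholder, not a file shipped by mini_qwen.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "path/to/your/mini_qwen_checkpoint"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

    inputs = tokenizer("Please introduce large language models.", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,             # sampling instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,     # discourage already-generated tokens
        no_repeat_ngram_size=4,     # hard-block repeated 4-grams
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))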

Maintenance & Community

The project is maintained by its author, qiufengqijun; community interaction and discussion are encouraged.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The project notes that DPO did not significantly improve performance and may even degrade it in some configurations, suggesting that careful hyperparameter tuning and data quality are critical for the preference-optimization stage. The "repeater phenomenon" persists even in fine-tuned models, though it is somewhat mitigated. The project also highlights compatibility issues with specific versions of trl and transformers, and the need to match flash-attn with the installed PyTorch and CUDA versions.
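A small environment probe, sketched below, can catch these version mismatches before a training run; the expected versions in the comments simply mirror the install command in the Quick Start section.

    # Quick environment check for the version constraints mentioned above
    # (illustrative; adjust the pinned versions to match your setup).
    import torch
    import transformers
    import trl

    print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
    print("transformers:", transformers.__version__)  # expected 4.45.0
    print("trl:", trl.__version__)                     # expected 0.11.4

    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)
    except ImportError:
        print("flash-attn not installed or incompatible with this torch/CUDA build")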

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 167 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp and comma.ai), and 10 more.

TinyLlama by jzhang38

Top 0.3% on sourcepulse
9k stars
Tiny pretraining project for a 1.1B Llama model
created 1 year ago
updated 1 year ago