mini_qwen by qiufengqijun

A project for training a large language model (LLM) from scratch

Created 8 months ago
596 stars

Top 54.8% on SourcePulse

View on GitHub
Project Summary

This project provides a comprehensive guide and codebase for training a 1-billion-parameter large language model (LLM) from scratch, covering pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO). It targets researchers and developers who want to understand and replicate the LLM training pipeline under resource constraints, demonstrating that training is feasible even on consumer-grade GPUs such as the T4.

How It Works

The project builds on the Qwen2.5-0.5B-Instruct model, expanding its architecture and initializing parameters randomly. It uses a curated dataset of 16B tokens for pre-training, 9M examples for SFT, and 60K examples for DPO, sourced from reputable institutions. Training employs flash_attention_2 for acceleration and DeepSpeed for distributed training, achieving efficient runs on multiple H800 GPUs. The project also explores scaling laws, the "repeater phenomenon," and knowledge injection during fine-tuning.
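
The idea of expanding a 0.5B base architecture to roughly 1B parameters can be made concrete with a back-of-envelope parameter count. The dimensions below are illustrative assumptions, not the project's actual config, and the attention term ignores Qwen's grouped-query K/V savings, so the base figure lands slightly above the real 0.5B.

```python
# Rough parameter count for a decoder-only transformer with SwiGLU MLPs
# and tied input/output embeddings. Illustrative sketch only; the scaled
# dimensions are hypothetical, not taken from the mini_qwen repo.

def transformer_params(vocab, hidden, layers, intermediate):
    embed = vocab * hidden                 # token embeddings (tied with LM head)
    attn = 4 * hidden * hidden             # Q, K, V, O projections (no GQA savings)
    mlp = 3 * hidden * intermediate        # gate, up, down projections (SwiGLU)
    norms = 2 * hidden                     # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    return embed + layers * per_layer + hidden  # + final RMSNorm

# Qwen2.5-0.5B-like dimensions, then a hypothetical widened variant near 1B.
base = transformer_params(vocab=151_936, hidden=896, layers=24, intermediate=4_864)
scaled = transformer_params(vocab=151_936, hidden=1_280, layers=24, intermediate=6_912)
print(f"base ≈ {base / 1e9:.2f}B, scaled ≈ {scaled / 1e9:.2f}B")
# → base ≈ 0.53B, scaled ≈ 0.99B
```

Widening the hidden size (and the MLP with it) is the cheapest lever here, since embedding parameters grow only linearly with hidden size while each layer grows quadratically.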

Quick Start & Requirements

  • Install dependencies: pip install flash-attn trl==0.11.4 transformers==4.45.0
  • Run example scripts: python demo/demo_pt.py, python demo/demo_sft.py, python demo/demo_dpo.py
  • Requires Python, PyTorch, and CUDA; flash-attn must be built against matching PyTorch and CUDA versions.
  • Official documentation and examples are available within the repository.
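
Because flash-attn compiles against the locally installed PyTorch and CUDA toolchain, install order matters. A common pattern (a sketch, not the repo's official instructions) is:

```shell
# Install torch first so flash-attn can compile against it; the
# --no-build-isolation flag is the flash-attn project's recommended way
# to expose the already-installed torch to the build.
pip install torch                               # pick the wheel matching your CUDA version
pip install flash-attn --no-build-isolation     # compiles against the local torch/CUDA
pip install trl==0.11.4 transformers==4.45.0    # pinned versions from the README

# Sanity-check that everything imports together:
python -c "import flash_attn, trl, transformers; print(trl.__version__, transformers.__version__)"
```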

Highlighted Details

  • Pre-training and SFT are achievable with as little as 12GB VRAM.
  • DPO training requires approximately 14GB VRAM.
  • Detailed logs and analysis of training stages, including loss curves and model evaluations.
  • Explores the "repeater phenomenon" and its potential causes and mitigation strategies.
  • Investigates the impact of mixed-language data and scaling laws on model performance.
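
The 12–14GB VRAM figures are plausible from simple accounting. The sketch below is a rough estimate under assumed settings (bf16 weights and gradients, fp32 AdamW moments), not the project's measured numbers, and it excludes activations, which gradient checkpointing keeps small.

```python
# Back-of-envelope VRAM estimate for fine-tuning a 1B-parameter model.
# Assumptions (not from the repo): bf16 weights/grads, fp32 AdamW moments.
GIB = 2**30
params = 1.0e9

weights = params * 2        # bf16: 2 bytes per parameter
grads = params * 2          # bf16 gradients
adam = params * 8           # two fp32 AdamW moment buffers (4 bytes each)

total_gib = (weights + grads + adam) / GIB
print(f"weights + grads + optimizer ≈ {total_gib:.1f} GiB (before activations)")
# → weights + grads + optimizer ≈ 11.2 GiB (before activations)
```

DPO additionally holds a frozen reference model in memory (about 2 GiB more at bf16 for 1B parameters), which is consistent with the higher ~14GB figure, though that is inference from the numbers rather than the project's own breakdown.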

Maintenance & Community

The project is actively maintained by qiufengqijun. Community interaction and discussion are encouraged.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The project notes that DPO did not significantly improve performance and in some configurations even degraded it, suggesting that careful hyperparameter tuning and data quality are critical for preference-alignment stages. The "repeater phenomenon" persists even in fine-tuned models, though it is somewhat mitigated. The project also highlights potential compatibility issues with specific versions of trl and transformers, and the need to match flash-attn to the installed PyTorch and CUDA versions.

Health Check
Last Commit

7 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
47 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

0.4%
265
Efficiently train foundation models with PyTorch
Created 1 year ago
Updated 1 month ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

InternEvo by InternLM

0.2%
407
Lightweight training framework for model pre-training
Created 1 year ago
Updated 4 weeks ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 19 more.

trlx by CarperAI

0.0%
5k
Distributed RLHF for LLMs
Created 3 years ago
Updated 1 year ago