diy-llm by datawhalechina

LLM construction course for hands-on system building

Created 4 months ago
463 stars

Top 65.3% on SourcePulse

Project Summary

Summary

This project offers a systematic, code-driven curriculum for building Large Language Models (LLMs), specifically tailored for Chinese learners. It bridges theoretical understanding with practical implementation, enabling users to construct their own LLMs from scratch. The course provides valuable engineering experience and a strong foundation for LLM research and development, targeting individuals with existing Python and deep learning knowledge.

How It Works

Adapting Stanford's CS336, this project reconstructs the LLM knowledge system for Chinese speakers with a hands-on coding focus. It breaks down LLM construction into six progressive assignments, covering core components from tokenization and Transformer architectures to distributed training, inference, and alignment. The approach emphasizes "thinking with code" and provides practical, localized solutions relevant to the Chinese tech ecosystem.
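As a flavor of the "thinking with code" approach, the tokenization stage the course starts from boils down to repeatedly merging the most frequent adjacent token pair (byte-pair encoding). The sketch below is generic and illustrative; the function names are assumptions, not code from the diy-llm repository:

```python
# Illustrative BPE merge step (hypothetical helper names, not from diy-llm).
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababcab")
pair = most_frequent_pair(tokens)   # ('a', 'b') is the most common pair
tokens = merge_pair(tokens, pair)   # ['ab', 'ab', 'ab', 'c', 'ab']
```

A full tokenizer assignment would iterate this merge loop to build a vocabulary, but the core data-structure work is just this pair-counting step.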

Quick Start & Requirements

Clone the repository (git clone https://github.com/datawhalechina/diy-llm.git) and install base dependencies. Prerequisites include proficient Python, PyTorch, deep learning fundamentals, and math basics. GPU programming (CUDA) is beneficial; tutorials are included. Full training requires GPU resources; cloud platforms are recommended. Online documentation: https://datawhalechina.github.io/diy-llm/.

Highlighted Details

  • Comprehensive curriculum: data engineering, tokenizers, Transformer/MoE, GPU programming (CUDA/Triton), distributed training, scaling laws, inference, and alignment (SFT/RLHF/GRPO).
  • Six hands-on assignments: implementing a minimal LLM, system optimization (FlashAttention-2), distributed training, data processing, model alignment (SFT/GRPO), and multi-dimensional evaluation.
  • Localized content for Chinese learners, featuring Chinese explanations, code examples, and references to domestic models (Qwen, DeepSeek).
  • Advanced topics: GPU high-performance programming with Triton, reinforcement learning for alignment.
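To give a flavor of the Transformer-focused assignments, here is scaled dot-product attention in plain NumPy. This is a generic textbook sketch under my own naming, not code taken from the course repository:

```python
# Illustrative scaled dot-product attention (generic sketch, not diy-llm code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query positions, head dimension 8
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = attention(q, k, v)      # shape (4, 8): one output row per query
```

The FlashAttention-2 assignment reworks exactly this computation into a tiled, memory-efficient kernel, so understanding the naive version first is the point of the progression.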

Maintenance & Community

Led by Datawhale members, the project welcomes community contributions via GitHub Issues for bug reports, code, and documentation improvements. Specific chat links are not provided.

Licensing & Compatibility

Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). This license strictly prohibits commercial use, posing a significant limitation for adoption in commercial products.

Limitations & Caveats

Complete LLM training requires substantial GPU resources, making it impractical on CPU-only setups. The CC BY-NC-SA 4.0 license's non-commercial clause is a critical adoption blocker for commercial entities. Some documentation sections are marked as "待完善" (to be improved) or "更新中" (updating).

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 2
  • Star History: 410 stars in the last 30 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 19 more.

llm-course by mlabonne

LLM course with roadmaps and notebooks
Top 0.5% on SourcePulse · 78k stars
Created 2 years ago · Updated 2 months ago