diy-llm by datawhalechina

LLM construction course for hands-on system building

Created 4 months ago
463 stars

Top 65.3% on SourcePulse

Project Summary

Summary

This project offers a systematic, code-driven curriculum for building Large Language Models (LLMs), specifically tailored for Chinese learners. It bridges theoretical understanding with practical implementation, enabling users to construct their own LLMs from scratch. The course provides valuable engineering experience and a strong foundation for LLM research and development, targeting individuals with existing Python and deep learning knowledge.

How It Works

Adapting Stanford's CS336, this project reconstructs the LLM knowledge system for Chinese speakers with a hands-on coding focus. It breaks down LLM construction into six progressive assignments, covering core components from tokenization and Transformer architectures to distributed training, inference, and alignment. The approach emphasizes "thinking with code" and provides practical, localized solutions relevant to the Chinese tech ecosystem.
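As a flavor of the "thinking with code" approach, the tokenization stage the course starts from boils down to repeatedly merging the most frequent adjacent token pair (byte-pair encoding). The sketch below is generic and illustrative; the function names are assumptions, not code from the diy-llm repository:

```python
# Illustrative BPE merge step (hypothetical helper names, not from diy-llm).
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababcab")
pair = most_frequent_pair(tokens)   # ('a', 'b') is the most common pair
tokens = merge_pair(tokens, pair)   # ['ab', 'ab', 'ab', 'c', 'ab']
```

A full tokenizer assignment would iterate this merge loop to build a vocabulary, but the core data-structure work is just this pair-counting step.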

Quick Start & Requirements

Clone the repository (git clone https://github.com/datawhalechina/diy-llm.git) and install base dependencies. Prerequisites include proficient Python, PyTorch, deep learning fundamentals, and math basics. GPU programming (CUDA) is beneficial; tutorials are included. Full training requires GPU resources; cloud platforms are recommended. Online documentation: https://datawhalechina.github.io/diy-llm/.

Highlighted Details

  • Comprehensive curriculum: data engineering, tokenizers, Transformer/MoE, GPU programming (CUDA/Triton), distributed training, scaling laws, inference, and alignment (SFT/RLHF/GRPO).
  • Six hands-on assignments: implementing a minimal LLM, system optimization (FlashAttention-2), distributed training, data processing, model alignment (SFT/GRPO), and multi-dimensional evaluation.
  • Localized content for Chinese learners, featuring Chinese explanations, code examples, and references to domestic models (Qwen, DeepSeek).
  • Advanced topics: GPU high-performance programming with Triton, reinforcement learning for alignment.
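To give a flavor of the Transformer-focused assignments, here is scaled dot-product attention in plain NumPy. This is a generic textbook sketch under my own naming, not code taken from the course repository:

```python
# Illustrative scaled dot-product attention (generic sketch, not diy-llm code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query positions, head dimension 8
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = attention(q, k, v)      # shape (4, 8): one output row per query
```

The FlashAttention-2 assignment reworks exactly this computation into a tiled, memory-efficient kernel, so understanding the naive version first is the point of the progression.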

Maintenance & Community

Led by Datawhale members, the project welcomes community contributions via GitHub Issues for bug reports, code, and documentation improvements. Specific chat links are not provided.

Licensing & Compatibility

Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). This license strictly prohibits commercial use, posing a significant limitation for adoption in commercial products.

Limitations & Caveats

Complete LLM training requires substantial GPU resources, making it impractical on CPU-only setups. The CC BY-NC-SA 4.0 license's non-commercial clause is a critical adoption blocker for commercial entities. Some documentation sections are marked as "待完善" (to be improved) or "更新中" (updating).

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 2
  • Star History: 410 stars in the last 30 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 19 more.

llm-course by mlabonne

LLM course with roadmaps and notebooks
Top 0.5% on SourcePulse · 78k stars
Created 2 years ago · Updated 2 months ago