LLM training in pure C/CUDA, no PyTorch needed
This repository provides LLM training and inference capabilities implemented in pure C and CUDA, aiming for simplicity and performance without heavy dependencies like PyTorch or Python. It's designed for developers and researchers interested in understanding LLM internals, optimizing performance, or porting LLM functionality to environments where Python is not feasible. The project focuses on reproducing GPT-2 and GPT-3 models, offering a faster alternative to PyTorch for certain operations.
How It Works
The project leverages raw C and CUDA for maximum control and performance. It implements core LLM components, such as the attention mechanism and feed-forward networks, directly as C functions and CUDA kernels. This approach allows fine-grained optimization by bypassing higher-level abstractions, and the project reports training speeds that exceed PyTorch Nightly for its GPT-2 reproduction. The codebase is structured to be educational, with well-documented kernels in the dev/cuda directory ranging from simple to complex implementations.
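To give a flavor of that style, the sketch below shows a plain-C elementwise GELU forward pass in the spirit of the activation code in train_gpt2.c and the kernels under dev/cuda; the function name and the hard-coded sqrt(2/pi) constant are illustrative rather than the repository's exact code.

#include <math.h>

// Elementwise GELU (tanh approximation) over N floats, as used in MLP blocks.
// Illustrative sketch only; 0.7978845608f approximates sqrtf(2.0f / M_PI).
void gelu_forward(float* out, const float* inp, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}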
Quick Start & Requirements
CPU: chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2 && OMP_NUM_THREADS=8 ./train_gpt2
GPU (CUDA): chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2cu && ./train_gpt2cu
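On machines with multiple GPUs, the CUDA build is reported to support data-parallel training launched through MPI; a typical invocation looks like the line below, where the process count of 8 is illustrative and should match the number of local GPUs.
mpirun -np 8 ./train_gpt2cu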
Highlighted Details
Maintenance & Community
The project is actively maintained by Andrej Karpathy. Developer coordination occurs in GitHub Discussions and on Discord. A variety of community ports to different languages and hardware are listed as notable forks.
Licensing & Compatibility
Limitations & Caveats
The cuDNN-accelerated Flash Attention path is new, disabled by default, and significantly increases compile times. Multi-node training setup can be complex depending on the environment (e.g., Slurm configuration for PMIx support). CPU training is primarily for demonstration due to performance limitations.
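If the cuDNN flash-attention path is wanted despite the longer builds, it is opted into at compile time via a make flag; the flag name below is an assumption based on the repository's build setup and should be checked against the current Makefile.
make train_gpt2cu USE_CUDNN=1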