llm.c by karpathy

LLM training in pure C/CUDA, no PyTorch needed

created 1 year ago
27,261 stars

Top 1.4% on sourcepulse

Project Summary

This repository provides LLM training and inference implemented in pure C and CUDA, aiming for simplicity and performance without heavy dependencies such as PyTorch or Python. It is intended for developers and researchers who want to understand LLM internals, optimize performance, or port LLM functionality to environments where Python is not practical. The project focuses on reproducing GPT-2 and GPT-3 models and reports training speeds slightly ahead of PyTorch Nightly.

How It Works

The project uses raw C and CUDA for maximum control and performance, implementing core LLM components such as the attention mechanism and feed-forward network directly as C functions and CUDA kernels. Bypassing higher-level framework abstractions allows fine-grained optimization, which is the basis for the project's claim of training up to ~7% faster than PyTorch Nightly. The codebase is also structured to be educational: the dev/cuda directory contains well-documented kernels for each operation, ranging from simple, naive versions to more complex, optimized implementations.
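As a rough illustration of the style of these kernels (a minimal sketch, not code copied from the repository; the kernel and variable names are illustrative), a naive elementwise GELU forward pass, the kind of baseline kernel dev/cuda starts from before optimizing, could look like:

    #include <cuda_runtime.h>
    #include <math.h>

    // Naive GELU forward: one thread per element (illustrative sketch).
    __global__ void gelu_forward_kernel(float* out, const float* inp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = inp[i];
            float cube = 0.044715f * x * x * x;
            // tanh approximation of GELU, as used in GPT-2's MLP block
            out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f /* sqrt(2/pi) */ * (x + cube)));
        }
    }

    // Launch with one thread per element, e.g.:
    //   int block_size = 256;
    //   int grid_size = (n + block_size - 1) / block_size;
    //   gelu_forward_kernel<<<grid_size, block_size>>>(d_out, d_inp, n);

Later kernel versions in the directory typically improve on such naive baselines with techniques like vectorized memory access and kernel fusion.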

Quick Start & Requirements

  • CPU Training: chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2 && OMP_NUM_THREADS=8 ./train_gpt2
  • GPU Training (FP32): chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2cu && ./train_gpt2cu
  • Prerequisites: CUDA Toolkit (for GPU), cuDNN (optional, for Flash Attention), MPI and NCCL (for multi-GPU/multi-node). Python is used for data preparation scripts.
  • Resources: Requires a CUDA-enabled GPU for GPU training. CPU training is possible but significantly slower.
  • Links: Quick Start Guide, Discussions, Discord (#llmc channel on Zero to Hero, #llmdotc on GPU MODE).

Highlighted Details

  • Trains up to ~7% faster than PyTorch Nightly.
  • Includes a simple CPU fp32 reference implementation in ~1,000 lines of C.
  • Supports multi-GPU and multi-node training via MPI and NCCL (with filesystem- or TCP-based NCCL initialization as alternatives to MPI).
  • Offers a tutorial on implementing a LayerNorm layer in C; a rough sketch of such a layer follows this list.
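As a minimal sketch of what the LayerNorm tutorial covers (assuming the usual (B, T, C) activation layout; this is illustrative, not the tutorial's exact code), a LayerNorm forward pass in plain C looks roughly like:

    #include <math.h>

    // Illustrative LayerNorm forward pass (a sketch, not the repository's exact code).
    // inp and out have shape (B, T, C); weight and bias have shape (C).
    // Each (b, t) position is normalized independently over its C channels.
    void layernorm_forward(float* out, const float* inp,
                           const float* weight, const float* bias,
                           int B, int T, int C) {
        const float eps = 1e-5f;
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                const float* x = inp + (b * T + t) * C;
                float* o = out + (b * T + t) * C;
                // mean over the channel dimension
                float mean = 0.0f;
                for (int i = 0; i < C; i++) { mean += x[i]; }
                mean /= C;
                // variance over the channel dimension
                float var = 0.0f;
                for (int i = 0; i < C; i++) {
                    float d = x[i] - mean;
                    var += d * d;
                }
                var /= C;
                float rstd = 1.0f / sqrtf(var + eps);
                // normalize, then scale and shift
                for (int i = 0; i < C; i++) {
                    o[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
                }
            }
        }
    }

Keeping the loops this explicit is the same style that makes the ~1,000-line CPU fp32 reference readable.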

Maintenance & Community

The project is actively maintained by Andrej Karpathy. Developer coordination occurs in GitHub Discussions and on Discord. A variety of community ports to different languages and hardware are listed as notable forks.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The cuDNN-accelerated Flash Attention path is new, disabled by default, and significantly increases compile times. Multi-node training setup can be complex depending on the environment (e.g., Slurm configuration for PMIx support). CPU training is primarily for demonstration due to performance limitations.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 871 stars in the last 90 days
Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine
Top 0.1% on sourcepulse · 839 stars
PyTorch-native framework for LLM training
created 1 year ago · updated 3 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
Top 1.0% on sourcepulse · 402 stars
Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago