C/CUDA implementation of the Llama 3 model
This repository provides a pure C/CUDA implementation of the Llama 3 model, targeting researchers and developers seeking high-performance inference without complex dependencies. It offers a significant speedup over CPU-based implementations, enabling faster experimentation and deployment of Llama 3.
How It Works
The implementation uses pure C and CUDA for maximum performance with minimal dependencies. It builds on the foundational work of llama2.c for the model and tokenizer logic, and incorporates CUDA kernels from rogerallen and ankan-ban. The design favors a single-file, dependency-free structure for easy compilation and use, with careful attention to keeping floating-point error low enough to match the results of the reference NumPy implementation.
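Because the build is dependency-free, the matrix multiplications are presumably hand-written CUDA kernels rather than cuBLAS calls. The following is a minimal sketch of that style of GEMV kernel (one thread block per output row, with a shared-memory reduction); the kernel name, data layout, and launch configuration are illustrative assumptions, not the repository's exact code.

#include <cuda_runtime.h>

// Sketch: out = W @ x, where W is (d x n) row-major and x has n elements.
// One block computes one output row; threads stride across the row and the
// partial sums are combined with a shared-memory tree reduction.
__global__ void matmul_kernel(float *out, const float *x, const float *W,
                              int n, int d) {
    int row = blockIdx.x;
    if (row >= d) return;

    extern __shared__ float partial[];
    float sum = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        sum += W[row * n + j] * x[j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}

// Example launch: one block per output row, 256 threads per block.
// matmul_kernel<<<d, 256, 256 * sizeof(float)>>>(out, x, W, n, d);

Note that the reduction order in such a kernel differs from a sequential NumPy dot product, which is one reason matching the NumPy reference requires the floating-point care mentioned above.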
Quick Start & Requirements
make
./runcuda "I have a dream"
Highlighted Details
Maintenance & Community
The project is maintained by Sang Park. Planned work includes ROCm and oneAPI support and a correct Flash Attention implementation.
Licensing & Compatibility
Limitations & Caveats
The tokenizer implementation required "messy monkey patching" for compatibility and is slated for future cleanup. Multi-Head Attention is currently handled by a single kernel built from GEMV operations, which is somewhat inefficient compared to a batched GEMM; a rough sketch of the difference follows.
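As an illustration of that GEMV pattern, the sketch below scores one head at a time against the key cache during single-token decoding; the names, memory layout, and launch shape are assumptions for illustration, not the repository's actual kernel. A GEMM-based formulation would instead batch all heads into one larger multiplication, giving the GPU more work per launch and better data reuse.

// Sketch: per-head attention scores as head-by-head matrix-vector products
// over the key cache (the GEMV pattern noted above).
//   scores  : (n_heads, seq_len)
//   q       : (n_heads, head_dim)            current token's query
//   k_cache : (seq_len, n_heads, head_dim)   cached keys
__global__ void attn_scores_per_head(float *scores, const float *q,
                                     const float *k_cache,
                                     int n_heads, int head_dim, int seq_len) {
    int h = blockIdx.x;                             // head index
    int t = blockIdx.y * blockDim.x + threadIdx.x;  // one thread per cached timestep
    if (h >= n_heads || t >= seq_len) return;

    float dot = 0.0f;
    for (int i = 0; i < head_dim; i++)
        dot += q[h * head_dim + i] * k_cache[(t * n_heads + h) * head_dim + i];
    scores[h * seq_len + t] = dot / sqrtf((float)head_dim);
}

// Example launch: dim3 grid(n_heads, (seq_len + 255) / 256);
// attn_scores_per_head<<<grid, 256>>>(scores, q, k_cache, n_heads, head_dim, seq_len);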