llama3.cuda  by likejazz

C/CUDA implementation for Llama 3 model

created 1 year ago
338 stars

Project Summary

This repository provides a pure C/CUDA implementation of the Llama 3 model for researchers and developers who want high-performance inference without complex dependencies. Running on the GPU makes it substantially faster than CPU-based implementations, which shortens both experimentation and deployment of Llama 3.

How It Works

The implementation uses pure C and CUDA for maximum performance and minimal dependencies. It builds on llama2.c for the model and tokenizer logic and incorporates CUDA kernels from rogerallen and ankan-ban. The design prioritizes a single-file, dependency-free structure that is easy to compile and use, and it takes care to reduce floating-point error so that its output matches the results of the NumPy implementation.
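The core operation in a llama2.c-style forward pass is a matrix-vector product (GEMV). The following is a minimal CPU reference sketch of that operation, not code taken from this repository; the name `matmul_ref` is hypothetical, and the row-major layout (d rows, n columns) is an assumption modeled on the `matmul` helper in llama2.c.

```c
#include <assert.h>
#include <stddef.h>

/* Matrix-vector product (GEMV): out = W * x.
 * W is row-major with d rows and n columns, x has n elements, out has d.
 * Hypothetical reference helper; the layout is an assumption, and a CUDA
 * version would distribute the rows across threads instead of looping. */
void matmul_ref(float *out, const float *x, const float *w, int n, int d) {
    assert(n > 0 && d > 0);
    for (int i = 0; i < d; i++) {
        float acc = 0.0f;  /* accumulator for output row i */
        for (int j = 0; j < n; j++) {
            acc += w[(size_t)i * n + j] * x[j];
        }
        out[i] = acc;
    }
}
```

A plain loop like this is a convenient baseline for checking GPU kernel output, since the order of floating-point additions is fixed and easy to reason about.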

Quick Start & Requirements

  • Build: make
  • Requirements: NVIDIA GPU with CUDA support.
  • Demo: ./runcuda "I have a dream"

Highlighted Details

  • Achieves 2,823 tokens/s on an NVIDIA GeForce RTX 4080 SUPER, about 85x faster than the NumPy implementation running on an M2 MacBook Air.
  • Single-file, dependency-free C/CUDA implementation with Makefile and CMake support.
  • Aims to match the results of the NumPy implementation, with a floating-point error rate below 0.5%.
  • Includes a UTF-8 tokenizer implementation.
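One way to check an error target like the one above is to compare an output buffer against reference values element by element. The sketch below assumes the 0.5% figure refers to relative error per element, which the source does not specify; `max_rel_error` is a hypothetical helper, not part of the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Largest relative error |got - ref| / |ref| over n elements.
 * Hypothetical helper: interpreting "error rate" as per-element relative
 * error is an assumption. Reference values near zero are clamped to eps
 * to avoid division by zero, so they can dominate the result. */
double max_rel_error(const float *ref, const float *got, size_t n) {
    assert(ref != NULL && got != NULL);
    const double eps = 1e-12;
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = (double)ref[i];
        double denom = abs_d(r) > eps ? abs_d(r) : eps;
        double err = abs_d((double)got[i] - r) / denom;
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

With this helper, a run would meet the target under the stated assumption when `max_rel_error(ref, got, n) < 0.005`.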

Maintenance & Community

The project is maintained by Sang Park. Planned work includes ROCm and oneAPI support and a correct implementation of Flash Attention.

Licensing & Compatibility

  • License: MIT
  • Compatibility: The MIT license permits commercial use and integration into closed-source projects.

Limitations & Caveats

The tokenizer needed what the author calls "messy monkey patching" for compatibility, and a cleanup is planned. Multi-Head Attention currently runs in a single kernel built on GEMV (matrix-vector) operations, which the author notes is somewhat inefficient compared to GEMM (matrix-matrix).
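The difference between the two forms can be sketched on the attention score step alone. The following CPU sketch is illustrative and not taken from the repository: the function names, the row-major layouts, and the caller-supplied `scale` (typically 1/sqrt(head_size)) are all assumptions, and the softmax and value-weighting steps are left out.

```c
#include <assert.h>
#include <stddef.h>

/* GEMV form: scores for ONE query vector against T cached keys.
 * q: [head_size], k: [T][head_size] row-major, scores: [T].
 * Hypothetical helper; scale is supplied by the caller. */
void attn_scores_gemv(float *scores, const float *q, const float *k,
                      int T, int head_size, float scale) {
    assert(T > 0 && head_size > 0);
    for (int t = 0; t < T; t++) {
        float dot = 0.0f;
        for (int i = 0; i < head_size; i++) {
            dot += q[i] * k[(size_t)t * head_size + i];
        }
        scores[t] = dot * scale;
    }
}

/* GEMM form: scores for B query vectors at once.
 * q: [B][head_size], scores: [B][T]. Each key row is visited once and
 * reused across all B queries, instead of being re-read by B separate
 * GEMV calls. */
void attn_scores_gemm(float *scores, const float *q, const float *k,
                      int B, int T, int head_size, float scale) {
    assert(B > 0 && T > 0 && head_size > 0);
    for (int t = 0; t < T; t++) {
        const float *key = &k[(size_t)t * head_size];
        for (int b = 0; b < B; b++) {
            float dot = 0.0f;
            for (int i = 0; i < head_size; i++) {
                dot += q[(size_t)b * head_size + i] * key[i];
            }
            scores[(size_t)b * T + t] = dot * scale;
        }
    }
}
```

Both functions compute the same dot products in the same order, so calling the GEMV form once per query gives the same scores as one GEMM call; the GEMM form only changes how often each key row is touched.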

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 9 stars in the last 90 days
