Mojo code for Llama 2 inference
This repository provides a single-file Mojo implementation of Llama 2 inference, targeting developers and researchers interested in high-performance, hardware-accelerated AI. It aims to significantly outperform Python and C implementations by leveraging Mojo features such as SIMD vectorization and multi-threading.
How It Works
The implementation translates the Llama 2 architecture into pure Mojo code, directly utilizing Mojo's low-level hardware optimization capabilities. This approach allows for aggressive SIMD vectorization and efficient multi-threading, bypassing the overhead typically associated with Python or even with optimized C libraries like llama.cpp for certain operations. The project highlights a 250x performance boost over its Python counterpart and a 30% improvement over llama2.c on multi-threaded inference.
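The core of that speedup is Mojo's vectorize-then-parallelize pattern for hot loops. Below is a minimal sketch of a row-parallel matmul with a SIMD-vectorized inner dot product in that style; it is not a verbatim excerpt from the repository and is written against the 2023-era Mojo 0.x API (DTypePointer, simd_load, vectorize[width, fn](size)), so names and signatures may differ in newer Mojo releases.

```mojo
# Sketch only: row-parallel matmul with a SIMD-vectorized inner dot product,
# in the style of llama2.mojo's hot loop (2023-era Mojo API; not the repo's exact code).
from algorithm import vectorize, parallelize
from memory.unsafe import DTypePointer
from sys.info import simdwidthof

alias nelts = simdwidthof[DType.float32]()  # float32 SIMD lanes on this CPU

fn matmul(C: DTypePointer[DType.float32], A: DTypePointer[DType.float32],
          B: DTypePointer[DType.float32], rows: Int, cols: Int):
    # Each output row is an independent dot product, so rows are spread across threads.
    @parameter
    fn compute_row(i: Int):
        var acc: Float32 = 0

        # vectorize tiles the inner loop into SIMD chunks; `width` shrinks for the tail.
        @parameter
        fn dot[width: Int](j: Int):
            acc += (A.simd_load[width](j) * B.simd_load[width](i * cols + j)).reduce_add()

        vectorize[nelts, dot](cols)
        C.store(i, acc)

    parallelize[compute_row](rows)
```

Matrix-vector products dominate Llama-style inference, so loops of this shape are where SIMD width and thread count translate most directly into tokens per second.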
Quick Start & Requirements
With the Mojo SDK installed:
git clone https://github.com/tairov/llama2.mojo.git
cd llama2.mojo
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
Alternatively, build and run with Docker:
docker build --build-arg AUTH_KEY=<your-modular-auth-key> -t llama2.mojo .
docker run -it llama2.mojo
To serve the Gradio web UI instead, set CMD ["python", "gradio_app.py"] in the Dockerfile and run docker run -it -p 0.0.0.0:7860:7860 llama2.mojo.
Highlighted Details
Benchmark: stories15M.bin with 6 threads on an M1 Max, outperforming llama2.c (730 tok/s) and llama.cpp (890 tok/s).
Roughly 30% faster than llama2.c on multi-threaded inference.
Supported models: the stories checkpoints (260K, 15M, 42M, 110M) and Tinyllama-1.1B-Chat-v0.2.
Maintenance & Community
The project is maintained by Aydyn Tairov. It encourages academic research and community contributions.
Licensing & Compatibility
Limitations & Caveats
The project is focused on inference and does not include training capabilities. Performance benchmarks are primarily on Apple Silicon (M1 Max) and specific CPU configurations, with broader platform performance not extensively detailed.