Mojo code for Llama 2 inference
This repository provides a single-file Mojo implementation of Llama 2 inference, targeting developers and researchers interested in high-performance, hardware-accelerated AI. It aims to significantly outperform Python and C implementations by leveraging Mojo features such as SIMD vectorization and multi-threading.
How It Works
The implementation translates the Llama 2 architecture into pure Mojo code, directly utilizing Mojo's low-level hardware optimization capabilities. This approach allows for aggressive SIMD vectorization and efficient multi-threading, bypassing the overhead typically associated with Python or even with optimized C libraries like llama.cpp for certain operations. The project highlights a 250x performance boost over its Python counterpart and a 30% improvement over llama2.c on multi-threaded inference.
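The core of that speedup is Mojo's vectorize-then-parallelize pattern for hot loops. Below is a minimal sketch of a row-parallel matmul with a SIMD-vectorized inner dot product in that style; it is not a verbatim excerpt from the repository and is written against the 2023-era Mojo 0.x API (DTypePointer, simd_load, vectorize[width, fn](size)), so names and signatures may differ in newer Mojo releases.

```mojo
# Sketch only: row-parallel matmul with a SIMD-vectorized inner dot product,
# in the style of llama2.mojo's hot loop (2023-era Mojo API; not the repo's exact code).
from algorithm import vectorize, parallelize
from memory.unsafe import DTypePointer
from sys.info import simdwidthof

alias nelts = simdwidthof[DType.float32]()  # float32 SIMD lanes on this CPU

fn matmul(C: DTypePointer[DType.float32], A: DTypePointer[DType.float32],
          B: DTypePointer[DType.float32], rows: Int, cols: Int):
    # Each output row is an independent dot product, so rows are spread across threads.
    @parameter
    fn compute_row(i: Int):
        var acc: Float32 = 0

        # vectorize tiles the inner loop into SIMD chunks; `width` shrinks for the tail.
        @parameter
        fn dot[width: Int](j: Int):
            acc += (A.simd_load[width](j) * B.simd_load[width](i * cols + j)).reduce_add()

        vectorize[nelts, dot](cols)
        C.store(i, acc)

    parallelize[compute_row](rows)
```

Matrix-vector products dominate Llama-style inference, so loops of this shape are where SIMD width and thread count translate most directly into tokens per second.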
Quick Start & Requirements
With the Mojo SDK installed:
git clone https://github.com/tairov/llama2.mojo.git
cd llama2.mojo
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
Alternatively, build and run with Docker:
docker build --build-arg AUTH_KEY=<your-modular-auth-key> -t llama2.mojo .
docker run -it llama2.mojo
To serve the Gradio web UI instead, set CMD ["python", "gradio_app.py"] in the Dockerfile and run docker run -it -p 0.0.0.0:7860:7860 llama2.mojo.
Highlighted Details
Benchmark: stories15M.bin with 6 threads on an M1 Max, outperforming llama2.c (730 tok/s) and llama.cpp (890 tok/s).
Roughly 30% faster than llama2.c on multi-threaded inference.
Supported models: the stories checkpoints (260K, 15M, 42M, 110M) and Tinyllama-1.1B-Chat-v0.2.
Maintenance & Community
The project is maintained by Aydyn Tairov. It encourages academic research and community contributions.
Licensing & Compatibility
Limitations & Caveats
The project is focused on inference and does not include training capabilities. Performance benchmarks are primarily on Apple Silicon (M1 Max) and specific CPU configurations, with broader platform performance not extensively detailed.