llama2.mojo by tairov

Mojo code for Llama 2 inference

created 1 year ago · 2,115 stars · Top 21.7% on sourcepulse

Project Summary

This repository provides a single-file Mojo implementation of Llama 2 inference, targeting developers and researchers interested in high-performance, hardware-accelerated AI. It aims to significantly outperform both Python and C implementations by leveraging Mojo features such as SIMD vectorization and built-in parallelization.

How It Works

The implementation translates the Llama 2 architecture into pure Mojo code, directly exploiting Mojo's low-level hardware optimization capabilities. This allows aggressive SIMD vectorization and efficient multi-threading, avoiding the interpreter overhead of Python and, for certain operations, even beating optimized C/C++ implementations such as llama.cpp. The project reports a 250x speedup over its Python counterpart and a 30% improvement over llama2.c on multi-threaded inference.
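
For illustration, the sketch below shows the vectorized dot-product pattern such a kernel typically builds on, written against the Mojo 24.3-era API; the function and variable names are illustrative assumptions, not the repository's actual code.

    from algorithm import vectorize
    from sys.info import simdwidthof

    # Number of float32 lanes in one SIMD register on the host CPU.
    alias nelts = simdwidthof[DType.float32]()

    # Illustrative sketch (Mojo 24.3-era API, not the repo's code):
    # a SIMD-vectorized dot product like the one inside a matmul inner loop.
    fn dot(x: DTypePointer[DType.float32],
           w: DTypePointer[DType.float32], n: Int) -> Float32:
        var acc = SIMD[DType.float32, nelts](0)
        var tail = Float32(0)

        @parameter
        fn body[width: Int](j: Int):
            @parameter
            if width == nelts:
                # Full-width lanes: accumulate element-wise products.
                acc += x.load[width=nelts](j) * w.load[width=nelts](j)
            else:
                # Remainder shorter than one register: fold into a scalar.
                tail += (x.load[width=width](j) * w.load[width=width](j)).reduce_add()

        vectorize[body, nelts](n)
        return acc.reduce_add() + tail

The multi-threaded numbers come from additionally parallelizing such loops across matrix rows (e.g., with Mojo's algorithm.parallelize).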

Quick Start & Requirements

  • Install the Mojo SDK (version 24.3 or later) or use the Mojo Playground.
  • Clone the repository: git clone https://github.com/tairov/llama2.mojo.git
  • Navigate to the directory: cd llama2.mojo
  • Download model weights (e.g., wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin).
  • Run inference: mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language" (the flags are annotated after this list).
  • Docker: docker build --build-arg AUTH_KEY=<your-modular-auth-key> -t llama2.mojo ., then docker run -it llama2.mojo.
  • Gradio UI: uncomment CMD ["python", "gradio_app.py"] in the Dockerfile, then run docker run -it -p 0.0.0.0:7860:7860 llama2.mojo.
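
The inference flags are not spelled out above; the annotation below is a best-effort reading based on llama2.c's CLI conventions, which this port follows, so treat it as an assumption and check the repository README:

    # Assumed llama2.c-style flags (verify against the repo README):
    #   -s  random seed
    #   -n  number of tokens to generate (steps)
    #   -t  sampling temperature (lower = more deterministic)
    #   -i  input prompt
    mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"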

Highlighted Details

  • Achieves 1025 tok/s on stories15M.bin with 6 threads on an M1 Max, outperforming llama2.c (730 tok/s) and llama.cpp (890 tok/s).
  • Demonstrates significant speedups over Python (250x) and llama2.c (30% faster multi-threaded inference).
  • Supports Karpathy's tinyllamas "stories" checkpoints (260K, 15M, 42M, and 110M parameters) as well as Tinyllama-1.1B-Chat-v0.2.
  • Includes a Hugging Face Space demo.

Maintenance & Community

The project is maintained by Aydyn Tairov. It encourages academic research and community contributions.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is focused on inference and does not include training capabilities. Performance benchmarks were taken primarily on Apple Silicon (M1 Max) and a few specific CPU configurations; performance on other platforms is not extensively documented.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
