llama2.mojo by tairov

Mojo code for Llama 2 inference

Created 2 years ago
2,118 stars

Top 21.2% on SourcePulse

Project Summary

This repository provides a single-file Mojo implementation of Llama 2 inference, aimed at developers and researchers interested in high-performance, hardware-accelerated AI. It seeks to outperform both Python and C implementations by leveraging Mojo features such as SIMD vectorization and parallelism.

How It Works

The implementation translates the Llama 2 architecture into pure Mojo, using the language's low-level control over hardware to apply aggressive SIMD vectorization and multi-threading. This avoids Python's interpreter overhead entirely and, for some operations, matches or beats optimized C/C++ implementations such as llama2.c and llama.cpp. The project reports a 250x speedup over its Python counterpart and roughly 30% faster multi-threaded inference than llama2.c.
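
As a hedged illustration of that pattern, the sketch below shows a matrix-vector multiply in the spirit of the repository's kernels: rows are distributed across threads with parallelize, and the inner dot product is SIMD-vectorized with vectorize. It assumes the Mojo 24.3-era standard library (DTypePointer, simdwidthof, vectorize, parallelize); later SDK releases renamed several of these, and the repository's actual kernel differs in its details.

    from algorithm import parallelize, vectorize
    from sys.info import simdwidthof

    # Number of float32 lanes the target CPU handles per SIMD instruction.
    alias nelts = simdwidthof[DType.float32]()

    # C[i] = dot(A[i, :], x) -- the workhorse of transformer inference.
    # Rows are split across worker threads; each dot product is vectorized.
    fn matmul(C: DTypePointer[DType.float32], A: DTypePointer[DType.float32],
              x: DTypePointer[DType.float32], rows: Int, cols: Int, workers: Int):
        @parameter
        fn compute_row(i: Int):
            var acc = SIMD[DType.float32, nelts](0)

            @parameter
            fn dot[width: Int](j: Int):
                @parameter
                if width == nelts:
                    acc += A.load[width=nelts](i * cols + j) * x.load[width=nelts](j)
                else:
                    # Narrower tail when cols is not a multiple of nelts.
                    acc[0] += (A.load[width=width](i * cols + j)
                               * x.load[width=width](j)).reduce_add()

            vectorize[dot, nelts](cols)
            C.store(i, acc.reduce_add())

        parallelize[compute_row](rows, workers)

Since matrix-vector products dominate transformer inference time, this one pattern (parallelize across rows, vectorize within a row) is where most of the optimization effort pays off.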

Quick Start & Requirements

  • Install Mojo SDK (version 24.3 or later) or use the Mojo Playground.
  • Clone the repository: git clone https://github.com/tairov/llama2.mojo.git
  • Navigate to the directory: cd llama2.mojo
  • Download model weights (e.g., wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin).
  • Run inference: mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language" (-s random seed, -n number of steps, -t sampling temperature, -i input prompt; see the temperature sketch after this list).
  • Docker: docker build --build-arg AUTH_KEY=<your-modular-auth-key> -t llama2.mojo . and docker run -it llama2.mojo
  • Gradio UI: Uncomment CMD ["python", "gradio_app.py"] in Dockerfile and run docker run -it -p 0.0.0.0:7860:7860 llama2.mojo.
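
For context on the flags above, the -t value is the sampling temperature. The sketch below shows the standard temperature-plus-softmax step applied to a logits vector; it is illustrative only, again assumes the Mojo 24.3-era standard library, and is not the repository's actual sampler, which works on its own buffer types.

    from collections import List
    from math import exp

    # Convert raw logits to probabilities after temperature scaling:
    # t < 1 sharpens the distribution (more deterministic output),
    # t > 1 flattens it (more diverse output). Illustrative sketch only.
    fn softmax_with_temperature(inout logits: List[Float32], temperature: Float32):
        for i in range(len(logits)):
            logits[i] = logits[i] / temperature

        # Subtract the max before exponentiating, for numerical stability.
        var max_val = logits[0]
        for i in range(1, len(logits)):
            if logits[i] > max_val:
                max_val = logits[i]

        var total: Float32 = 0
        for i in range(len(logits)):
            logits[i] = exp(logits[i] - max_val)
            total += logits[i]

        for i in range(len(logits)):
            logits[i] = logits[i] / total

With -t 0.5 as in the example command, logits are doubled before the softmax, concentrating probability mass on the most likely tokens.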

Highlighted Details

  • Achieves 1025 tok/s on stories15M.bin with 6 threads on an M1 Max, outperforming llama2.c (730 tok/s) and llama.cpp (890 tok/s).
  • Demonstrates significant speedups over Python (250x) and llama2.c (30% faster multi-threaded inference).
  • Supports the TinyStories checkpoints (stories260K, stories15M, stories42M, stories110M) and Tinyllama-1.1B-Chat-v0.2.
  • Includes a Hugging Face Space demo.

Maintenance & Community

The project is maintained by Aydyn Tairov. It encourages academic research and community contributions.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is focused on inference and does not include training capabilities. Performance benchmarks are primarily on Apple Silicon (M1 Max) and specific CPU configurations, with broader platform performance not extensively detailed.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Vincent Weisser (Cofounder of Prime Intellect), and 25 more.

alpaca-lora by tloen

19k stars · Top 0.0% on SourcePulse
LoRA fine-tuning for LLaMA
Created 2 years ago · Updated 1 year ago
Starred by Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), Zhiqiang Xie (Coauthor of SGLang), and 40 more.

llama by meta-llama

59k stars · Top 0.1% on SourcePulse
Inference code for Llama 2 models (deprecated)
Created 2 years ago · Updated 7 months ago