LLM inference on Apple Silicon GPUs
This repository enables inference of Meta's LLaMA and Stanford's Alpaca large language models on Apple Silicon GPUs using the Metal Performance Shaders (MPS) backend. It targets developers and researchers with Apple hardware seeking to run these models locally, offering a Python-based solution for efficient on-device execution.
How It Works
The project leverages PyTorch's MPS backend to offload computation to Apple's integrated GPUs. It includes scripts for resharding the larger model weights (13B, 30B, 65B) into a single file suitable for single-GPU inference. The core inference is handled by chat.py, which supports both raw LLaMA completion and instruction-following via Alpaca weights.
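A minimal sketch of how the MPS offload works in PyTorch (illustrative only; the linear layer below stands in for the LLaMA model, and the actual code in chat.py may differ):

import torch
import torch.nn as nn

# Pick the Metal Performance Shaders (MPS) device when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# A stand-in module: any nn.Module (including a full transformer) is offloaded the same way.
model = nn.Linear(4096, 32000).to(device)

# Inputs must live on the same device as the model's weights.
hidden = torch.randn(1, 4096, device=device)

with torch.no_grad():
    logits = model(hidden)  # the matmul executes on the Apple GPU via MPS

print(logits.shape, logits.device)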
Quick Start & Requirements
Install the dependencies with pip3 install -r requirements.txt and pip3 install -e ., then place the downloaded model weights in the models directory.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
An alternative is llama.cpp, which is noted as potentially faster but less power-efficient.
Last updated about 2 years ago; the repository is inactive.