LLaMA_MPS by jankais3r

LLM inference on Apple Silicon GPUs

Created 2 years ago
587 stars

Top 55.3% on SourcePulse

View on GitHub
Project Summary

This repository enables inference of Meta's LLaMA and Stanford's Alpaca large language models on Apple Silicon GPUs using the Metal Performance Shaders (MPS) backend. It targets developers and researchers with Apple hardware seeking to run these models locally, offering a Python-based solution for efficient on-device execution.

How It Works

The project leverages PyTorch's MPS backend to offload computations to Apple's integrated GPUs. It includes scripts for resharding larger model weights (13B, 30B, 65B) into a single file suitable for single-GPU inference. The core inference is handled by chat.py, which supports both raw LLaMA completion and instruction-following via Alpaca weights.
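As a rough illustration of the approach (not code from this repository), PyTorch routes work to the Apple GPU once tensors and modules are placed on the "mps" device; the toy linear layer below stands in for a transformer block and assumes an MPS-enabled PyTorch build on Apple Silicon:

  import torch
  import torch.nn as nn

  # Toy fp16 layer standing in for a transformer block; a real model's
  # weights would be moved to the "mps" device in the same manner.
  layer = nn.Linear(4096, 4096).half().to("mps")
  x = torch.randn(1, 4096, dtype=torch.float16, device="mps")
  y = layer(x)     # the matmul executes on the Apple GPU via Metal
  print(y.device)  # mps:0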

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt and pip3 install -e .
  • Requires Python 3.x and PyTorch with MPS support.
  • Model weights must be downloaded and placed in a models directory.
  • Resharding is necessary for models larger than 7B.
  • See the README for detailed setup and inference commands; a minimal environment check is sketched after this list.
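Before downloading any weights, it can be worth confirming that the local PyTorch build actually exposes the MPS device. This is a generic check, not one of the repository's scripts:

  import torch

  # The MPS backend requires macOS 12.3+ and a PyTorch build with MPS support.
  print("PyTorch version:", torch.__version__)
  print("MPS built:      ", torch.backends.mps.is_built())
  print("MPS available:  ", torch.backends.mps.is_available())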

Highlighted Details

  • Enables LLaMA and Alpaca inference on Apple Silicon GPUs.
  • Includes scripts for model resharding and state dict conversion.
  • Provides memory requirements for various model sizes (7B, 13B, 30B, 65B).
  • Benchmarks show 3.41 tokens/s for 13B fp16 on MPS vs. 3.66 tokens/s for llama.cpp on CPU, with significantly lower power consumption and heat for MPS.

Maintenance & Community

  • Credits acknowledge Facebook Research along with individual contributors markasoftware, remixer-dec, venuatu, benob, and tloen.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository itself appears to be MIT licensed.
  • However, it relies on LLaMA weights, which have their own specific licensing terms that may restrict commercial use.

Limitations & Caveats

  • Performance is benchmarked only against llama.cpp on CPU, which is marginally faster in the published numbers but less power-efficient.
  • Memory requirements are substantial: 30B models need 66 GB+ of RAM for inference (a rough weights-only estimate is sketched after this list).
  • Support for 65B models is listed as "needs testing."
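A back-of-the-envelope calculation shows where those memory figures come from. This is a rough weights-only estimate, not the repository's own accounting; the stated 66 GB+ also covers activations and runtime overhead:

  # Rough fp16 memory estimate: 2 bytes per parameter, weights only.
  for params_b in (7, 13, 30, 65):
      gib = params_b * 1e9 * 2 / 2**30
      print(f"{params_b}B fp16 weights ~ {gib:.0f} GiB")
  # Output: 7B ~ 13 GiB, 13B ~ 24 GiB, 30B ~ 56 GiB, 65B ~ 121 GiB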

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

  • airllm by lyogavin — Inference optimization for LLMs on low-resource hardware. 6k stars (top 0.1% on SourcePulse). Created 2 years ago, updated 2 weeks ago.