LLaMA_MPS by jankais3r

LLM inference on Apple Silicon GPUs

Created 2 years ago
587 stars

Top 55.3% on SourcePulse

View on GitHub
Project Summary

This repository enables inference of Meta's LLaMA and Stanford's Alpaca large language models on Apple Silicon GPUs using the Metal Performance Shaders (MPS) backend. It targets developers and researchers with Apple hardware seeking to run these models locally, offering a Python-based solution for efficient on-device execution.

How It Works

The project leverages PyTorch's MPS backend to offload computations to Apple's integrated GPUs. It includes scripts for resharding larger model weights (13B, 30B, 65B) into a single file suitable for single-GPU inference. The core inference is handled by chat.py, which supports both raw LLaMA completion and instruction-following via Alpaca weights.
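As a rough illustration of the approach (not code from this repository), PyTorch routes work to the Apple GPU once tensors and modules are placed on the "mps" device; the toy linear layer below stands in for a transformer block and assumes an MPS-enabled PyTorch build on Apple Silicon:

  import torch
  import torch.nn as nn

  # Toy fp16 layer standing in for a transformer block; a real model's
  # weights would be moved to the "mps" device in the same manner.
  layer = nn.Linear(4096, 4096).half().to("mps")
  x = torch.randn(1, 4096, dtype=torch.float16, device="mps")
  y = layer(x)     # the matmul executes on the Apple GPU via Metal
  print(y.device)  # mps:0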

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt and pip3 install -e .
  • Requires Python 3.x and PyTorch with MPS support.
  • Model weights must be downloaded and placed in a models directory.
  • Resharding is necessary for models larger than 7B.
  • See the README for detailed setup and inference commands; a minimal environment check is sketched after this list.
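Before downloading any weights, it can be worth confirming that the local PyTorch build actually exposes the MPS device. This is a generic check, not one of the repository's scripts:

  import torch

  # The MPS backend requires macOS 12.3+ and a PyTorch build with MPS support.
  print("PyTorch version:", torch.__version__)
  print("MPS built:      ", torch.backends.mps.is_built())
  print("MPS available:  ", torch.backends.mps.is_available())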

Highlighted Details

  • Enables LLaMA and Alpaca inference on Apple Silicon GPUs.
  • Includes scripts for model resharding and state dict conversion.
  • Provides memory requirements for various model sizes (7B, 13B, 30B, 65B).
  • Benchmarks show 3.41 tokens/s for 13B fp16 on MPS vs. 3.66 tokens/s for llama.cpp on CPU, with significantly lower power consumption and heat for MPS.

Maintenance & Community

  • Credits acknowledge Facebook Research along with individual contributors markasoftware, remixer-dec, venuatu, benob, and tloen.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository itself appears to be MIT licensed.
  • However, it relies on LLaMA weights, which have their own specific licensing terms that may restrict commercial use.

Limitations & Caveats

  • Performance is benchmarked only against llama.cpp on CPU, which is marginally faster in the published numbers but less power-efficient.
  • Memory requirements are substantial: 30B models need 66 GB+ of RAM for inference (a rough weights-only estimate is sketched after this list).
  • Support for 65B models is listed as "needs testing."
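A back-of-the-envelope calculation shows where those memory figures come from. This is a rough weights-only estimate, not the repository's own accounting; the stated 66 GB+ also covers activations and runtime overhead:

  # Rough fp16 memory estimate: 2 bytes per parameter, weights only.
  for params_b in (7, 13, 30, 65):
      gib = params_b * 1e9 * 2 / 2**30
      print(f"{params_b}B fp16 weights ~ {gib:.0f} GiB")
  # Output: 7B ~ 13 GiB, 13B ~ 24 GiB, 30B ~ 56 GiB, 65B ~ 121 GiB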

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

  • airllm by lyogavin — Inference optimization for LLMs on low-resource hardware. 6k stars (top 0.1% on SourcePulse). Created 2 years ago, updated 2 weeks ago.