LLaMA_MPS by jankais3r

LLM inference on Apple Silicon GPUs

created 2 years ago
589 stars

Top 56.0% on sourcepulse

Project Summary

This repository enables inference of Meta's LLaMA and Stanford's Alpaca large language models on Apple Silicon GPUs using the Metal Performance Shaders (MPS) backend. It targets developers and researchers with Apple hardware seeking to run these models locally, offering a Python-based solution for efficient on-device execution.

How It Works

The project leverages PyTorch's MPS backend to offload computations to Apple's integrated GPUs. It includes scripts for resharding larger model weights (13B, 30B, 65B) into a single file suitable for single-GPU inference. The core inference is handled by chat.py, which supports both raw LLaMA completion and instruction-following via Alpaca weights.
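As a rough illustration of the MPS offload path (not code from the repository), the sketch below moves a toy model and its inputs to the `mps` device when PyTorch reports it as available; the layer sizes and tensor shapes are placeholders, not LLaMA's.

```python
import torch
import torch.nn as nn

# Use the Metal (MPS) backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Toy stand-in for a transformer layer; the real project loads LLaMA weights.
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).to(device)

x = torch.randn(1, 16, 512, device=device)
with torch.no_grad():
    y = model(x)
print(y.shape, y.device)  # runs on the Apple GPU when device is "mps"
```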

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt and pip3 install -e .
  • Requires Python 3.x and a PyTorch build with MPS support.
  • Model weights must be downloaded and placed in a models directory.
  • Resharding is necessary for models larger than 7B.
  • See the README for detailed setup and inference commands; a quick environment check is sketched below.
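As a sanity check before running inference (not part of the repository), one can confirm that the installed PyTorch build exposes the MPS backend and that the weights directory is in place; the `models/7B` path is an assumption based on the layout described above.

```python
import os
import torch

print("PyTorch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

# Assumed weights layout: a models/ directory holding per-size subfolders.
ckpt_dir = "models/7B"
print("Weights present:", os.path.isdir(ckpt_dir))
```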

Highlighted Details

  • Enables LLaMA and Alpaca inference on Apple Silicon GPUs.
  • Includes scripts for model resharding and state dict conversion.
  • Provides memory requirements for each model size (7B, 13B, 30B, 65B); a rough fp16 estimate is sketched after this list.
  • Benchmarks show 3.41 tokens/s for 13B fp16 on MPS vs. 3.66 tokens/s for llama.cpp on CPU, with significantly lower power consumption and heat for MPS.
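As a back-of-the-envelope check on why those memory figures are large (my own arithmetic, not numbers from the README), fp16 weights alone cost about 2 bytes per parameter, before any KV cache or runtime overhead:

```python
# Rough fp16 weight footprint: 2 bytes per parameter. Real usage is higher
# because of the KV cache, activations, and framework overhead.
for name, params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9), ("65B", 65e9)]:
    gib = params * 2 / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of fp16 weights")
```

For the 30B model this gives roughly 56 GiB of weights, consistent with the 66 GB+ RAM requirement noted below once overhead is included.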

Maintenance & Community

  • Credits acknowledge Facebook Research along with individual contributors markasoftware, remixer-dec, venuatu, benob, and tloen.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The repository itself appears to be MIT licensed.
  • However, it relies on LLaMA weights, which have their own specific licensing terms that may restrict commercial use.

Limitations & Caveats

  • Performance is benchmarked against llama.cpp, which is noted as potentially faster but less power-efficient.
  • Memory requirements are substantial; the 30B model needs 66 GB+ of RAM for inference.
  • Support for 65B models is listed as "needs testing."
Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
