hypura by t8

LLM inference scheduler for Apple Silicon, optimizing large models across GPU, RAM, and NVMe

Created 1 week ago

484 stars

Top 63.6% on SourcePulse

Project Summary

Hypura is an LLM inference scheduler specifically designed for Apple Silicon Macs, enabling users to run models that exceed the device's physical memory capacity. By intelligently distributing model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities, Hypura prevents system crashes and OOM errors, making large models runnable on consumer hardware.

How It Works

Hypura employs a storage-tier-aware scheduling approach. It analyzes the model architecture to identify tensor roles (e.g., attention layers, norms, MoE experts, FFN weights) and maps them to the available hardware tiers: GPU (Metal) for fastest access, RAM for overflow, and NVMe for on-demand streaming. For Mixture-of-Experts (MoE) models, it exploits sparsity by loading only the active experts from NVMe, backed by a neuron cache with a reported 99.5% hit rate to minimize I/O. Dense models too large for GPU memory stream their FFN weights from NVMe through a dynamically sized pool buffer, with prefetch depth scaled automatically to available memory.
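The tier mapping above can be sketched as a greedy placement pass. This is an illustrative sketch, not Hypura's actual code: the `Tier`, `Role`, and `Budgets` names, the capacities, and the fallback rules are all assumptions made for the example.

```rust
// Illustrative sketch (not Hypura's actual scheduler): assign each tensor to
// the fastest tier with remaining capacity, letting bulk FFN/expert weights
// fall through to NVMe for on-demand streaming.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Gpu, Ram, Nvme }

#[derive(Debug, Clone, Copy)]
enum Role { Attention, Norm, Ffn, MoeExpert }

struct Budgets { gpu_bytes: u64, ram_bytes: u64 }

/// Greedily place one tensor. Hot, small tensors (attention, norms) are kept
/// in memory; large streamable weights may land on NVMe.
fn place(role: Role, size: u64, b: &mut Budgets) -> Tier {
    if b.gpu_bytes >= size {
        b.gpu_bytes -= size;
        return Tier::Gpu;
    }
    if b.ram_bytes >= size {
        b.ram_bytes -= size;
        return Tier::Ram;
    }
    match role {
        // A real scheduler would evict colder tensors rather than
        // overcommit; this sketch just never streams hot tensors.
        Role::Attention | Role::Norm => Tier::Ram,
        Role::Ffn | Role::MoeExpert => Tier::Nvme,
    }
}

fn main() {
    let mut b = Budgets { gpu_bytes: 8, ram_bytes: 4 }; // toy capacities
    println!("{:?}", place(Role::Attention, 6, &mut b)); // Gpu
    println!("{:?}", place(Role::Ffn, 4, &mut b));       // Ram (only 2 left on GPU)
    println!("{:?}", place(Role::MoeExpert, 4, &mut b)); // Nvme
}
```

The key design point the summary describes is that the fallback decision is role-aware: frequently touched tensors never end up behind NVMe latency.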

Quick Start & Requirements

  • Installation: Requires Rust 1.75+ and CMake. Build from source with `git clone --recurse-submodules https://github.com/hypura/hypura.git`, `cd hypura`, and `cargo build --release`. The binary is located at `target/release/hypura`.
  • Prerequisites: Apple Silicon hardware.
  • Primary Commands:
    • `hypura profile`: Profiles hardware (runs once; results are cached).
    • `hypura run <model.gguf> [--prompt "..." | --interactive]`: Runs one-shot inference or an interactive chat.
    • `hypura serve <model.gguf>`: Starts an Ollama-compatible HTTP server.
    • `hypura bench <model.gguf>`: Benchmarks Hypura against a baseline.
    • `hypura inspect <model.gguf>`: Shows the model's tensor placement plan.
  • Links: GitHub Repository

Highlighted Details

  • Enables running models that cause llama.cpp to OOM, such as a 31 GB Mixtral 8x7B on a 32 GB Mac Mini (achieving 2.2 tok/s) and a 40 GB Llama 70B (0.3 tok/s).
  • For models that fit within GPU and RAM, Hypura adds zero overhead, running at full Metal GPU speed.
  • Expert-streaming for MoE models and FFN-streaming for dense models are automatically selected based on model size and hardware.
  • NVMe I/O is read-only: weights are streamed with `pread()` and `F_NOCACHE`, avoiding SSD wear from write cycles.
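The automatic choice between expert-streaming and FFN-streaming can be pictured as a simple capacity check. A minimal sketch, assuming the selection logic works roughly as the summary describes; the `Mode` names, signature, and toy sizes are invented for illustration:

```rust
// Hypothetical mode selection: full-speed residency when the model fits in
// GPU + RAM, expert-streaming for MoE models, FFN-streaming for large dense
// models. Thresholds and names are illustrative, not Hypura's API.

#[derive(Debug, PartialEq)]
enum Mode { Resident, ExpertStreaming, FfnStreaming }

fn select_mode(model_bytes: u64, is_moe: bool, gpu_bytes: u64, ram_bytes: u64) -> Mode {
    if model_bytes <= gpu_bytes + ram_bytes {
        Mode::Resident        // fits in GPU + RAM: zero overhead, full Metal speed
    } else if is_moe {
        Mode::ExpertStreaming // sparse: load only active experts from NVMe
    } else {
        Mode::FfnStreaming    // dense: stream FFN weights through a pool buffer
    }
}

fn main() {
    const GB: u64 = 1 << 30;
    // A 31 GB MoE model with only 28 GB of usable GPU + RAM:
    println!("{:?}", select_mode(31 * GB, true, 20 * GB, 8 * GB));  // ExpertStreaming
    // A 40 GB dense model on the same machine:
    println!("{:?}", select_mode(40 * GB, false, 20 * GB, 8 * GB)); // FfnStreaming
}
```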

Maintenance & Community

The project's README notes that the code was generated via prompts and not written directly by the owner, describing it as an "exploration." No specific community channels (like Discord or Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is licensed under the MIT license, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Hypura is exclusively designed for Apple Silicon hardware. Building from source is currently required, with a Homebrew tap planned. For untested models, it is recommended to start with a small `--max-tokens` value (e.g., 10) before scaling up. Benchmarking with `--baseline` may be blocked if the model exceeds available RAM minus a 4 GB headroom.
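The 4 GB headroom rule for `--baseline` amounts to a one-line arithmetic check. This is a sketch of the assumed semantics only; the function name and exact comparison are not from Hypura's source:

```rust
// Assumed semantics of the --baseline guard: refuse the benchmark when the
// model would not fit in available RAM minus a fixed 4 GB safety margin.

const HEADROOM_BYTES: u64 = 4 << 30; // 4 GB

fn baseline_allowed(model_bytes: u64, available_ram_bytes: u64) -> bool {
    // saturating_sub avoids underflow on machines with less than 4 GB free.
    available_ram_bytes.saturating_sub(HEADROOM_BYTES) >= model_bytes
}

fn main() {
    const GB: u64 = 1 << 30;
    // 31 GB model, 32 GB RAM: 32 - 4 = 28 GB < 31 GB, so the baseline is blocked.
    println!("{}", baseline_allowed(31 * GB, 32 * GB)); // false
    // A 13 GB model on the same machine is fine.
    println!("{}", baseline_allowed(13 * GB, 32 * GB)); // true
}
```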

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 2
  • Star History: 490 stars in the last 13 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.
