hypura by t8

LLM inference scheduler for Apple Silicon, optimizing large models across GPU, RAM, and NVMe

Created 4 months ago

661 stars

Top 50.1% on SourcePulse

View on GitHub

3 Experts Love This Project

Jonathan Ragan-Kelley

Professor at MIT

Jeff Hammerbacher

Cofounder of Cloudera

Jane Manchun Wong

Security Researcher

Project Summary

Hypura is an LLM inference scheduler specifically designed for Apple Silicon Macs, enabling users to run models that exceed the device's physical memory capacity. By intelligently distributing model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities, Hypura prevents system crashes and OOM errors, making large models runnable on consumer hardware.

How It Works

Hypura employs a storage-tier-aware scheduling approach. It analyzes model architecture, identifying tensor roles (e.g., attention layers, norms, MoE experts, FFN weights), and maps them to available hardware tiers: GPU (Metal) for fastest access, RAM for overflow, and NVMe for on-demand streaming. For Mixture-of-Experts (MoE) models, it leverages sparsity by loading only active experts from NVMe, utilizing a neuron cache with a 99.5% hit rate to minimize I/O. Dense models too large for GPU memory stream their FFN weights from NVMe through a dynamically sized pool buffer, with automatic scaling of prefetch depth based on available memory.

Quick Start & Requirements

Installation: Requires Rust 1.75+ and CMake. Build from source using git clone --recurse-submodules https://github.com/hypura/hypura.git, cd hypura, and cargo build --release. The binary is located at target/release/hypura.
Prerequisites: Apple Silicon hardware.
Primary Commands:
- hypura profile: Profiles hardware (runs once, cached).
- hypura run <model.gguf> [--prompt "..." | --interactive]: Runs inference or interactive chat.
- hypura serve <model.gguf>: Starts an Ollama-compatible HTTP server.
- hypura bench <model.gguf>: Benchmarks Hypura against a baseline.
- hypura inspect <model.gguf>: Inspects model placement plan.
Links: GitHub Repository

Highlighted Details

Enables running models that cause llama.cpp to OOM, such as a 31 GB Mixtral 8x7B on a 32 GB Mac Mini (achieving 2.2 tok/s) and a 40 GB Llama 70B (0.3 tok/s).
For models that fit within GPU and RAM, Hypura adds zero overhead, running at full Metal GPU speed.
Expert-streaming for MoE models and FFN-streaming for dense models are automatically selected based on model size and hardware.
NVMe I/O is read-only, using pread() with F_NOCACHE to stream weights, avoiding SSD wear from write cycles.

Maintenance & Community

The project's README notes that the code was generated via prompts and not written directly by the owner, describing it as an "exploration." No specific community channels (like Discord or Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is licensed under the MIT license, which is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Hypura is exclusively designed for Apple Silicon hardware. Building from source is currently required, with a Homebrew tap planned. For untested models, it is recommended to start with a small `--max-tokens` value (e.g., 10) before scaling up. Benchmarking with `--baseline` may be blocked if the model exceeds available RAM minus a 4 GB headroom.

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

9 stars in the last 30 days