Discover and explore top open-source AI tools and projects—updated daily.
LLM inference scheduler for Apple Silicon, optimizing large models across GPU, RAM, and NVMe
New!
Top 63.6% on SourcePulse
Hypura is an LLM inference scheduler specifically designed for Apple Silicon Macs, enabling users to run models that exceed the device's physical memory capacity. By intelligently distributing model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities, Hypura prevents system crashes and OOM errors, making large models runnable on consumer hardware.
How It Works
Hypura employs a storage-tier-aware scheduling approach. It analyzes model architecture, identifying tensor roles (e.g., attention layers, norms, MoE experts, FFN weights), and maps them to available hardware tiers: GPU (Metal) for fastest access, RAM for overflow, and NVMe for on-demand streaming. For Mixture-of-Experts (MoE) models, it leverages sparsity by loading only active experts from NVMe, utilizing a neuron cache with a 99.5% hit rate to minimize I/O. Dense models too large for GPU memory stream their FFN weights from NVMe through a dynamically sized pool buffer, with automatic scaling of prefetch depth based on available memory.
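The tier-aware placement described above can be sketched as a greedy assignment: order tensors by how hot their role is, then place each on the fastest tier with capacity left. This is an illustrative sketch only, not Hypura's actual (Rust) planner; the role names, priority ordering, and capacity accounting are assumptions for demonstration.

```python
# Illustrative sketch (not Hypura's actual code): greedily place tensors on
# the fastest tier with remaining capacity, hottest access roles first.

# Roles touched on every token (attention, norms) get first claim on the GPU
# tier; colder roles spill to RAM, then to NVMe for on-demand streaming.
ROLE_PRIORITY = {"attention": 0, "norm": 1, "ffn": 2, "moe_expert": 3}

def plan_placement(tensors, capacities):
    """tensors: list of (name, role, size_bytes); capacities: {tier: bytes}.
    Returns {tensor_name: tier}."""
    free = dict(capacities)  # remaining bytes per tier
    plan = {}
    # Place frequently accessed roles first so they win the fast tiers.
    for name, role, size in sorted(tensors, key=lambda t: ROLE_PRIORITY[t[1]]):
        for tier in ("gpu", "ram", "nvme"):  # fastest to slowest
            if free[tier] >= size:
                free[tier] -= size
                plan[name] = tier
                break
        else:
            raise MemoryError(f"{name} ({size} B) fits on no tier")
    return plan

# Toy byte counts chosen so each tensor lands on a different tier.
plan = plan_placement(
    [("blk.0.attn_q", "attention", 4),
     ("blk.0.ffn_up", "ffn", 8),
     ("expert.7", "moe_expert", 8)],
    {"gpu": 6, "ram": 8, "nvme": 100},
)
```

In this toy run the attention tensor claims the GPU, the FFN weight overflows to RAM, and the MoE expert falls through to NVMe, mirroring the GPU → RAM → NVMe spill order the paragraph describes.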
Quick Start & Requirements
Run `git clone --recurse-submodules https://github.com/hypura/hypura.git`, `cd hypura`, and `cargo build --release`. The binary is located at `target/release/hypura`.

- `hypura profile`: Profiles hardware (runs once, cached).
- `hypura run <model.gguf> [--prompt "..." | --interactive]`: Runs inference or interactive chat.
- `hypura serve <model.gguf>`: Starts an Ollama-compatible HTTP server.
- `hypura bench <model.gguf>`: Benchmarks Hypura against a baseline.
- `hypura inspect <model.gguf>`: Inspects the model placement plan.

Highlighted Details
- Runs models that cause llama.cpp to OOM, such as a 31 GB Mixtral 8x7B on a 32 GB Mac Mini (achieving 2.2 tok/s) and a 40 GB Llama 70B (0.3 tok/s).
- Uses `pread()` with `F_NOCACHE` to stream weights, avoiding SSD wear from write cycles.

Maintenance & Community
The project's README notes that the code was generated via prompts and not written directly by the owner, describing it as an "exploration." No specific community channels (like Discord or Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The project is licensed under the MIT license, which is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
Hypura is exclusively designed for Apple Silicon hardware. Building from source is currently required, with a Homebrew tap planned. For untested models, it is recommended to start with a small `--max-tokens` value (e.g., 10) before scaling up. Benchmarking with `--baseline` may be blocked if the model exceeds available RAM minus a 4 GB headroom.
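The headroom rule for `--baseline` runs can be sketched as a simple check. The helper name and the accounting below are hypothetical; only the "available RAM minus 4 GB" rule comes from the README.

```python
# Sketch of the --baseline gating rule described above (hypothetical helper,
# not Hypura's actual code): a baseline run loads the whole model into RAM,
# so it is refused unless the model fits in available RAM minus the headroom.
GiB = 1024 ** 3
HEADROOM = 4 * GiB  # the 4 GB safety margin the README describes

def baseline_allowed(model_bytes, available_ram_bytes, headroom=HEADROOM):
    return model_bytes <= available_ram_bytes - headroom

# A 31 GB model on a 32 GB machine exceeds 32 - 4 = 28 GB and would be blocked.
ok_small = baseline_allowed(20 * GiB, 32 * GiB)  # 20 <= 28
ok_large = baseline_allowed(31 * GiB, 32 * GiB)  # 31 > 28
```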