walter-grace
AI inference for Macs, running large models out-of-core
Summary
This project runs large AI models (35B+ parameters) on Apple Silicon Macs with limited RAM by using out-of-core inference: weights that do not fit in memory stay on the SSD and are read as needed. It targets users who want capable, free, local LLMs with coherent reasoning and agent functionality, without sacrificing output quality or incurring cloud costs.
How It Works
The core techniques handle models that exceed available RAM. For MoE models, "MoE Expert Sniper" loads only the active experts from SSD, reaching 1.54 tok/s with minimal RAM. For dense models, "Flash Streaming" streams FFN layers from SSD with direct I/O (F_NOCACHE, pread), which avoids OS page-cache thrashing and reaches 0.15 tok/s on large models. An "LLM-as-Router" architecture lets the model classify its own requests to decide when to use tools (search, shell), and KV cache quantization supports a 64K context. The project supports llama.cpp and MLX backends; the MLX backend can also save context persistently to R2.
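The direct-I/O pattern named above (F_NOCACHE plus pread) can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names and the layer_table layout (a list of byte offsets and sizes) are hypothetical, and the sketch only reads raw bytes rather than running any model math. F_NOCACHE is macOS-specific, so the code skips it on platforms where the fcntl module does not define it.

```python
import fcntl
import os


def open_uncached(path):
    """Open a file read-only and, where supported, disable OS caching for it."""
    fd = os.open(path, os.O_RDONLY)
    # F_NOCACHE is only defined on macOS; elsewhere we fall back to cached reads.
    nocache = getattr(fcntl, "F_NOCACHE", None)
    if nocache is not None:
        fcntl.fcntl(fd, nocache, 1)
    return fd


def read_layer(fd, offset, size):
    """Read exactly `size` bytes at `offset` with pread, without moving the file position."""
    buf = bytearray()
    while len(buf) < size:
        chunk = os.pread(fd, size - len(buf), offset + len(buf))
        if not chunk:
            raise EOFError("file ended before the layer was fully read")
        buf += chunk
    return bytes(buf)


def stream_layers(path, layer_table):
    """Yield one layer's bytes at a time so only a single layer is resident in memory.

    `layer_table` is a hypothetical list of (offset, size) pairs, one per layer.
    """
    fd = open_uncached(path)
    try:
        for offset, size in layer_table:
            yield read_layer(fd, offset, size)
    finally:
        os.close(fd)
```

A caller would iterate over stream_layers(path, layer_table), use each layer's bytes for one step of the forward pass, and drop them before the next layer arrives. Because pread takes an explicit offset, reads for different layers do not depend on a shared file position.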
Quick Start & Requirements
Installation involves brew install llama.cpp or pip3 install mlx-lm, followed by downloading specific GGUF models. The default setup uses llama.cpp with a 35B MoE model (Qwen3.5-35B-A3B-UD-IQ2_M.gguf), run via llama-server and python3 agent.py. An alternative MLX setup uses a 9B model for 64K context. Prerequisites include an Apple Silicon Mac, Python 3, and packages like rich, ddgs, huggingface-hub, mlx-lm.
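Before downloading a multi-gigabyte model, it can help to confirm the prerequisites listed above. The following is a small sketch of such a check, not part of the project. The import names (for example huggingface_hub for the huggingface-hub package and mlx_lm for mlx-lm) are assumptions based on how those packages are usually imported, and check_prerequisites is a hypothetical helper.

```python
import importlib.util
import platform
import shutil

# Assumed import names for the packages rich, ddgs, huggingface-hub, and mlx-lm.
REQUIRED_MODULES = ["rich", "ddgs", "huggingface_hub", "mlx_lm"]


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    report = {
        # A native Python on Apple Silicon reports Darwin / arm64.
        "apple_silicon": platform.system() == "Darwin" and platform.machine() == "arm64",
        # llama-server is only needed for the llama.cpp setup.
        "llama_server_on_path": shutil.which("llama-server") is not None,
    }
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'missing'}  {item}")
```

Not every entry is needed for every setup: mlx_lm matters only for the MLX path, and llama-server only for the llama.cpp path. A Python interpreter running under Rosetta reports x86_64, so the apple_silicon check can be False even on Apple Silicon hardware.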
Highlighted Details
Maintenance & Community
Leverages models like Qwen3.5 and engines like llama.cpp/MLX. Builds on Apple ("LLM in a Flash") and Google ("TurboQuant") research. No specific community channels or contributor details are listed.
Licensing & Compatibility
Released under the MIT license, permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
"Flash Streaming" for dense models is experimental and much slower than "MoE Expert Sniper" (a reported 0.15 tok/s versus 1.54 tok/s). The project primarily targets Apple Silicon Macs.
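To put the two reported throughput figures (1.54 tok/s for MoE expert loading, 0.15 tok/s for dense streaming) in practical terms, the arithmetic below estimates how long a reply would take at each rate. The 100-token reply length is an arbitrary assumption chosen for illustration, and the figures may come from different models, so this is a rough comparison only.

```python
MOE_TOK_PER_S = 1.54    # reported rate for "MoE Expert Sniper"
DENSE_TOK_PER_S = 0.15  # reported rate for "Flash Streaming"


def seconds_for(tokens, tok_per_s):
    """Time in seconds to generate `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


reply_tokens = 100  # assumed reply length, for illustration only

moe_seconds = seconds_for(reply_tokens, MOE_TOK_PER_S)      # about 65 s
dense_seconds = seconds_for(reply_tokens, DENSE_TOK_PER_S)  # about 667 s, roughly 11 minutes
ratio = MOE_TOK_PER_S / DENSE_TOK_PER_S                     # about 10.3x

print(f"MoE:   {moe_seconds:.1f} s")
print(f"Dense: {dense_seconds:.1f} s")
print(f"Ratio: {ratio:.1f}x")
```

At these rates the MoE path is roughly ten times faster, which is the gap to keep in mind when choosing between a MoE model and a dense model of similar size.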