mac-code by walter-grace

AI inference for Macs, running large models out-of-core

Created 2 weeks ago

580 stars

Top 55.6% on SourcePulse

Summary

This project enables running large AI models (35B+ parameters) on Apple Silicon Macs with limited RAM by leveraging innovative out-of-core inference techniques. It targets users seeking powerful, free, local LLM capabilities, offering coherent reasoning and agent functionality without compromising quality or incurring cloud costs.

How It Works

The core challenge is running models that exceed physical RAM. For MoE models, "MoE Expert Sniper" selectively loads only the active experts from SSD, reaching 1.54 tok/s with minimal RAM. For dense models, "Flash Streaming" uses direct SSD I/O (F_NOCACHE, pread) to stream FFN layers, bypassing OS cache thrashing, at 0.15 tok/s. An "LLM-as-Router" architecture lets the model self-classify intent for tool use (search, shell), and KV cache quantization supports a 64K context. Both llama.cpp and MLX backends are supported; the MLX path adds persistent context saving to R2.
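The direct-I/O read path behind Flash Streaming can be sketched as follows. This is a minimal illustration, not the project's actual code: the layer offset table is hypothetical (a real loader would parse it from the GGUF header), and F_NOCACHE is macOS-specific, so it is guarded by a platform check.

```python
import os
import sys
import fcntl

# Hypothetical (offset, length) table for FFN layer tensors in a weights
# file; a real implementation would derive these from the GGUF metadata.
LAYER_TABLE = {0: (0, 4096), 1: (4096, 4096)}

F_NOCACHE = getattr(fcntl, "F_NOCACHE", 48)  # 48 is the macOS constant


def open_weights(path):
    """Open the model file; on macOS, ask the kernel not to cache it."""
    fd = os.open(path, os.O_RDONLY)
    if sys.platform == "darwin":
        # F_NOCACHE disables buffer-cache use for this descriptor, so
        # streaming weights larger than RAM cannot thrash the OS cache.
        fcntl.fcntl(fd, F_NOCACHE, 1)
    return fd


def read_layer(fd, layer_id):
    """Stream one layer with pread: a positioned read, no shared seek state."""
    offset, length = LAYER_TABLE[layer_id]
    return os.pread(fd, length, offset)
```

Using pread rather than seek-then-read keeps the read position explicit per call, which is what makes streaming successive layers from fixed offsets cheap and thread-safe.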

Quick Start & Requirements

Install llama.cpp via brew install llama.cpp (or pip3 install mlx-lm for the MLX backend), then download the required GGUF models. The default setup runs a 35B MoE model (Qwen3.5-35B-A3B-UD-IQ2_M.gguf) on llama.cpp via llama-server and python3 agent.py. An alternative MLX setup uses a 9B model for the full 64K context. Prerequisites: an Apple Silicon Mac, Python 3, and the rich, ddgs, huggingface-hub, and (for MLX) mlx-lm packages.
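The default llama.cpp path might look like the following. The install and package names come from the summary above; the exact llama-server flags and model download source are assumptions, not the repo's documented commands.

```shell
# Inference engine (Homebrew build of llama.cpp).
brew install llama.cpp

# Python dependencies named in the prerequisites.
pip3 install rich ddgs huggingface-hub

# Download the default 35B MoE model (source repo not specified here):
# huggingface-cli download <repo> Qwen3.5-35B-A3B-UD-IQ2_M.gguf

# Serve the model, then run the agent against it (assumed invocation).
llama-server -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf --port 8080
python3 agent.py
```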

Highlighted Details

  • Runs 22 GB Qwen3.5-35B MoE on 16 GB RAM using only 1.42 GB RAM at 1.54 tok/s.
  • Achieves 0.15 tok/s for 18.4 GB dense 32B model (4-bit) with 4.5 GB RAM via Flash Streaming, outperforming mmap by 9x.
  • LLM self-classifies intent for tool use (search, shell, chat) with 8/8 accuracy.
  • Enables 64K context via 4x KV cache compression.
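The 4x figure in the last bullet follows from storing KV entries as 4-bit codes instead of fp16 (0.5 bytes per value instead of 2). A toy absmax int4 quantizer, not the repo's actual implementation, shows the arithmetic:

```python
def quantize_kv_4bit(values):
    """Absmax-quantize floats to signed 4-bit codes, packed two per byte."""
    scale = max(abs(v) for v in values) / 7 or 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0xF
        hi = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0
        packed.append(lo | (hi << 4))
    return scale, bytes(packed)


def dequantize_kv_4bit(scale, packed, n):
    """Unpack nibbles back to approximate floats (sign-extend 4-bit codes)."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            signed = nib - 16 if nib >= 8 else nib
            out.append(signed * scale)
    return out[:n]
```

Relative to fp16's 2 bytes per entry, the packed form costs 0.5 bytes per entry (plus one scale per block), giving the 4x compression that stretches a fixed RAM budget to a 64K context.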

Maintenance & Community

Leverages models like Qwen3.5 and the llama.cpp/MLX engines. Builds on Apple's "LLM in a Flash" and Google's "TurboQuant" research. No community channels or contributor details are listed.

Licensing & Compatibility

Released under the MIT license, permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

"Flash Streaming" for dense models is experimental and markedly slower than MoE expert loading (0.15 vs 1.54 tok/s). The project primarily targets Apple Silicon Macs.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
4
Star History
590 stars in the last 19 days
