mac-code by walter-grace

AI inference for Macs, running large models out-of-core

Created 2 weeks ago

580 stars

Top 55.6% on SourcePulse

Summary

This project enables running large AI models (35B+ parameters) on Apple Silicon Macs with limited RAM by leveraging innovative out-of-core inference techniques. It targets users seeking powerful, free, local LLM capabilities, offering coherent reasoning and agent functionality without compromising quality or incurring cloud costs.

How It Works

The core challenge is running models that exceed physical RAM. For MoE models, "MoE Expert Sniper" selectively loads only the active experts from SSD, reaching 1.54 tok/s with minimal RAM. For dense models, "Flash Streaming" uses direct SSD I/O (F_NOCACHE, pread) to stream FFN layers, bypassing OS cache thrashing, at 0.15 tok/s. An "LLM-as-Router" architecture lets the model self-classify intent for tool use (search, shell), and KV cache quantization supports a 64K context. Both llama.cpp and MLX backends are supported; the MLX path adds persistent context saving to R2.
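The direct-I/O read path behind Flash Streaming can be sketched as follows. This is a minimal illustration, not the project's actual code: the layer offset table is hypothetical (a real loader would parse it from the GGUF header), and F_NOCACHE is macOS-specific, so it is guarded by a platform check.

```python
import os
import sys
import fcntl

# Hypothetical (offset, length) table for FFN layer tensors in a weights
# file; a real implementation would derive these from the GGUF metadata.
LAYER_TABLE = {0: (0, 4096), 1: (4096, 4096)}

F_NOCACHE = getattr(fcntl, "F_NOCACHE", 48)  # 48 is the macOS constant


def open_weights(path):
    """Open the model file; on macOS, ask the kernel not to cache it."""
    fd = os.open(path, os.O_RDONLY)
    if sys.platform == "darwin":
        # F_NOCACHE disables buffer-cache use for this descriptor, so
        # streaming weights larger than RAM cannot thrash the OS cache.
        fcntl.fcntl(fd, F_NOCACHE, 1)
    return fd


def read_layer(fd, layer_id):
    """Stream one layer with pread: a positioned read, no shared seek state."""
    offset, length = LAYER_TABLE[layer_id]
    return os.pread(fd, length, offset)
```

Using pread rather than seek-then-read keeps the read position explicit per call, which is what makes streaming successive layers from fixed offsets cheap and thread-safe.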

Quick Start & Requirements

Install llama.cpp via brew install llama.cpp (or pip3 install mlx-lm for the MLX backend), then download the required GGUF models. The default setup runs a 35B MoE model (Qwen3.5-35B-A3B-UD-IQ2_M.gguf) on llama.cpp via llama-server and python3 agent.py. An alternative MLX setup uses a 9B model for the full 64K context. Prerequisites: an Apple Silicon Mac, Python 3, and the rich, ddgs, huggingface-hub, and (for MLX) mlx-lm packages.
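The default llama.cpp path might look like the following. The install and package names come from the summary above; the exact llama-server flags and model download source are assumptions, not the repo's documented commands.

```shell
# Inference engine (Homebrew build of llama.cpp).
brew install llama.cpp

# Python dependencies named in the prerequisites.
pip3 install rich ddgs huggingface-hub

# Download the default 35B MoE model (source repo not specified here):
# huggingface-cli download <repo> Qwen3.5-35B-A3B-UD-IQ2_M.gguf

# Serve the model, then run the agent against it (assumed invocation).
llama-server -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf --port 8080
python3 agent.py
```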

Highlighted Details

  • Runs 22 GB Qwen3.5-35B MoE on 16 GB RAM using only 1.42 GB RAM at 1.54 tok/s.
  • Achieves 0.15 tok/s for 18.4 GB dense 32B model (4-bit) with 4.5 GB RAM via Flash Streaming, outperforming mmap by 9x.
  • LLM self-classifies intent for tool use (search, shell, chat) with 8/8 accuracy.
  • Enables 64K context via 4x KV cache compression.
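The 4x figure in the last bullet follows from storing KV entries as 4-bit codes instead of fp16 (0.5 bytes per value instead of 2). A toy absmax int4 quantizer, not the repo's actual implementation, shows the arithmetic:

```python
def quantize_kv_4bit(values):
    """Absmax-quantize floats to signed 4-bit codes, packed two per byte."""
    scale = max(abs(v) for v in values) / 7 or 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0xF
        hi = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0
        packed.append(lo | (hi << 4))
    return scale, bytes(packed)


def dequantize_kv_4bit(scale, packed, n):
    """Unpack nibbles back to approximate floats (sign-extend 4-bit codes)."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            signed = nib - 16 if nib >= 8 else nib
            out.append(signed * scale)
    return out[:n]
```

Relative to fp16's 2 bytes per entry, the packed form costs 0.5 bytes per entry (plus one scale per block), giving the 4x compression that stretches a fixed RAM budget to a 64K context.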

Maintenance & Community

Leverages models like Qwen3.5 and the llama.cpp/MLX engines. Builds on Apple's "LLM in a Flash" and Google's "TurboQuant" research. No community channels or contributor details are listed.

Licensing & Compatibility

Released under the MIT license, permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

"Flash Streaming" for dense models is experimental and markedly slower than MoE expert loading (0.15 vs 1.54 tok/s). The project primarily targets Apple Silicon Macs.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
4
Star History
590 stars in the last 19 days
