SwiftLM  by SharpAI

Blazingly fast LLM inference for Apple Silicon

Created 2 months ago
662 stars

Top 50.3% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

Summary

SwiftLM is a high-performance, native Swift inference server for Apple Silicon, designed to run MLX models directly without a Python runtime. It offers an OpenAI-compatible API, targeting developers and power users seeking maximum efficiency and bare-metal speed on macOS and iOS. The project eliminates Python overhead and GIL contention, providing a streamlined path to LLM deployment on Apple hardware.

How It Works

This project leverages Swift, Apple's MLX framework, and Metal for native GPU acceleration, compiling into a single, efficient binary. It bypasses Python and its associated performance bottlenecks. Key architectural choices include TurboQuantization for aggressive KV cache compression (achieving ~3.6 bits per coordinate) and experimental SSD Expert Streaming, which offloads Mixture of Experts (MoE) layers directly from NVMe SSDs to the GPU command buffer. This approach mitigates macOS unified memory limitations, enabling larger models on constrained hardware.

Quick Start & Requirements

  • Installation: Download pre-built binaries from the project's Releases page or build from source using the ./build.sh script.
  • Running: Execute the binary with --model <model_id> and --port <port>, e.g., ./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413. Models download automatically if not cached.
  • Prerequisites: macOS 14.0+, Apple Silicon (M1/M2/M3/M4/M5), Xcode Command Line Tools, Metal Toolchain (xcodebuild -downloadComponent MetalToolchain).
  • Setup: Building involves submodule initialization, CMake, Metal kernel compilation, and Swift compilation.

Highlighted Details

  • 100% Native Apple Silicon: Built entirely with Swift and Metal, optimized for Apple hardware.
  • OpenAI-Compatible API: Provides endpoints like /v1/chat/completions for seamless integration with existing OpenAI SDKs.
  • TurboQuantization: Implements a hybrid V2+V3 architecture for on-the-fly KV cache compression, achieving near-zero accuracy loss at ~3.5x compression.
  • SSD Expert Streaming: Experimental feature for zero-copy streaming of MoE layers from NVMe SSD, preventing Watchdog kernel panics on very large models.
  • SwiftBuddy iOS App: A companion application for on-device LLM inference via MLX Swift on iPhones and iPads.

Maintenance & Community

The project relies heavily on the MLX community and various open-source projects for its foundation. Specific community channels or direct contributor information are not detailed in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

The SSD Expert Streaming feature is experimental. Aggressive quantization (e.g., 2-bit) can lead to model instability and break features like OpenAI-compatible tool calling due to JSON grammar corruption.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
6
Star History
81 stars in the last 30 days

Explore Similar Projects

Starred by Balaji Srinivasan Balaji Srinivasan(Founder of The Network School; Author of "The Network State"; Former CTO of Coinbase; Cofounder of Counsyl), Jeff Hammerbacher Jeff Hammerbacher(Cofounder of Cloudera), and
13 more.

ds4 by antirez

8.4%
12k
Fast local inference for DeepSeek V4 Flash models
Created 3 weeks ago
Updated 1 day ago
Feedback? Help us improve.