SwiftLM by SharpAI

Blazingly fast LLM inference for Apple Silicon

Created 3 months ago

711 stars

Top 47.4% on SourcePulse

View on GitHub

1 Expert Loves This Project

Luis Capelo

Cofounder of Lightning AI

Project Summary

Summary

SwiftLM is a high-performance, native Swift inference server for Apple Silicon, designed to run MLX models directly without a Python runtime. It offers an OpenAI-compatible API, targeting developers and power users seeking maximum efficiency and bare-metal speed on macOS and iOS. The project eliminates Python overhead and GIL contention, providing a streamlined path to LLM deployment on Apple hardware.

How It Works

This project leverages Swift, Apple's MLX framework, and Metal for native GPU acceleration, compiling into a single, efficient binary. It bypasses Python and its associated performance bottlenecks. Key architectural choices include TurboQuantization for aggressive KV cache compression (achieving ~3.6 bits per coordinate) and experimental SSD Expert Streaming, which offloads Mixture of Experts (MoE) layers directly from NVMe SSDs to the GPU command buffer. This approach mitigates macOS unified memory limitations, enabling larger models on constrained hardware.

Quick Start & Requirements

Installation: Download pre-built binaries from the project's Releases page or build from source using the ./build.sh script.
Running: Execute the binary with --model <model_id> and --port <port>, e.g., ./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413. Models download automatically if not cached.
Prerequisites: macOS 14.0+, Apple Silicon (M1/M2/M3/M4/M5), Xcode Command Line Tools, Metal Toolchain (xcodebuild -downloadComponent MetalToolchain).
Setup: Building involves submodule initialization, CMake, Metal kernel compilation, and Swift compilation.

Highlighted Details

100% Native Apple Silicon: Built entirely with Swift and Metal, optimized for Apple hardware.
OpenAI-Compatible API: Provides endpoints like /v1/chat/completions for seamless integration with existing OpenAI SDKs.
TurboQuantization: Implements a hybrid V2+V3 architecture for on-the-fly KV cache compression, achieving near-zero accuracy loss at ~3.5x compression.
SSD Expert Streaming: Experimental feature for zero-copy streaming of MoE layers from NVMe SSD, preventing Watchdog kernel panics on very large models.
SwiftBuddy iOS App: A companion application for on-device LLM inference via MLX Swift on iPhones and iPads.

Maintenance & Community

The project relies heavily on the MLX community and various open-source projects for its foundation. Specific community channels or direct contributor information are not detailed in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

The SSD Expert Streaming feature is experimental. Aggressive quantization (e.g., 2-bit) can lead to model instability and break features like OpenAI-compatible tool calling due to JSON grammar corruption.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

25 stars in the last 30 days