Discover and explore top open-source AI tools and projects—updated daily.
TheTomNative Swift/Metal LLM inference for Apple Silicon
Top 97.1% on SourcePulse
Summary
This project offers a native Swift and Metal backend for vLLM, specifically targeting high-performance LLM inference on Apple Silicon. By removing Python from the critical inference path, it provides an OpenAI-compatible API and delivers significant speedups, particularly for short-context decoding, benefiting developers and power users seeking efficient on-device LLM execution.
How It Works
The core architecture leverages Swift and Metal for the entire forward pass, with Python primarily used for orchestration tasks like API handling, tokenization, and scheduling. A C bridge facilitates communication between the Python vLLM API and the Swift/Metal engine, enabling batched decoding, attention mechanisms, and other LLM operations to run directly on Apple Silicon GPUs. This native approach minimizes overhead and maximizes performance.
Quick Start & Requirements
brew tap TheTom/tap && brew install vllm-swiftpip install vllm-swift (includes prebuilt Swift bridge and Metal kernel)vllm-swift download <model> and serve with vllm-swift serve <model_path> --max-model-len <length>. The server runs at http://localhost:8000.docs/PERFORMANCE.md, docs/MODEL_COMPATIBILITY.md, docs/TROUBLESHOOTING.md, and CHANGELOG.md.Highlighted Details
turbo4v2) for significantly increased context length with modest throughput impact.longctx (code-aware retrieval) for handling extremely long contexts, particularly in multi-turn chat scenarios.Licensing & Compatibility
The project is licensed under the Apache-2.0 license, permitting commercial use and integration into closed-source applications. It is designed exclusively for macOS on Apple Silicon hardware.
Limitations & Caveats
LoRA fine-tuning is not supported due to limitations in the Swift engine. Chunked prefill is disabled, with the Swift engine handling full sequences. top_p sampling is unavailable in the batched decode path, though temperature sampling is functional. Only Qwen3 models fully utilize the batched decode path; other architectures may fall back to slower sequential decoding at high concurrency. The TriAttention V3 cache is FP16-only and cannot currently be stacked with TurboQuant codecs. longctx-svc is an alpha companion service, and TriAttention V3's auto rehydration is limited to multi-turn chat completions.
2 days ago
Inactive