vllm-swift  by TheTom

Native Swift/Metal LLM inference for Apple Silicon

Created 1 month ago
261 stars

Top 97.1% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

This project offers a native Swift and Metal backend for vLLM, specifically targeting high-performance LLM inference on Apple Silicon. By removing Python from the critical inference path, it provides an OpenAI-compatible API and delivers significant speedups, particularly for short-context decoding, benefiting developers and power users seeking efficient on-device LLM execution.

How It Works

The core architecture leverages Swift and Metal for the entire forward pass, with Python primarily used for orchestration tasks like API handling, tokenization, and scheduling. A C bridge facilitates communication between the Python vLLM API and the Swift/Metal engine, enabling batched decoding, attention mechanisms, and other LLM operations to run directly on Apple Silicon GPUs. This native approach minimizes overhead and maximizes performance.

Quick Start & Requirements

  • Primary install:
    • Homebrew: brew tap TheTom/tap && brew install vllm-swift
    • Pip: pip install vllm-swift (includes prebuilt Swift bridge and Metal kernel)
  • Requirements: macOS 11+ (14+ recommended for source builds), Apple Silicon, Python 3.10+.
  • Running: Download models with vllm-swift download <model> and serve with vllm-swift serve <model_path> --max-model-len <length>. The server runs at http://localhost:8000.
  • Links: Documentation available at docs/PERFORMANCE.md, docs/MODEL_COMPATIBILITY.md, docs/TROUBLESHOOTING.md, and CHANGELOG.md.

Highlighted Details

  • Achieves up to 2.6x faster short-context decode throughput compared to Python/MLX-based vllm-metal on comparable hardware.
  • Provides an OpenAI-compatible API supporting streaming responses, chat templates, and tool/function calling.
  • Features TurboQuant+ KV cache compression (e.g., turbo4v2) for significantly increased context length with modest throughput impact.
  • Includes experimental support for TriAttention V3 (query-aware KV-cache eviction) and longctx (code-aware retrieval) for handling extremely long contexts, particularly in multi-turn chat scenarios.
  • Supports chain-of-thought reasoning parsing and experimental Vision-Language Model (VLM) capabilities.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, permitting commercial use and integration into closed-source applications. It is designed exclusively for macOS on Apple Silicon hardware.

Limitations & Caveats

LoRA fine-tuning is not supported due to limitations in the Swift engine. Chunked prefill is disabled, with the Swift engine handling full sequences. top_p sampling is unavailable in the batched decode path, though temperature sampling is functional. Only Qwen3 models fully utilize the batched decode path; other architectures may fall back to slower sequential decoding at high concurrency. The TriAttention V3 cache is FP16-only and cannot currently be stacked with TurboQuant codecs. longctx-svc is an alpha companion service, and TriAttention V3's auto rehydration is limited to multi-turn chat completions.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
7
Issues (30d)
14
Star History
109 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.