vllm-mlx by waybarrios

Run LLMs and multimodal models on Apple Silicon via OpenAI API

Created 2 months ago
419 stars

Top 70.2% on SourcePulse

View on GitHub
Project Summary

This project provides a high-throughput, OpenAI-compatible inference server for Large Language Models (LLMs) and Vision-Language Models (VLMs) specifically optimized for Apple Silicon. It targets developers and researchers seeking to leverage their Mac hardware for accelerated AI tasks, offering multimodal capabilities (text, image, video, audio) and efficient memory management through native MLX integration.

How It Works

vLLM-MLX integrates Apple's MLX framework, including mlx-lm for LLM inference, mlx-vlm for multimodal processing, and mlx-audio for speech tasks, into a vLLM-like serving architecture. This approach utilizes MLX's unified memory and Metal kernels for native GPU acceleration on Apple Silicon. Key optimizations include paged KV cache for memory efficiency and continuous batching to maximize throughput for concurrent user requests.
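The paged KV cache idea can be sketched in a few lines: a per-sequence block table maps logical token positions to fixed-size physical pages, so memory is allocated as generation proceeds rather than reserved up front for the maximum context length. The following is an illustrative toy, not vllm-mlx's actual implementation; names and sizes are assumptions.

```python
# Toy sketch of a paged KV cache (illustrative only, not vllm-mlx's code).
# Each sequence acquires pages of PAGE_SIZE token slots on demand, so memory
# grows with the actual generation length, not the maximum context length.

PAGE_SIZE = 16  # token slots per page (arbitrary for this sketch)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # pool of physical page ids
        self.block_tables = {}                    # seq_id -> list of page ids
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % PAGE_SIZE == 0:               # current page full (or none yet)
            table.append(self.free_pages.pop())   # allocate a fresh page
        self.lengths[seq_id] = length + 1
        return table[length // PAGE_SIZE], length % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the pool for reuse."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                  # 20 tokens need ceil(20/16) = 2 pages
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))    # 2
```

Continuous batching builds on the same pool: finished sequences free their pages immediately, so new requests can join the running batch without waiting for the whole batch to drain.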

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/waybarrios/vllm-mlx.git), navigate into the directory (cd vllm-mlx), and install using pip (pip install -e .).
  • Prerequisites: An Apple Silicon Mac (M1, M2, M3, or M4). For audio features, install the extra dependencies: pip install "vllm-mlx[audio]", python -m spacy download en_core_web_sm, and brew install espeak-ng.
  • Links: Documentation is available in the docs directory.

Highlighted Details

  • Multimodal Support: Processes text, image, video, and audio inputs and outputs within a unified platform.
  • OpenAI API Compatibility: Functions as a drop-in replacement for OpenAI clients, supporting Model Context Protocol (MCP) tool calling.
  • Performance: Benchmarks report up to 464 tok/s for a single model (e.g., Llama-3.2-1B-4bit) and 1112 tok/s with continuous batching (e.g., Qwen3-0.6B-8bit). Speech-to-text (STT) is low-latency: whisper-tiny reaches a 197x Real-Time Factor (RTF), i.e., roughly 10 seconds of audio transcribed in about 0.05 s.
  • Gemma 3 & Long Context: Includes patches for Gemma 3 vision support and supports extended context lengths (up to ~50K tokens) via environment variables such as GEMMA3_SLIDING_WINDOW; this currently requires manually modifying code in mlx-vlm.
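Because the server speaks the OpenAI chat-completions schema, any OpenAI client can target it by overriding the base URL. Below is a minimal sketch of a multimodal request; the host, port, endpoint path, and model id are placeholders assumed for illustration, not values documented in the README.

```python
# Sketch of an OpenAI-compatible multimodal chat request. The server URL and
# model id are placeholders (assumptions), not documented vllm-mlx values.
import json
import urllib.request

payload = {
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",  # placeholder model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# Build (but do not send) a standard chat-completions POST request.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed local server URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer not-needed-locally"},
    method="POST",
)
print(req.get_method())  # POST
```

With the official openai Python package, the same call is typically made by constructing the client with a local base_url (e.g., OpenAI(base_url="http://localhost:8000/v1", api_key="anything")) and using client.chat.completions.create(...) unchanged.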

Maintenance & Community

The README does not detail specific community channels (e.g., Discord, Slack), notable contributors beyond the primary author (Wayner Barrios), sponsorships, or a public roadmap. Contributions are welcomed via pull requests.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The Apache 2.0 license generally permits commercial use and integration with closed-source projects, subject to the terms outlined in the LICENSE file.

Limitations & Caveats

  • Platform Specific: Strictly limited to Apple Silicon hardware.
  • Long Context Patching: Achieving maximum context lengths for models like Gemma 3 requires manual code patching and environment variable configuration, indicating this feature may be experimental or require advanced setup.
  • Audio Dependencies: Full audio functionality necessitates additional system-level installations (e.g., espeak-ng) and specific Python package extras.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 53
  • Issues (30d): 32
  • Star History: 218 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Yaowei Zheng (Author of LLaMA-Factory).

  • ZhiLight by zhihu (0%, 906 stars): LLM inference engine for Llama and variants, optimized for PCIe GPUs. Created 1 year ago; updated 1 day ago. Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Elvis Saravia (Founder of DAIR.AI), and 2 more.
  • vllm-omni by vllm-project (1.6%, 3k stars): Omni-modality model inference and serving framework. Created 5 months ago; updated 21 hours ago. Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.
  • ktransformers by kvcache-ai (0.2%, 17k stars): Framework for LLM inference optimization experimentation. Created 1 year ago; updated 1 day ago.