vllm-mlx by waybarrios

Run LLMs and multimodal models on Apple Silicon via OpenAI API

Created 2 months ago
419 stars

Top 70.2% on SourcePulse

View on GitHub
Project Summary

This project provides a high-throughput, OpenAI-compatible inference server for Large Language Models (LLMs) and Vision-Language Models (VLMs) specifically optimized for Apple Silicon. It targets developers and researchers seeking to leverage their Mac hardware for accelerated AI tasks, offering multimodal capabilities (text, image, video, audio) and efficient memory management through native MLX integration.

How It Works

vLLM-MLX integrates Apple's MLX framework, including mlx-lm for LLM inference, mlx-vlm for multimodal processing, and mlx-audio for speech tasks, into a vLLM-like serving architecture. This approach utilizes MLX's unified memory and Metal kernels for native GPU acceleration on Apple Silicon. Key optimizations include paged KV cache for memory efficiency and continuous batching to maximize throughput for concurrent user requests.
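The paged KV cache idea can be sketched in a few lines: a per-sequence block table maps logical token positions to fixed-size physical pages, so memory is allocated as generation proceeds rather than reserved up front for the maximum context length. The following is an illustrative toy, not vllm-mlx's actual implementation; names and sizes are assumptions.

```python
# Toy sketch of a paged KV cache (illustrative only, not vllm-mlx's code).
# Each sequence acquires pages of PAGE_SIZE token slots on demand, so memory
# grows with the actual generation length, not the maximum context length.

PAGE_SIZE = 16  # token slots per page (arbitrary for this sketch)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # pool of physical page ids
        self.block_tables = {}                    # seq_id -> list of page ids
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % PAGE_SIZE == 0:               # current page full (or none yet)
            table.append(self.free_pages.pop())   # allocate a fresh page
        self.lengths[seq_id] = length + 1
        return table[length // PAGE_SIZE], length % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the pool for reuse."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                  # 20 tokens need ceil(20/16) = 2 pages
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))    # 2
```

Continuous batching builds on the same pool: finished sequences free their pages immediately, so new requests can join the running batch without waiting for the whole batch to drain.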

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/waybarrios/vllm-mlx.git), navigate into the directory (cd vllm-mlx), and install using pip (pip install -e .).
  • Prerequisites: An Apple Silicon Mac (M1, M2, M3, or M4). For audio features, install the extra dependencies: pip install "vllm-mlx[audio]", python -m spacy download en_core_web_sm, and brew install espeak-ng.
  • Links: Documentation is available in the docs directory.

Highlighted Details

  • Multimodal Support: Processes text, image, video, and audio inputs and outputs within a unified platform.
  • OpenAI API Compatibility: Functions as a drop-in replacement for OpenAI clients, supporting Model Context Protocol (MCP) tool calling.
  • Performance: Benchmarks report up to 464 tok/s for a single model (e.g., Llama-3.2-1B-4bit) and 1112 tok/s with continuous batching (e.g., Qwen3-0.6B-8bit). Speech-to-text (STT) is low-latency: whisper-tiny reaches a 197x Real-Time Factor (RTF), i.e., roughly 10 seconds of audio transcribed in about 0.05 s.
  • Gemma 3 & Long Context: Includes patches for Gemma 3 vision support and supports extended context lengths (up to ~50K tokens) via environment variables such as GEMMA3_SLIDING_WINDOW; this currently requires manually modifying code in mlx-vlm.
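Because the server speaks the OpenAI chat-completions schema, any OpenAI client can target it by overriding the base URL. Below is a minimal sketch of a multimodal request; the host, port, endpoint path, and model id are placeholders assumed for illustration, not values documented in the README.

```python
# Sketch of an OpenAI-compatible multimodal chat request. The server URL and
# model id are placeholders (assumptions), not documented vllm-mlx values.
import json
import urllib.request

payload = {
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",  # placeholder model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# Build (but do not send) a standard chat-completions POST request.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed local server URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer not-needed-locally"},
    method="POST",
)
print(req.get_method())  # POST
```

With the official openai Python package, the same call is typically made by constructing the client with a local base_url (e.g., OpenAI(base_url="http://localhost:8000/v1", api_key="anything")) and using client.chat.completions.create(...) unchanged.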

Maintenance & Community

The README does not detail specific community channels (e.g., Discord, Slack), notable contributors beyond the primary author (Wayner Barrios), sponsorships, or a public roadmap. Contributions are welcomed via pull requests.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The Apache 2.0 license generally permits commercial use and integration with closed-source projects, subject to the terms outlined in the LICENSE file.

Limitations & Caveats

  • Platform Specific: Strictly limited to Apple Silicon hardware.
  • Long Context Patching: Achieving maximum context lengths for models like Gemma 3 requires manual code patching and environment variable configuration, indicating this feature may be experimental or require advanced setup.
  • Audio Dependencies: Full audio functionality necessitates additional system-level installations (e.g., espeak-ng) and specific Python package extras.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 53
  • Issues (30d): 32
  • Star History: 218 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Yaowei Zheng (Author of LLaMA-Factory).

  • ZhiLight by zhihu (0%, 906 stars): LLM inference engine for Llama and variants, optimized for PCIe GPUs. Created 1 year ago; updated 1 day ago. Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Elvis Saravia (Founder of DAIR.AI), and 2 more.
  • vllm-omni by vllm-project (1.6%, 3k stars): Omni-modality model inference and serving framework. Created 5 months ago; updated 21 hours ago. Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.
  • ktransformers by kvcache-ai (0.2%, 17k stars): Framework for LLM inference optimization experimentation. Created 1 year ago; updated 1 day ago.