vllm-metal by vllm-project

LLM inference acceleration for Apple Silicon

Created 3 months ago
628 stars

Top 52.7% on SourcePulse

Project Summary

vLLM Metal is a community-maintained hardware plugin designed to enable high-performance Large Language Model (LLM) inference on Apple Silicon Macs. It integrates the popular vLLM inference engine with Apple's Metal GPU via the MLX compute backend, offering developers and researchers a way to achieve faster LLM inference and more efficient memory utilization on macOS.

How It Works

This plugin provides a unified compute backend for vLLM, using MLX to accelerate computations such as attention, normalization, and positional encodings, while relying on PyTorch for model loading and interoperability. A core design principle is exploiting Apple Silicon's unified memory architecture for true zero-copy operations, minimizing data-transfer overhead. The plugin layer exposes MetalPlatform, MetalWorker, and MetalModelRunner components that integrate with vLLM's engine, scheduler, and API server.
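vLLM discovers hardware backends through a plugin mechanism. A minimal sketch of how such a plugin might advertise the MetalPlatform class on supported hardware; the entry-point function name and the module path are assumptions for illustration, not taken from the repository:

```python
import platform
import sys


def register() -> "str | None":
    """Hypothetical plugin entry point.

    Returns the dotted path of the platform class when running on an
    Apple Silicon Mac, or None so vLLM falls back to other backends.
    (The module path below is illustrative, not the plugin's real layout.)
    """
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "vllm_metal.platform.MetalPlatform"
    return None
```

Returning None on unsupported hosts lets the same package install anywhere while only activating the Metal backend on Apple Silicon macOS.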

Quick Start & Requirements

  • Installation: run curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
  • Prerequisites: Requires macOS running on Apple Silicon hardware. Implicit dependencies include vLLM, MLX, and PyTorch.
  • Configuration: Customization is available via environment variables such as VLLM_METAL_MEMORY_FRACTION for memory allocation, VLLM_METAL_USE_MLX to enable/disable MLX compute, and VLLM_MLX_DEVICE to select the MLX device (GPU or CPU).
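The three variables above can be set in the environment before vLLM is launched. A minimal sketch (the values are illustrative examples, not tuned recommendations):

```python
import os

# Set the plugin's documented knobs before vLLM starts.
# The values below are examples, not tuned recommendations.
os.environ["VLLM_METAL_MEMORY_FRACTION"] = "0.8"  # share of unified memory to use
os.environ["VLLM_METAL_USE_MLX"] = "1"            # enable the MLX compute path
os.environ["VLLM_MLX_DEVICE"] = "gpu"             # run MLX on the GPU (or "cpu")

print(os.environ["VLLM_METAL_MEMORY_FRACTION"])  # 0.8
```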

Highlighted Details

  • MLX Acceleration: Delivers faster inference than PyTorch's MPS backend on Apple Silicon.
  • Unified Memory: Leverages Apple Silicon's architecture for true zero-copy operations, optimizing memory access patterns.
  • vLLM Compatibility: Provides full integration with vLLM's engine, scheduler, and OpenAI-compatible API.
  • Paged Attention: Implements efficient KV cache management, crucial for handling long input sequences.
  • GQA Support: Incorporates Grouped-Query Attention for enhanced inference efficiency.
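To illustrate the GQA point: in Grouped-Query Attention, several query heads share one key/value head, which shrinks the KV cache. A minimal NumPy sketch of the head-sharing arithmetic (not the plugin's actual MLX kernels):

```python
import numpy as np


def expand_kv_for_gqa(kv: np.ndarray, num_q_heads: int) -> np.ndarray:
    """Repeat each KV head across its group of query heads.

    kv has shape (num_kv_heads, seq_len, head_dim); the result has shape
    (num_q_heads, seq_len, head_dim) so it pairs head-for-head with queries.
    """
    num_kv_heads = kv.shape[0]
    assert num_q_heads % num_kv_heads == 0, "q heads must be a multiple of kv heads"
    group_size = num_q_heads // num_kv_heads
    return np.repeat(kv, group_size, axis=0)


# 8 query heads sharing 2 KV heads: each KV head serves 4 query heads,
# so only a quarter of the KV cache of full multi-head attention is stored.
k = np.arange(2 * 4 * 8, dtype=np.float32).reshape(2, 4, 8)
print(expand_kv_for_gqa(k, num_q_heads=8).shape)  # (8, 4, 8)
```

Production kernels avoid materializing the repeated tensor and instead index the shared KV heads directly; the expansion here just makes the grouping explicit.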

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

No licensing information or compatibility notes for commercial use were present in the provided README excerpt.

Limitations & Caveats

The provided README excerpt does not detail specific limitations, known bugs, or alpha/beta status, focusing primarily on features and setup instructions.

Health Check
Last Commit

20 hours ago

Responsiveness

Inactive

Pull Requests (30d)
51
Issues (30d)
12
Star History
199 stars in the last 30 days

