vllm-metal by vllm-project

LLM inference acceleration for Apple Silicon

Created 3 months ago
628 stars

Top 52.7% on SourcePulse

Project Summary

vLLM Metal is a community-maintained hardware plugin designed to enable high-performance Large Language Model (LLM) inference on Apple Silicon Macs. It integrates the popular vLLM inference engine with Apple's Metal GPU via the MLX compute backend, offering developers and researchers a way to achieve faster LLM inference and more efficient memory utilization on macOS.

How It Works

This plugin provides a unified compute backend for vLLM, using MLX to accelerate computations such as attention, normalization, and positional encodings, while relying on PyTorch for model loading and interoperability. A core design principle is exploiting Apple Silicon's unified memory architecture for true zero-copy operations, minimizing data-transfer overhead. The plugin layer exposes MetalPlatform, MetalWorker, and MetalModelRunner components that integrate with vLLM's engine, scheduler, and API server.
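vLLM discovers hardware backends through a plugin mechanism. A minimal sketch of how such a plugin might advertise the MetalPlatform class on supported hardware; the entry-point function name and the module path are assumptions for illustration, not taken from the repository:

```python
import platform
import sys


def register() -> "str | None":
    """Hypothetical plugin entry point.

    Returns the dotted path of the platform class when running on an
    Apple Silicon Mac, or None so vLLM falls back to other backends.
    (The module path below is illustrative, not the plugin's real layout.)
    """
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "vllm_metal.platform.MetalPlatform"
    return None
```

Returning None on unsupported hosts lets the same package install anywhere while only activating the Metal backend on Apple Silicon macOS.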

Quick Start & Requirements

  • Installation: run curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
  • Prerequisites: Requires macOS running on Apple Silicon hardware. Implicit dependencies include vLLM, MLX, and PyTorch.
  • Configuration: Customization is available via environment variables such as VLLM_METAL_MEMORY_FRACTION for memory allocation, VLLM_METAL_USE_MLX to enable/disable MLX compute, and VLLM_MLX_DEVICE to select the MLX device (GPU or CPU).
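The three variables above can be set in the environment before vLLM is launched. A minimal sketch (the values are illustrative examples, not tuned recommendations):

```python
import os

# Set the plugin's documented knobs before vLLM starts.
# The values below are examples, not tuned recommendations.
os.environ["VLLM_METAL_MEMORY_FRACTION"] = "0.8"  # share of unified memory to use
os.environ["VLLM_METAL_USE_MLX"] = "1"            # enable the MLX compute path
os.environ["VLLM_MLX_DEVICE"] = "gpu"             # run MLX on the GPU (or "cpu")

print(os.environ["VLLM_METAL_MEMORY_FRACTION"])  # 0.8
```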

Highlighted Details

  • MLX Acceleration: Delivers faster inference than PyTorch's MPS backend on Apple Silicon.
  • Unified Memory: Leverages Apple Silicon's architecture for true zero-copy operations, optimizing memory access patterns.
  • vLLM Compatibility: Provides full integration with vLLM's engine, scheduler, and OpenAI-compatible API.
  • Paged Attention: Implements efficient KV cache management, crucial for handling long input sequences.
  • GQA Support: Incorporates Grouped-Query Attention for enhanced inference efficiency.
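To illustrate the GQA point: in Grouped-Query Attention, several query heads share one key/value head, which shrinks the KV cache. A minimal NumPy sketch of the head-sharing arithmetic (not the plugin's actual MLX kernels):

```python
import numpy as np


def expand_kv_for_gqa(kv: np.ndarray, num_q_heads: int) -> np.ndarray:
    """Repeat each KV head across its group of query heads.

    kv has shape (num_kv_heads, seq_len, head_dim); the result has shape
    (num_q_heads, seq_len, head_dim) so it pairs head-for-head with queries.
    """
    num_kv_heads = kv.shape[0]
    assert num_q_heads % num_kv_heads == 0, "q heads must be a multiple of kv heads"
    group_size = num_q_heads // num_kv_heads
    return np.repeat(kv, group_size, axis=0)


# 8 query heads sharing 2 KV heads: each KV head serves 4 query heads,
# so only a quarter of the KV cache of full multi-head attention is stored.
k = np.arange(2 * 4 * 8, dtype=np.float32).reshape(2, 4, 8)
print(expand_kv_for_gqa(k, num_q_heads=8).shape)  # (8, 4, 8)
```

Production kernels avoid materializing the repeated tensor and instead index the shared KV heads directly; the expansion here just makes the grouping explicit.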

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

No licensing information or compatibility notes for commercial use were present in the provided README excerpt.

Limitations & Caveats

The provided README excerpt does not detail specific limitations, known bugs, or alpha/beta status, focusing primarily on features and setup instructions.

Health Check
Last Commit

20 hours ago

Responsiveness

Inactive

Pull Requests (30d)
51
Issues (30d)
12
Star History
199 stars in the last 30 days

