vllm-metal by vllm-project

LLM inference acceleration for Apple Silicon

Created 1 month ago
322 stars

Top 84.7% on SourcePulse

View on GitHub
Project Summary

Summary

vLLM Metal is a community-maintained hardware plugin designed to enable high-performance Large Language Model (LLM) inference on Apple Silicon Macs. It integrates the popular vLLM inference engine with Apple's Metal GPU via the MLX compute backend, offering developers and researchers a way to achieve faster LLM inference and more efficient memory utilization on macOS.

How It Works

The plugin provides a unified compute backend for vLLM, using MLX to accelerate operations such as attention, normalization, and positional encodings, while relying on PyTorch for model loading and interoperability. A core design principle is exploiting Apple Silicon's unified memory architecture: because the CPU and GPU share the same physical memory, data can move between components without copies, minimizing transfer overhead. The plugin layer wraps its MetalPlatform, MetalWorker, and MetalModelRunner components behind vLLM's standard interfaces, so it integrates cleanly with vLLM's engine, scheduler, and API server.
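
As a rough illustration of the kind of per-token compute MLX takes over, the sketch below implements an RMSNorm-style normalization with MLX array ops. It is not vllm-metal's actual kernel; the shapes and hidden size are made up.

    # Illustrative only: an RMSNorm-style normalization expressed with MLX array
    # ops, the kind of per-token work offloaded to Metal via MLX. This is NOT
    # vllm-metal's actual kernel; shapes and values are invented.
    import mlx.core as mx

    def rms_norm(x: mx.array, weight: mx.array, eps: float = 1e-6) -> mx.array:
        # Normalize each hidden vector by its root-mean-square, then scale.
        variance = mx.mean(x * x, axis=-1, keepdims=True)
        return x * (1.0 / mx.sqrt(variance + eps)) * weight

    hidden = mx.random.normal(shape=(4, 4096))   # 4 tokens, hidden size 4096
    weight = mx.ones((4096,))
    out = rms_norm(hidden, weight)
    mx.eval(out)                                 # force MLX's lazy evaluation
    print(out.shape)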

Quick Start & Requirements

  • Installation: Execute curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash.
  • Prerequisites: Requires macOS running on Apple Silicon hardware. Implicit dependencies include vLLM, MLX, and PyTorch.
  • Configuration: Customization is available via environment variables such as VLLM_METAL_MEMORY_FRACTION for memory allocation, VLLM_METAL_USE_MLX to enable/disable MLX compute, and VLLM_MLX_DEVICE to select the MLX device (GPU or CPU), as shown in the sketch below.
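
A minimal sketch of that configuration in practice, assuming the plugin is picked up automatically by vLLM once installed; the model name and the exact environment-variable values are examples, not documented defaults.

    # Sketch: set the plugin's documented environment variables, then use vLLM's
    # standard offline API. The values and model name are examples only.
    import os

    os.environ["VLLM_METAL_MEMORY_FRACTION"] = "0.8"   # assumed format: fraction of memory
    os.environ["VLLM_METAL_USE_MLX"] = "1"             # assumed format: enable MLX compute
    os.environ["VLLM_MLX_DEVICE"] = "gpu"              # assumed format: "gpu" or "cpu"

    from vllm import LLM, SamplingParams               # import after env vars are set

    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")      # example model
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["What is unified memory?"], params)
    print(outputs[0].outputs[0].text)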

Highlighted Details

  • MLX Acceleration: Delivers faster inference speeds compared to PyTorch's MPS backend on Apple Silicon.
  • Unified Memory: Leverages Apple Silicon's architecture for true zero-copy operations, optimizing memory access patterns.
  • vLLM Compatibility: Provides full integration with vLLM's engine, scheduler, and OpenAI-compatible API (see the client sketch after this list).
  • Paged Attention: Implements efficient KV cache management, crucial for handling long input sequences.
  • GQA Support: Incorporates Grouped-Query Attention for enhanced inference efficiency (see the GQA sketch after this list).
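
For the OpenAI-compatible API, a typical client call looks like the sketch below. It assumes the server was started separately (for example with vllm serve <model>) on vLLM's default port 8000; the model name and prompt are placeholders.

    # Sketch: querying a locally running vLLM OpenAI-compatible server with the
    # official openai client. Port 8000 is vLLM's default; adjust as needed.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",   # must match the model being served
        messages=[{"role": "user", "content": "Explain paged attention in one sentence."}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)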
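
To make the GQA bullet concrete, the sketch below shows what Grouped-Query Attention means: several query heads share one KV head, which shrinks the KV cache. It is written in plain PyTorch for readability and is not the plugin's MLX implementation; all shapes are invented.

    # Illustrative only: Grouped-Query Attention. Each KV head serves a group of
    # query heads, so the KV cache holds 8 heads instead of 32 here.
    import torch
    import torch.nn.functional as F

    batch, seq, head_dim = 1, 128, 64
    n_q_heads, n_kv_heads = 32, 8                 # 4 query heads per KV head

    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)
    v = torch.randn(batch, n_kv_heads, seq, head_dim)

    # Expand each KV head so it is shared by its group of query heads.
    repeat = n_q_heads // n_kv_heads
    k = k.repeat_interleave(repeat, dim=1)        # -> (1, 32, 128, 64)
    v = v.repeat_interleave(repeat, dim=1)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)                              # torch.Size([1, 32, 128, 64])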

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

No licensing information or compatibility notes for commercial use were present in the provided README excerpt.

Limitations & Caveats

The provided README excerpt does not detail specific limitations, known bugs, or alpha/beta status, focusing primarily on features and setup instructions.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 37
  • Issues (30d): 9
  • Star History: 195 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

Top 1.1% on SourcePulse
69k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 1 day ago