vmlx  by jjang-ai

Local AI engine for Apple Silicon

Created 3 months ago
641 stars

Top 51.4% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

vMLX is a local AI engine for Apple Silicon Macs, enabling users to run LLMs, VLMs, and image generation models entirely on-device. It provides an OpenAI, Anthropic, and Ollama-compatible API, ensuring privacy and eliminating cloud dependencies. This offers a high-performance, local inference solution for developers and power users.

How It Works

Built on Apple's MLX framework, vMLX optimizes inference for Metal GPUs. Its key innovation is "JANG" adaptive mixed-precision quantization, achieving superior accuracy at lower bitwidths compared to standard MLX quantization. A sophisticated 5-layer caching architecture (continuous batching, prefix cache, paged KV cache, disk cache) drastically reduces latency. For models exceeding single-machine capacity, vMLX supports pipeline parallelism across multiple Macs.

Quick Start & Requirements

Install via pip install vmlx or uv tool install vmlx. Serve a model with vmlx serve mlx-community/Qwen3-8B-4bit. Requires Apple Silicon hardware. macOS 14+ users should use uv, pipx, or a virtual environment for installation. MLX Studio, a native macOS GUI app, is also available.

Highlighted Details

  • JANG Quantization: Advanced adaptive mixed-precision (e.g., JANG_2L) offers state-of-the-art compression and accuracy.
  • Advanced Caching: Features continuous batching, prefix cache, paged KV cache, KV cache quantization (q4/q8), and disk caching for rapid responses.
  • Distributed Inference: Enables pipeline parallelism across multiple Macs for scaling model capacity.
  • API Compatibility: Seamlessly integrates with OpenAI, Anthropic, and Ollama SDKs/CLIs.
  • MLX Studio: Native macOS application with chat, model management, image generation, and developer tools.
  • Smelt Mode: Allows running extremely large MoE models by loading subsets from SSD, reducing RAM needs.

Maintenance & Community

Developed by Jinho Jang (JANGQ AI). While specific community channels are not detailed, the project appears actively maintained.

Licensing & Compatibility

Distributed under the permissive Apache License 2.0, permitting commercial use and integration into closed-source applications.

Limitations & Caveats

Primarily optimized for Apple Silicon Macs. Smelt mode is mutually exclusive with VLM capabilities and requires JANG-formatted MoE models. Some features, like Qwen Image Edit, have high RAM requirements (~54 GB). Bare pip installations may face issues on macOS 14+.

Health Check
Last Commit

15 hours ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
21
Star History
149 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.