vmlx by jjang-ai

Local AI engine for Apple Silicon

Created 2 months ago
414 stars

Top 70.4% on SourcePulse

Project Summary

vMLX is a local AI engine for Apple Silicon Macs that runs LLMs, VLMs, and image-generation models entirely on-device. It exposes an API compatible with OpenAI, Anthropic, and Ollama clients, keeping data private and eliminating cloud dependencies. The result is a high-performance local inference stack for developers and power users.
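
To make the compatibility claim concrete, the official OpenAI Python SDK should be able to talk to a running vMLX server unchanged. A minimal sketch, assuming a server started as in the Quick Start below and listening at http://localhost:8080/v1; the actual host, port, and API-key handling are assumptions to verify against the output of vmlx serve:

    # Minimal sketch: pointing the official OpenAI Python SDK at a local
    # vMLX server. base_url and the placeholder api_key are assumptions;
    # check the `vmlx serve` output for the real endpoint.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",  # assumed local endpoint
        api_key="not-needed-locally",         # placeholder for a local server
    )

    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-8B-4bit",  # model named in the Quick Start
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)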

How It Works

Built on Apple's MLX framework, vMLX optimizes inference for Metal GPUs. Its key innovation is "JANG" adaptive mixed-precision quantization, which achieves better accuracy at lower bitwidths than standard MLX quantization. A five-layer caching architecture (continuous batching, prefix cache, paged KV cache, KV cache quantization, and disk cache) sharply reduces latency. For models that exceed a single machine's capacity, vMLX supports pipeline parallelism across multiple Macs.
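
The summary does not describe how JANG assigns bitwidths, so the following is only a generic illustration of the adaptive mixed-precision idea: spend a fixed bit budget unevenly, giving more bits to weight groups that are harder to compress. The variance proxy, the 2-/4-bit split, and all names here are hypothetical, not vMLX's implementation:

    # Generic sketch of adaptive mixed-precision quantization (illustrative
    # only; this is NOT JANG's actual algorithm).
    import numpy as np

    def quantize_group(w, bits):
        # Uniform affine quantization of one group, returned dequantized.
        levels = 2 ** bits - 1
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / levels or 1.0
        return np.round((w - lo) / scale) * scale + lo

    def adaptive_quantize(weights, group_size=64, avg_bits=2.5):
        # Give the most "sensitive" groups 4 bits and the rest 2 bits, so
        # the average bitwidth lands near avg_bits (2 + 2 * hi_fraction).
        assert weights.size % group_size == 0
        groups = weights.reshape(-1, group_size)
        sensitivity = groups.var(axis=1)              # crude importance proxy
        n_hi = int(len(groups) * (avg_bits - 2) / 2)  # groups promoted to 4 bits
        hi_set = set(np.argsort(sensitivity)[-n_hi:].tolist()) if n_hi else set()
        out = np.empty_like(groups)
        for i, g in enumerate(groups):
            out[i] = quantize_group(g, 4 if i in hi_set else 2)
        return out.reshape(weights.shape)

The point is the mechanism: the average bitwidth can sit between integer levels while the accuracy-critical weights keep more precision.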

Quick Start & Requirements

Install with pip install vmlx or uv tool install vmlx; on macOS 14+, prefer uv, pipx, or a virtual environment over a bare pip install. Serve a model with vmlx serve mlx-community/Qwen3-8B-4bit. Apple Silicon hardware is required. MLX Studio, a native macOS GUI app, is also available.
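
Because the server also speaks the Anthropic protocol, the same endpoint can be exercised with the Anthropic Python SDK. As above, the base URL and dummy API key are assumptions, not documented values:

    # Minimal sketch: the same local server via the Anthropic Python SDK.
    from anthropic import Anthropic

    client = Anthropic(
        base_url="http://localhost:8080",  # assumed; check `vmlx serve` output
        api_key="not-needed-locally",      # placeholder for a local server
    )

    msg = client.messages.create(
        model="mlx-community/Qwen3-8B-4bit",
        max_tokens=128,
        messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
    )
    print(msg.content[0].text)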

Highlighted Details

  • JANG Quantization: Advanced adaptive mixed-precision (e.g., JANG_2L) offers state-of-the-art compression and accuracy.
  • Advanced Caching: Features continuous batching, prefix cache, paged KV cache, KV cache quantization (q4/q8), and disk caching for rapid responses (see the conceptual prefix-cache sketch after this list).
  • Distributed Inference: Enables pipeline parallelism across multiple Macs for scaling model capacity.
  • API Compatibility: Seamlessly integrates with OpenAI, Anthropic, and Ollama SDKs/CLIs.
  • MLX Studio: Native macOS application with chat, model management, image generation, and developer tools.
  • Smelt Mode: Allows running extremely large MoE models by loading subsets from SSD, reducing RAM needs.
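
As a conceptual illustration of the prefix cache referenced above (not vMLX's implementation), the core idea is to key stored KV states by token-prefix hashes so that a new request resumes after its longest already-cached prefix instead of recomputing it:

    # Illustrative prefix cache; all names here are hypothetical.
    import hashlib

    class PrefixCache:
        def __init__(self):
            self._store = {}  # prefix hash -> opaque KV state

        @staticmethod
        def _key(tokens):
            return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

        def put(self, tokens, kv_state):
            self._store[self._key(tokens)] = kv_state

        def longest_prefix(self, tokens):
            # Return (matched_length, kv_state). Linear scan for clarity;
            # production engines use a radix tree over fixed-size token blocks.
            for end in range(len(tokens), 0, -1):
                state = self._store.get(self._key(tokens[:end]))
                if state is not None:
                    return end, state
            return 0, None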

Maintenance & Community

Developed by Jinho Jang (JANGQ AI). No dedicated community channels are listed, but the project appears actively maintained.

Licensing & Compatibility

Distributed under the permissive Apache License 2.0, permitting commercial use and integration into closed-source applications.

Limitations & Caveats

vMLX runs only on Apple Silicon Macs. Smelt mode is mutually exclusive with VLM capabilities and requires JANG-formatted MoE models. Some features, such as Qwen Image Edit, have high RAM requirements (~54 GB). Bare pip installations can run into issues on macOS 14+.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 19
  • Issues (30d): 64
  • Star History: 337 stars in the last 30 days
