vmlx by jjang-ai

Local AI engine for Apple Silicon

Created 2 months ago
414 stars

Top 70.4% on SourcePulse

Project Summary

vMLX is a local AI engine for Apple Silicon Macs that runs LLMs, VLMs, and image-generation models entirely on-device. It exposes an API compatible with OpenAI, Anthropic, and Ollama clients, keeping data private and eliminating cloud dependencies. The result is a high-performance local inference stack for developers and power users.
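
To make the compatibility claim concrete, the official OpenAI Python SDK should be able to talk to a running vMLX server unchanged. A minimal sketch, assuming a server started as in the Quick Start below and listening at http://localhost:8080/v1; the actual host, port, and API-key handling are assumptions to verify against the output of vmlx serve:

    # Minimal sketch: pointing the official OpenAI Python SDK at a local
    # vMLX server. base_url and the placeholder api_key are assumptions;
    # check the `vmlx serve` output for the real endpoint.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",  # assumed local endpoint
        api_key="not-needed-locally",         # placeholder for a local server
    )

    resp = client.chat.completions.create(
        model="mlx-community/Qwen3-8B-4bit",  # model named in the Quick Start
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)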

How It Works

Built on Apple's MLX framework, vMLX optimizes inference for Metal GPUs. Its key innovation is "JANG" adaptive mixed-precision quantization, which achieves better accuracy at lower bitwidths than standard MLX quantization. A five-layer caching architecture (continuous batching, prefix cache, paged KV cache, KV cache quantization, and disk cache) sharply reduces latency. For models that exceed a single machine's capacity, vMLX supports pipeline parallelism across multiple Macs.
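
The summary does not describe how JANG assigns bitwidths, so the following is only a generic illustration of the adaptive mixed-precision idea: spend a fixed bit budget unevenly, giving more bits to weight groups that are harder to compress. The variance proxy, the 2-/4-bit split, and all names here are hypothetical, not vMLX's implementation:

    # Generic sketch of adaptive mixed-precision quantization (illustrative
    # only; this is NOT JANG's actual algorithm).
    import numpy as np

    def quantize_group(w, bits):
        # Uniform affine quantization of one group, returned dequantized.
        levels = 2 ** bits - 1
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / levels or 1.0
        return np.round((w - lo) / scale) * scale + lo

    def adaptive_quantize(weights, group_size=64, avg_bits=2.5):
        # Give the most "sensitive" groups 4 bits and the rest 2 bits, so
        # the average bitwidth lands near avg_bits (2 + 2 * hi_fraction).
        assert weights.size % group_size == 0
        groups = weights.reshape(-1, group_size)
        sensitivity = groups.var(axis=1)              # crude importance proxy
        n_hi = int(len(groups) * (avg_bits - 2) / 2)  # groups promoted to 4 bits
        hi_set = set(np.argsort(sensitivity)[-n_hi:].tolist()) if n_hi else set()
        out = np.empty_like(groups)
        for i, g in enumerate(groups):
            out[i] = quantize_group(g, 4 if i in hi_set else 2)
        return out.reshape(weights.shape)

The point is the mechanism: the average bitwidth can sit between integer levels while the accuracy-critical weights keep more precision.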

Quick Start & Requirements

Install with pip install vmlx or uv tool install vmlx; on macOS 14+, prefer uv, pipx, or a virtual environment over a bare pip install. Serve a model with vmlx serve mlx-community/Qwen3-8B-4bit. Apple Silicon hardware is required. MLX Studio, a native macOS GUI app, is also available.
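
Because the server also speaks the Anthropic protocol, the same endpoint can be exercised with the Anthropic Python SDK. As above, the base URL and dummy API key are assumptions, not documented values:

    # Minimal sketch: the same local server via the Anthropic Python SDK.
    from anthropic import Anthropic

    client = Anthropic(
        base_url="http://localhost:8080",  # assumed; check `vmlx serve` output
        api_key="not-needed-locally",      # placeholder for a local server
    )

    msg = client.messages.create(
        model="mlx-community/Qwen3-8B-4bit",
        max_tokens=128,
        messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
    )
    print(msg.content[0].text)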

Highlighted Details

  • JANG Quantization: Advanced adaptive mixed-precision (e.g., JANG_2L) offers state-of-the-art compression and accuracy.
  • Advanced Caching: Features continuous batching, prefix cache, paged KV cache, KV cache quantization (q4/q8), and disk caching for rapid responses (see the conceptual prefix-cache sketch after this list).
  • Distributed Inference: Enables pipeline parallelism across multiple Macs for scaling model capacity.
  • API Compatibility: Seamlessly integrates with OpenAI, Anthropic, and Ollama SDKs/CLIs.
  • MLX Studio: Native macOS application with chat, model management, image generation, and developer tools.
  • Smelt Mode: Allows running extremely large MoE models by loading subsets from SSD, reducing RAM needs.
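
As a conceptual illustration of the prefix cache referenced above (not vMLX's implementation), the core idea is to key stored KV states by token-prefix hashes so that a new request resumes after its longest already-cached prefix instead of recomputing it:

    # Illustrative prefix cache; all names here are hypothetical.
    import hashlib

    class PrefixCache:
        def __init__(self):
            self._store = {}  # prefix hash -> opaque KV state

        @staticmethod
        def _key(tokens):
            return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

        def put(self, tokens, kv_state):
            self._store[self._key(tokens)] = kv_state

        def longest_prefix(self, tokens):
            # Return (matched_length, kv_state). Linear scan for clarity;
            # production engines use a radix tree over fixed-size token blocks.
            for end in range(len(tokens), 0, -1):
                state = self._store.get(self._key(tokens[:end]))
                if state is not None:
                    return end, state
            return 0, None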

Maintenance & Community

Developed by Jinho Jang (JANGQ AI). No dedicated community channels are listed, but the project appears actively maintained.

Licensing & Compatibility

Distributed under the permissive Apache License 2.0, permitting commercial use and integration into closed-source applications.

Limitations & Caveats

vMLX runs only on Apple Silicon Macs. Smelt mode is mutually exclusive with VLM capabilities and requires JANG-formatted MoE models. Some features, such as Qwen Image Edit, have high RAM requirements (~54 GB). Bare pip installations can run into issues on macOS 14+.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 19
  • Issues (30d): 64
  • Star History: 337 stars in the last 30 days
