oMLX by jundot

LLM inference server for Apple Silicon

Created 4 weeks ago

3,678 stars

Top 13.1% on SourcePulse

View on GitHub
Project Summary

An LLM inference server optimized for macOS on Apple Silicon, jundot/omlx provides continuous batching and a novel tiered KV cache (RAM + SSD) for efficient local model execution. It targets developers and power users seeking fine-grained control over LLM deployment, offering a convenient macOS menu bar interface alongside a robust API. The project aims to make local LLMs practical for demanding tasks by preserving and reusing context across requests, even after server restarts.

How It Works

oMLX leverages mlx-lm for inference, implementing continuous batching to handle concurrent requests efficiently. Its core innovation is a tiered KV cache: a hot in-memory tier for frequently accessed data and a cold SSD tier for offloading less active blocks. This block-based cache, inspired by vLLM and featuring Copy-on-Write, allows past context to remain cached and reusable on subsequent requests, significantly reducing recomputation and enabling practical use with large contexts. The server supports multi-model deployment, managing LLMs, VLMs, embeddings, and rerankers with intelligent eviction policies.
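The tiered cache described above can be sketched as a small two-tier block store: hot blocks live in RAM in LRU order, and the least-recently-used blocks spill to disk when the hot tier is full. This is an illustrative toy, not oMLX's implementation — the real cache holds MLX KV tensors, adds Copy-on-Write prefix sharing, and persists across restarts; the class and method names here are invented for the sketch.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredBlockCache:
    """Toy two-tier block cache: hot blocks in RAM, cold blocks on disk.

    Illustrative only -- oMLX's actual cache stores KV tensors and adds
    copy-on-write sharing of prefix blocks.
    """

    def __init__(self, hot_capacity, spill_dir=None):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # block_id -> block, in LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kvcache_")

    def _path(self, block_id):
        return os.path.join(self.spill_dir, f"{block_id}.blk")

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        # Evict least-recently-used hot blocks to the cold (SSD) tier.
        while len(self.hot) > self.hot_capacity:
            victim_id, victim = self.hot.popitem(last=False)
            with open(self._path(victim_id), "wb") as f:
                pickle.dump(victim, f)

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = self._path(block_id)
        if os.path.exists(path):
            # Promote the block from the cold tier back into RAM.
            with open(path, "rb") as f:
                block = pickle.load(f)
            os.remove(path)
            self.put(block_id, block)
            return block
        return None  # true cache miss -> caller must recompute
```

The key property this models is that evicted context is demoted rather than discarded: a follow-up request touching an old prefix reads its blocks back from SSD instead of recomputing the prefill.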

Quick Start & Requirements

  • Primary install: Download the .dmg from Releases and drag it to Applications. Alternatively, use Homebrew: brew tap jundot/omlx && brew install omlx. Source install: git clone ... && cd omlx && pip install -e .
  • Prerequisites: macOS 15.0+ (Sequoia), Python 3.10+, Apple Silicon (M1/M2/M3/M4).
  • Setup: The macOS app installs like any other application. The Homebrew service restarts automatically on crash. CLI: omlx serve --model-dir ~/models.
  • Links: Releases page for .dmg, GitHub repo for source.

Highlighted Details

  • Tiered KV Cache (RAM + SSD) for persistent, reusable context across requests and server restarts.
  • Continuous Batching for efficient handling of concurrent LLM, VLM, embedding, and reranker requests.
  • Native macOS Menu Bar App for seamless server management and monitoring without a terminal.
  • OpenAI and Anthropic API compatibility, serving as a drop-in replacement.
  • Vision-Language Model (VLM) support, including multi-image chat and OCR integration.
  • Admin Dashboard provides real-time monitoring, model management, chat interface, and benchmarking tools.

Maintenance & Community

The project is actively developed, with acknowledgments to inspirations like mlx-lm, vllm-mlx, and venvstacks. No specific community channels (e.g., Discord, Slack) or major corporate sponsorships are detailed in the provided README.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Strictly limited to macOS 15.0+ and Apple Silicon hardware. Memory limits (--max-model-memory, --max-process-memory) are configurable, but careful tuning is crucial to prevent system instability.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
53
Issues (30d)
135
Star History
3,825 stars in the last 28 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

GPTCache by zilliztech

0.1%
8k
Semantic cache for LLM queries, integrated with LangChain and LlamaIndex
Created 3 years ago
Updated 8 months ago
Starred by Taranjeet Singh (cofounder of Mem0), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

1.2%
8k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 20 hours ago