oMLX by jundot

LLM inference server for Apple Silicon

Created 4 weeks ago

3,678 stars

Top 13.1% on SourcePulse

View on GitHub
Project Summary

An LLM inference server optimized for macOS on Apple Silicon, jundot/omlx provides continuous batching and a novel tiered KV cache (RAM + SSD) for efficient local model execution. It targets developers and power users seeking fine-grained control over LLM deployment, offering a convenient macOS menu bar interface alongside a robust API. The project aims to make local LLMs practical for demanding tasks by preserving and reusing context across requests, even after server restarts.

How It Works

oMLX leverages mlx-lm for inference, implementing continuous batching to handle concurrent requests efficiently. Its core innovation is a tiered KV cache: a hot in-memory tier for frequently accessed data and a cold SSD tier for offloading less active blocks. This block-based cache, inspired by vLLM and featuring Copy-on-Write, allows past context to remain cached and reusable on subsequent requests, significantly reducing recomputation and enabling practical use with large contexts. The server supports multi-model deployment, managing LLMs, VLMs, embeddings, and rerankers with intelligent eviction policies.
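The tiered cache described above can be sketched as a small two-tier block store: hot blocks live in RAM in LRU order, and the least-recently-used blocks spill to disk when the hot tier is full. This is an illustrative toy, not oMLX's implementation — the real cache holds MLX KV tensors, adds Copy-on-Write prefix sharing, and persists across restarts; the class and method names here are invented for the sketch.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredBlockCache:
    """Toy two-tier block cache: hot blocks in RAM, cold blocks on disk.

    Illustrative only -- oMLX's actual cache stores KV tensors and adds
    copy-on-write sharing of prefix blocks.
    """

    def __init__(self, hot_capacity, spill_dir=None):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # block_id -> block, in LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kvcache_")

    def _path(self, block_id):
        return os.path.join(self.spill_dir, f"{block_id}.blk")

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        # Evict least-recently-used hot blocks to the cold (SSD) tier.
        while len(self.hot) > self.hot_capacity:
            victim_id, victim = self.hot.popitem(last=False)
            with open(self._path(victim_id), "wb") as f:
                pickle.dump(victim, f)

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = self._path(block_id)
        if os.path.exists(path):
            # Promote the block from the cold tier back into RAM.
            with open(path, "rb") as f:
                block = pickle.load(f)
            os.remove(path)
            self.put(block_id, block)
            return block
        return None  # true cache miss -> caller must recompute
```

The key property this models is that evicted context is demoted rather than discarded: a follow-up request touching an old prefix reads its blocks back from SSD instead of recomputing the prefill.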

Quick Start & Requirements

  • Primary install: Download the .dmg from Releases and drag it to Applications. Alternatively, use Homebrew: brew tap jundot/omlx && brew install omlx. Source install: git clone ... && cd omlx && pip install -e .
  • Prerequisites: macOS 15.0+ (Sequoia), Python 3.10+, Apple Silicon (M1/M2/M3/M4).
  • Setup: The macOS app installs like any other application. The Homebrew service restarts automatically on crash. CLI: omlx serve --model-dir ~/models.
  • Links: Releases page for .dmg, GitHub repo for source.

Highlighted Details

  • Tiered KV Cache (RAM + SSD) for persistent, reusable context across requests and server restarts.
  • Continuous Batching for efficient handling of concurrent LLM, VLM, embedding, and reranker requests.
  • Native macOS Menu Bar App for seamless server management and monitoring without a terminal.
  • OpenAI and Anthropic API compatibility, serving as a drop-in replacement.
  • Vision-Language Model (VLM) support, including multi-image chat and OCR integration.
  • Admin Dashboard provides real-time monitoring, model management, chat interface, and benchmarking tools.

Maintenance & Community

The project is actively developed, with acknowledgments to inspirations like mlx-lm, vllm-mlx, and venvstacks. No specific community channels (e.g., Discord, Slack) or major corporate sponsorships are detailed in the provided README.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Strictly limited to macOS 15.0+ and Apple Silicon hardware. Memory limits (--max-model-memory, --max-process-memory) are configurable, but careful tuning is crucial to prevent system instability.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
53
Issues (30d)
135
Star History
3,825 stars in the last 28 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

GPTCache by zilliztech

0.1%
8k
Semantic cache for LLM queries, integrated with LangChain and LlamaIndex
Created 3 years ago
Updated 8 months ago
Starred by Taranjeet Singh (cofounder of Mem0), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

1.2%
8k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 20 hours ago