jundot/omlx: LLM inference server for Apple Silicon
Top 13.1% on SourcePulse
An LLM inference server optimized for macOS on Apple Silicon, jundot/omlx provides continuous batching and a novel tiered KV cache (RAM + SSD) for efficient local model execution. It targets developers and power users seeking fine-grained control over LLM deployment, offering a convenient macOS menu bar interface alongside a robust API. The project aims to make local LLMs practical for demanding tasks by preserving and reusing context across requests, even after server restarts.
How It Works
oMLX leverages mlx-lm for inference, implementing continuous batching to handle concurrent requests efficiently. Its core innovation is a tiered KV cache: a hot in-memory tier for frequently accessed data and a cold SSD tier for offloading less active blocks. This block-based cache, inspired by vLLM and featuring Copy-on-Write, allows past context to remain cached and reusable on subsequent requests, significantly reducing recomputation and enabling practical use with large contexts. The server supports multi-model deployment, managing LLMs, VLMs, embeddings, and rerankers with intelligent eviction policies.
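The tiered cache described above can be sketched as a two-level block store: a hot in-memory tier with LRU recency, and a cold SSD tier that absorbs evicted blocks and promotes them back on access. The class below is an illustrative simplification, not oMLX's actual implementation; the block granularity, pickle-based spill format, and eviction policy are all assumptions.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier block cache: hot tier in RAM (LRU order),
    cold tier spilled to SSD. A miss in both tiers means the caller
    must recompute the KV block from scratch."""

    def __init__(self, hot_capacity, cache_dir=None):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # block_id -> KV block (most recent last)
        self.cache_dir = cache_dir or tempfile.mkdtemp()

    def _cold_path(self, block_id):
        return os.path.join(self.cache_dir, f"{block_id}.blk")

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        # Spill least-recently-used blocks to SSD when the RAM tier fills.
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)
            with open(self._cold_path(evicted_id), "wb") as f:
                pickle.dump(evicted, f)

    def get(self, block_id):
        if block_id in self.hot:  # hot hit: refresh recency
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = self._cold_path(block_id)
        if os.path.exists(path):  # cold hit: promote back into RAM
            with open(path, "rb") as f:
                block = pickle.load(f)
            os.remove(path)
            self.put(block_id, block)
            return block
        return None  # miss: recompute needed
```

The real server stores MLX tensors and shares blocks copy-on-write across requests; this sketch only shows the tiering and promotion logic.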
Quick Start & Requirements
- Download the .dmg from Releases and drag it to Applications.
- Alternatively, install with Homebrew: brew tap jundot/omlx && brew install omlx
- From source: git clone ... && cd omlx && pip install -e .
- Run the server: omlx serve --model-dir ~/models
- Links: .dmg from Releases; GitHub repo for source.
Highlighted Details
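Once the server is running, requests go through its HTTP API. The summary above does not give the exact schema, so the sketch below assumes an OpenAI-style chat-completions endpoint on localhost:8000; the path, port, model name, and payload fields are all assumptions to verify against the oMLX documentation.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape; check the oMLX docs for
# the actual API schema, port, and model identifiers.
payload = {
    "model": "my-local-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed default port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```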
Maintenance & Community
The project is actively developed, with acknowledgments to inspirations like mlx-lm, vllm-mlx, and venvstacks. No specific community channels (e.g., Discord, Slack) or major corporate sponsorships are detailed in the provided README.
Licensing & Compatibility
Licensed under the Apache 2.0 license, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
Strictly limited to macOS 15.0+ and Apple Silicon hardware. While configurable, careful management of memory limits (--max-model-memory, --max-process-memory) is crucial to prevent system instability.