Rapid-MLX by raullenchai

Fast local AI engine for Apple Silicon

Created 2 months ago
331 stars

Top 82.8% on SourcePulse

Project Summary

Rapid-MLX: High-Performance Local AI Engine for Apple Silicon

Rapid-MLX is a highly optimized local AI inference engine built specifically for Apple Silicon Macs. It serves as a drop-in replacement for OpenAI's API, letting users run advanced AI models directly on their own hardware without cloud dependencies or API costs. The primary benefit is significantly faster inference and lower latency than other local solutions, making powerful AI practical for development, research, and everyday use.

How It Works

The engine is built upon Apple's MLX framework, leveraging its native Metal compute kernels and unified memory architecture for maximum performance on Apple Silicon. Key innovations include advanced prompt caching techniques, such as KV cache trimming for standard transformers and novel DeltaNet state snapshots for hybrid RNN models, drastically reducing Time To First Token (TTFT). It also features reasoning separation for chain-of-thought models and robust tool calling support with 17 parsers and automatic output recovery, ensuring reliable integration with AI agents and applications.
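
The caching idea can be sketched in a few lines of Python. This is a minimal illustration of KV-cache trimming under a simplified single-slot cache; it is not Rapid-MLX's actual implementation, and the state object below stands in for real per-layer key/value tensors:

    # Illustrative sketch of KV-cache trimming (not Rapid-MLX internals):
    # when a new prompt shares a prefix with a cached one, the cached
    # key/value state is trimmed back to the shared prefix and reused,
    # so only the divergent suffix needs a forward pass.

    def common_prefix_len(a: tuple, b: tuple) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    class PromptCache:
        """One cached prompt per slot; a real engine manages many slots."""
        def __init__(self):
            self.tokens: tuple = ()
            self.state = None  # stand-in for per-layer key/value tensors

        def prefill(self, tokens: tuple):
            reused = common_prefix_len(self.tokens, tokens)
            # Trim the cached state to the shared prefix, then run the
            # model only over tokens[reused:] (model call elided here).
            self.tokens = tokens
            self.state = ("kv-state", len(tokens))  # placeholder
            return reused, len(tokens) - reused

    cache = PromptCache()
    system = tuple(range(1000))               # long shared system prompt
    print(cache.prefill(system + (1, 2)))     # (0, 1002): cold start
    print(cache.prefill(system + (1, 3, 4)))  # (1001, 2): prefix reused

The fewer suffix tokens that must be prefilled, the lower the TTFT, which is why long shared system prompts benefit most from this scheme.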

Quick Start & Requirements

Installation is straightforward via Homebrew (brew install raullenchai/rapid-mlx/rapid-mlx), pip (pip install rapid-mlx), or a convenient one-liner script. A Mac with Apple Silicon is required. Python 3.10+ is recommended for pip installations. Optional dependencies like rapid-mlx[vision] or rapid-mlx[audio] enable multimodal capabilities. Model RAM usage varies significantly, from ~2.4 GB for smaller models to over 30 GB for larger ones.
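
Because the server speaks the OpenAI API, the standard openai Python client should work once pointed at the local endpoint. A minimal sketch, assuming a hypothetical localhost:8000 endpoint and a placeholder model name (the actual host, port, and model identifiers depend on your setup and the README):

    from openai import OpenAI

    # Assumed local endpoint and model name; real values depend on how
    # the server is launched. A local server ignores the API key.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # replace with a model you have installed
        messages=[{"role": "user", "content": "Say hello from Apple Silicon."}],
    )
    print(resp.choices[0].message.content)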

Highlighted Details

  • Speed: Up to 4.2x faster than Ollama on Apple Silicon, with cached TTFT as low as 0.08s.
  • Tool Calling: 100% support across many models, with 17 parsers and automatic recovery for malformed outputs (see the sketch after this list).
  • OpenAI API Compatibility: Integrates seamlessly with any application designed for the OpenAI API.
  • Advanced Features: Includes prompt caching, reasoning separation, multimodal vision/audio support, and smart cloud routing for large context requests.
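
A hedged sketch of tool calling through the OpenAI-compatible endpoint follows. The URL and model name are the same assumptions as in the quick-start example above; the 17 parsers and malformed-output recovery run server-side, so the client only sends a standard tool schema:

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # A standard OpenAI-style tool schema; parsing and recovery of the
    # model's tool output happen server-side in the engine.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="local-model",  # assumed placeholder
        messages=[{"role": "user", "content": "Weather in Cupertino?"}],
        tools=tools,
    )
    for call in resp.choices[0].message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))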

Maintenance & Community

The project is actively developed with a roadmap outlining future performance enhancements. Contributions are welcomed, with guidelines provided in CONTRIBUTING.md. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

Rapid-MLX is distributed under the Apache 2.0 license, permitting commercial use and integration into closed-source applications.

Limitations & Caveats

The engine is exclusively designed for Apple Silicon Macs. Performance is highly dependent on the host machine's RAM, with larger models potentially causing slowdowns. Vision and audio features require separate installations. Warnings regarding "parameters not found" are normal for multimodal models and do not indicate an issue.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 58
  • Issues (30d): 44
  • Star History: 251 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI
AI inference pipeline framework
Top 0.2% on SourcePulse · 4k stars
Created 2 years ago · Updated 2 weeks ago