Rapid-MLX by raullenchai

Fast local AI engine for Apple Silicon

Created 2 months ago
331 stars

Top 82.8% on SourcePulse

Project Summary

Rapid-MLX: High-Performance Local AI Engine for Apple Silicon

Rapid-MLX is a highly optimized local AI inference engine built specifically for Apple Silicon Macs. It serves as a drop-in replacement for OpenAI's API, letting users run advanced AI models directly on their own hardware without cloud dependencies or API costs. The primary benefit is significantly faster inference and lower latency than other local solutions, making powerful AI practical for development, research, and everyday use.

How It Works

The engine is built upon Apple's MLX framework, leveraging its native Metal compute kernels and unified memory architecture for maximum performance on Apple Silicon. Key innovations include advanced prompt caching techniques, such as KV cache trimming for standard transformers and novel DeltaNet state snapshots for hybrid RNN models, drastically reducing Time To First Token (TTFT). It also features reasoning separation for chain-of-thought models and robust tool calling support with 17 parsers and automatic output recovery, ensuring reliable integration with AI agents and applications.
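
The caching idea can be sketched in a few lines of Python. This is a minimal illustration of KV-cache trimming under a simplified single-slot cache; it is not Rapid-MLX's actual implementation, and the state object below stands in for real per-layer key/value tensors:

    # Illustrative sketch of KV-cache trimming (not Rapid-MLX internals):
    # when a new prompt shares a prefix with a cached one, the cached
    # key/value state is trimmed back to the shared prefix and reused,
    # so only the divergent suffix needs a forward pass.

    def common_prefix_len(a: tuple, b: tuple) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    class PromptCache:
        """One cached prompt per slot; a real engine manages many slots."""
        def __init__(self):
            self.tokens: tuple = ()
            self.state = None  # stand-in for per-layer key/value tensors

        def prefill(self, tokens: tuple):
            reused = common_prefix_len(self.tokens, tokens)
            # Trim the cached state to the shared prefix, then run the
            # model only over tokens[reused:] (model call elided here).
            self.tokens = tokens
            self.state = ("kv-state", len(tokens))  # placeholder
            return reused, len(tokens) - reused

    cache = PromptCache()
    system = tuple(range(1000))               # long shared system prompt
    print(cache.prefill(system + (1, 2)))     # (0, 1002): cold start
    print(cache.prefill(system + (1, 3, 4)))  # (1001, 2): prefix reused

The fewer suffix tokens that must be prefilled, the lower the TTFT, which is why long shared system prompts benefit most from this scheme.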

Quick Start & Requirements

Installation is straightforward via Homebrew (brew install raullenchai/rapid-mlx/rapid-mlx), pip (pip install rapid-mlx), or a convenient one-liner script. A Mac with Apple Silicon is required. Python 3.10+ is recommended for pip installations. Optional dependencies like rapid-mlx[vision] or rapid-mlx[audio] enable multimodal capabilities. Model RAM usage varies significantly, from ~2.4 GB for smaller models to over 30 GB for larger ones.
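
Because the server speaks the OpenAI API, the standard openai Python client should work once pointed at the local endpoint. A minimal sketch, assuming a hypothetical localhost:8000 endpoint and a placeholder model name (the actual host, port, and model identifiers depend on your setup and the README):

    from openai import OpenAI

    # Assumed local endpoint and model name; real values depend on how
    # the server is launched. A local server ignores the API key.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # replace with a model you have installed
        messages=[{"role": "user", "content": "Say hello from Apple Silicon."}],
    )
    print(resp.choices[0].message.content)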

Highlighted Details

  • Speed: Up to 4.2x faster than Ollama on Apple Silicon, with cached TTFT as low as 0.08s.
  • Tool Calling: 100% support across many models, with 17 parsers and automatic recovery for malformed outputs (see the sketch after this list).
  • OpenAI API Compatibility: Integrates seamlessly with any application designed for the OpenAI API.
  • Advanced Features: Includes prompt caching, reasoning separation, multimodal vision/audio support, and smart cloud routing for large context requests.
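
A hedged sketch of tool calling through the OpenAI-compatible endpoint follows. The URL and model name are the same assumptions as in the quick-start example above; the 17 parsers and malformed-output recovery run server-side, so the client only sends a standard tool schema:

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # A standard OpenAI-style tool schema; parsing and recovery of the
    # model's tool output happen server-side in the engine.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="local-model",  # assumed placeholder
        messages=[{"role": "user", "content": "Weather in Cupertino?"}],
        tools=tools,
    )
    for call in resp.choices[0].message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))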

Maintenance & Community

The project is actively developed with a roadmap outlining future performance enhancements. Contributions are welcomed, with guidelines provided in CONTRIBUTING.md. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

Rapid-MLX is distributed under the Apache 2.0 license, permitting commercial use and integration into closed-source applications.

Limitations & Caveats

The engine is exclusively designed for Apple Silicon Macs. Performance is highly dependent on the host machine's RAM, with larger models potentially causing slowdowns. Vision and audio features require separate installations. Warnings regarding "parameters not found" are normal for multimodal models and do not indicate an issue.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 58
  • Issues (30d): 44
  • Star History: 251 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI
AI inference pipeline framework
Top 0.2% on SourcePulse · 4k stars
Created 2 years ago · Updated 2 weeks ago