mlx-openai-server by cubist38

OpenAI-compatible API server for local MLX model inference

Created 11 months ago
257 stars

Top 98.2% on SourcePulse

Project Summary

This project provides a high-performance, OpenAI-compatible API server for MLX models, enabling developers to run text, vision, audio, and image generation models locally on Apple Silicon hardware. It serves as a drop-in replacement for OpenAI services, offering local control, enhanced privacy, and optimized performance for MLX-based AI workloads. The target audience includes engineers and researchers who need to integrate local ML models into existing applications or experiment with advanced AI capabilities without relying on external cloud APIs.

How It Works

The server uses Python and the FastAPI framework to expose OpenAI-compatible endpoints. A core architectural decision for multi-model deployment is to spawn each model handler in a separate subprocess via multiprocessing.get_context("spawn"). This isolates each MLX Metal/GPU context and avoids the semaphore leaks that fork-based process creation can cause on macOS, ensuring stability and efficient resource management. Requests are proxied between the main FastAPI process and these dedicated child handler processes.
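The per-model subprocess design can be sketched roughly as follows. This is an illustrative outline, not the project's actual code; the names handler_main and start_handler are hypothetical, and real handlers would load an MLX model instead of echoing.

```python
import multiprocessing

def handler_main(model_path: str, requests, responses) -> None:
    """Child-process loop: a real handler would load an MLX model here,
    then serve inference requests. This version just echoes for illustration."""
    for prompt in iter(requests.get, None):  # None is the shutdown sentinel
        responses.put(f"[{model_path}] completion for: {prompt}")

def start_handler(model_path: str):
    """Spawn a dedicated subprocess so the MLX Metal/GPU context stays
    isolated from the parent (avoiding fork-related semaphore leaks on macOS)."""
    ctx = multiprocessing.get_context("spawn")
    requests, responses = ctx.Queue(), ctx.Queue()
    proc = ctx.Process(target=handler_main, args=(model_path, requests, responses))
    proc.start()
    return proc, requests, responses

if __name__ == "__main__":
    proc, requests, responses = start_handler("mlx-community/example-model")
    requests.put("Hello")             # proxy a request to the child
    print(responses.get(timeout=30))  # read the completion back
    requests.put(None)                # ask the child to exit
    proc.join()
```

Because "spawn" starts a fresh interpreter rather than copying the parent's state, each child initializes its own Metal context from scratch, which is what keeps the GPU state of concurrently served models from interfering.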

Quick Start & Requirements

  • Prerequisites: macOS with an M-series chip, Python 3.11+. ffmpeg is required for audio transcription (brew install ffmpeg).
  • Installation: Install via pip: pip install mlx-openai-server.
  • Quick Start: Launch a model with mlx-openai-server launch --model-path <path> --model-type <type>. For example, mlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm.
  • Documentation: Comprehensive examples are available in the examples/ directory.
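Once launched, the server can be called like any OpenAI endpoint. A minimal stdlib sketch, assuming the server is reachable at http://localhost:8000 (substitute whatever host and port you launched with); the helper names are hypothetical:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build a standard OpenAI chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST to the OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Assumes the quick-start server above is running locally.
    print(chat("http://localhost:8000",
               "mlx-community/Qwen3-Coder-Next-4bit",
               "Hello!"))
```

An existing OpenAI client SDK should work the same way: point its base URL at the local server and use the model identifier you launched with.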

Highlighted Details

  • OpenAI-Compatible API: Seamless integration with existing OpenAI client libraries.
  • Multimodal Support: Handles text, vision, audio transcription, and image generation/editing tasks.
  • Performance Optimizations: Supports configurable quantization (4/8/16-bit), speculative decoding for LLMs, and LoRA adapters for image models.
  • Multi-Model Serving: Run multiple models concurrently via a YAML configuration file, routing requests by model_id.
  • Request Queue System: Built-in system for managing and monitoring incoming requests.
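The request-queue idea can be illustrated with a minimal asyncio sketch. This is a conceptual stand-in, not the project's implementation; the names handle and serve are hypothetical, and the real server tracks far more state per request.

```python
import asyncio

async def handle(request_id: int) -> str:
    # Stand-in for model inference; a real handler would call into MLX.
    await asyncio.sleep(0)
    return f"done-{request_id}"

async def serve(requests, max_concurrent: int = 2):
    """Admit queued requests while capping how many run at once,
    so a burst of calls cannot oversubscribe the GPU."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(rid):
        async with sem:
            return await handle(rid)

    return await asyncio.gather(*(guarded(r) for r in requests))

results = asyncio.run(serve(range(4)))
print(results)  # ['done-0', 'done-1', 'done-2', 'done-3']
```

Bounding concurrency with a semaphore (rather than rejecting excess requests) is what lets clients queue up transparently behind a single local accelerator.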

Maintenance & Community

Contributions are welcomed via pull requests following Conventional Commits. Support and discussions are primarily handled through GitHub Issues and Discussions. The project is built upon the MLX framework and related MLX libraries.

Licensing & Compatibility

The project is released under the MIT License, permitting broad use, including commercial applications. It is designed for compatibility with standard OpenAI client SDKs.

Limitations & Caveats

This project is strictly limited to macOS with M-series Apple Silicon chips. Users may encounter memory issues with large models, which can be mitigated through quantization or reduced context lengths. Metal/semaphore warnings, a known issue with MLX on macOS, are addressed by the multi-handler process isolation.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 25
  • Issues (30d): 12
  • Star history: 41 stars in the last 30 days
