cider  by Mininglamp-AI

LLM inference acceleration for Apple Silicon

Created 3 weeks ago

341 stars

Top 80.9% on SourcePulse

Summary

Cider accelerates LLM inference on Apple Silicon by using the hardware's underutilized INT8 TensorOps. It provides MLX custom primitives and Metal kernels for W8A8 and W4A8 quantization, giving macOS users 1.2–1.9× faster LLM prefill and lower memory usage.

How It Works

Built on MLX, Cider implements W8A8 and W4A8 quantization via custom primitives and Metal kernels. During prefill (M>1, where M is the number of activation rows processed at once), it uses Apple's mpp::tensor_ops::matmul2d for INT8×INT8→INT32 matrix multiplication, fused with activation quantization and weight dequantization. Decoding (M=1) uses optimized INT8 matrix-vector kernels. Conditional compilation enables the full C++ Metal build on M5+; on M4 and below, the package falls back to a pure-Python install.
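
The arithmetic behind the W8A8 path described above can be sketched in pure Python. This is an illustrative reference only, not Cider's code: the function names are hypothetical, and the choice of symmetric quantization with one scale per activation row and one per weight output channel is an assumption. The real kernels run fused on the GPU.

```python
# Pure-Python reference for W8A8 arithmetic (illustrative; names are hypothetical).

def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp into the INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(x, w):
    """Approximate x @ w.T, with x as [M][K] activations and w as [N][K] weights."""
    xq, x_scales = quantize_rows(x)  # assumed: one scale per activation row
    wq, w_scales = quantize_rows(w)  # assumed: one scale per output channel
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j, w_row in enumerate(wq):
            # Integer multiply-accumulate (INT8 x INT8 -> INT32 on hardware).
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # Dequantize the accumulator with both scales.
            out_row.append(acc * x_scales[i] * w_scales[j])
        out.append(out_row)
    return out
```

Calling w8a8_matmul(x, w) on small float matrices returns values close to the full-precision product, with the difference reflecting the INT8 rounding error.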

Quick Start & Requirements

  • Install: pip install -e . (run from a clone of the repository)
  • Prerequisites: Python 3.12+ and MLX >= 0.31. Full INT8 TensorOps acceleration requires Apple M5+, plus nanobind and CMake for the C++ build. On M4 and below, the package installs as pure Python and is_available() returns False.
  • Resources: Includes example VLM inference server (vlm_service/) and integration notes for mlx_vlm.
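
A caller might guard the accelerated path with a check like the following sketch. The summary mentions is_available() but not where it lives, so the module name cider and the top-level location of the function are assumptions here; the helper name is hypothetical.

```python
import importlib


def cider_int8_available():
    """Return True only if an importable package reports INT8 TensorOps support."""
    try:
        # Assumption: the package imports under the name "cider".
        cider = importlib.import_module("cider")
    except ImportError:
        return False
    # Assumption: is_available() is exposed at the package top level.
    is_available = getattr(cider, "is_available", None)
    return bool(is_available()) if callable(is_available) else False
```

On M4 and below, or when the package is not installed, this returns False and the caller can stay on MLX's standard path.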

Highlighted Details

  • Achieves 1.2–1.9× faster LLM prefill on M5+ using W8A8 INT8 TensorOps, benchmarked against MLX's native FP16/BF16.
  • Offers a ready-to-use OpenAI-style VLM inference server (vlm_service/) with automatic W8A8 acceleration switching between prefill and decode.
  • Includes experimental ANE+GPU heterogeneous tensor parallelism on M4, which splits GEMM operations between the GPU and the Apple Neural Engine (ANE); potential speedups are currently limited by synchronization overhead.
  • Supports W8A8 (per-channel, per-group) and W4A8 quantization, balancing speed, memory, and precision.
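
The difference between per-channel and per-group scaling mentioned above can be illustrated with a small sketch. The function names and the symmetric max-abs scaling are assumptions made for illustration, not Cider's API. Treating the whole row as one group corresponds to per-channel quantization.

```python
def quantize_per_group(row, group_size):
    """Symmetric INT8 quantization with one scale per `group_size` values."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in group)
    return q, scales


def dequantize_per_group(q, scales, group_size):
    """Map INT8 values back to floats using each value's group scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

With a row that mixes very small and very large weights, a single scale for the whole row rounds the small values to zero, while smaller groups keep them at the cost of storing more scales.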

Maintenance & Community

Developed by Mininglamp Technology's Multimodal Team. Issues should be submitted via GitHub. No explicit community channels are listed.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Full INT8 TensorOps acceleration is exclusive to M5+; M4 and below get the pure-Python install with reduced functionality. The M=1 per-channel matrix-vector (MV) kernel can be slower than MLX W4A16 in isolation, and W4A8 incurs INT4→INT8 unpacking overhead. The experimental ANE+GPU path is M4-focused and still needs lazy evaluation integration. VLM quantization must be applied carefully to avoid accuracy loss.
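
The INT4→INT8 unpacking step behind the W4A8 overhead can be sketched as follows. The packing layout (two signed 4-bit values per byte, low nibble first) and the function names are assumptions for illustration; the summary does not describe Cider's actual storage format.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in signed 4 bits")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4_to_int8(packed):
    """Expand each byte into two sign-extended integers in the INT8 range."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out
```

Every weight has to pass through this extra expansion before the INT8 multiply, which is the cost that W8A8 avoids in exchange for twice the weight storage.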

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 350 stars in the last 27 days
