cider by Mininglamp-AI

LLM inference acceleration for Apple Silicon

Created 2 months ago

333 stars

Top 82.0% on SourcePulse

Summary

Cider accelerates LLM inference on Apple Silicon by using the otherwise underutilized INT8 TensorOps. It provides MLX custom primitives and Metal kernels for W8A8 and W4A8 quantization, giving 1.2–1.9× faster LLM prefill and lower memory usage on macOS.

How It Works

Built on MLX, Cider implements W8A8 and W4A8 quantization as custom primitives backed by Metal kernels. During prefill (M>1, where M is the number of token rows in the activation matrix), it uses Apple's mpp::tensor_ops::matmul2d for INT8×INT8→INT32 matrix multiplication, fused with activation quantization and weight dequantization. Decoding (M=1) uses optimized INT8 matrix-vector kernels. Conditional compilation enables the full C++/Metal build on M5+; on M4 and below, the package falls back to a pure-Python install.
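The quantize, integer-multiply, dequantize flow described above can be sketched in plain Python. This is only a reference illustration, not Cider's kernel: the function names are hypothetical, and the choice of symmetric scaling with one scale per token row and one per output channel is an assumption made for the example. The real implementation fuses these steps inside Metal kernels.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(x, w_t):
    """Approximate x @ w_t^T using INT8 operands and integer accumulation.

    x   : M x K activations (floats), quantized per token row.
    w_t : N x K weights (floats), one row per output channel.
    """
    xq, x_scales = quantize_rows(x)    # activation quantization
    wq, w_scales = quantize_rows(w_t)  # weight quantization (done offline in practice)
    out = []
    for xi, sx in zip(xq, x_scales):
        row = []
        for wj, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(xi, wj))  # integer accumulator (INT32 on hardware)
            row.append(acc * sx * sw)                 # rescale back to float
        out.append(row)
    return out


def float_matmul(x, w_t):
    """Full-precision reference for comparison."""
    return [[sum(a * b for a, b in zip(xi, wj)) for wj in w_t] for xi in x]


x = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.1, -0.7, 0.3]]                       # M=2, K=4
w_t = [[0.2, -0.4, 0.9, 0.1], [-0.3, 0.8, 0.05, -0.6], [1.0, 0.5, -0.5, 0.25]]  # N=3
approx = w8a8_matmul(x, w_t)
exact = float_matmul(x, w_t)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print(f"max abs error vs float: {max_err:.4f}")
```

With M=1 the same arithmetic reduces to a matrix-vector product, which is the case the decode kernels target.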

Quick Start & Requirements

Install: pip install -e . (run from the repository root)
Prerequisites: Apple M5+ required for full INT8 TensorOps acceleration; M4 and below install as pure-Python (is_available() returns False). Python 3.12+, MLX >= 0.31. nanobind and CMake needed for M5+ C++ builds.
Resources: Includes example VLM inference server (vlm_service/) and integration notes for mlx_vlm.
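Because the package installs on older chips but reports is_available() as False, calling code may want to guard the accelerated path. The sketch below shows one way to do that. The module name `cider`, the top-level location of `is_available`, and the backend labels are assumptions for illustration; check the repository for the actual import path.

```python
import importlib


def pick_backend(module_name="cider"):
    """Return a label for the inference path to use.

    Assumes (hypothetically) that the package is importable as `module_name`
    and exposes a top-level is_available() function.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return "mlx-native"  # package not installed at all
    is_available = getattr(mod, "is_available", None)
    if callable(is_available) and is_available():
        return "cider-int8"  # INT8 TensorOps path (M5+ build)
    return "mlx-native"      # pure-Python install on M4 and below


print(pick_backend())
```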

Highlighted Details

Achieves 1.2–1.9× faster LLM prefill on M5+ using W8A8 INT8 TensorOps; benchmarks show speedups over MLX's native FP16/BF16.
Offers a ready-to-use OpenAI-style VLM inference server (vlm_service/) with automatic W8A8 acceleration switching between prefill and decode.
Features experimental ANE+GPU heterogeneous tensor parallelism on M4, splitting GEMM operations between GPU and ANE for potential speedups, though currently limited by synchronization overhead.
Supports W8A8 (per-channel, per-group) and W4A8 quantization, balancing speed, memory, and precision.
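To illustrate the precision trade-off between per-channel and per-group scaling, the sketch below quantizes one weight row both ways and compares the mean round-trip error. It is a minimal example under assumptions: symmetric INT8 scaling, a group size of 4, and hypothetical function names. It does not reflect Cider's actual group sizes or storage layout.

```python
def quantize_group(values):
    """Symmetric INT8 quantization of one group sharing a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def mean_roundtrip_error(row, group_size):
    """Mean absolute error after quantize + dequantize, one scale per group."""
    total = 0.0
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        q, scale = quantize_group(chunk)
        total += sum(abs(orig - qi * scale) for orig, qi in zip(chunk, q))
    return total / len(row)


# One output channel whose first half is tiny and second half is large.
row = [0.01, -0.02, 0.015, 0.005, 8.0, -6.5, 7.25, 3.0]

per_channel_err = mean_roundtrip_error(row, group_size=len(row))  # one scale for the row
per_group_err = mean_roundtrip_error(row, group_size=4)           # one scale per 4 values

print(f"per-channel: {per_channel_err:.5f}  per-group: {per_group_err:.5f}")
```

With a single scale per channel, the large values set the step size and the small ones round to zero. Per-group scales keep them representable, at the cost of storing and applying more scales.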

Maintenance & Community

Developed by Mininglamp Technology's Multimodal Team. Issues should be submitted via GitHub. No explicit community channels are listed.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Full INT8 TensorOps acceleration is exclusive to M5+; M4 and below offer reduced functionality. In isolation, the M=1 per-channel matrix-vector (MV) kernel can be slower than MLX W4A16, and W4A8 incurs INT4→INT8 unpacking overhead. The experimental ANE+GPU path targets M4 and still needs integration with lazy evaluation. VLM quantization must be applied carefully to avoid accuracy loss.
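The INT4→INT8 unpacking cost mentioned above comes from the extra work of splitting each stored byte into two signed values before the INT8 multiply. A minimal sketch of that step follows. The packing order (low nibble first) and two's-complement 4-bit encoding are assumptions for illustration, and the function names are hypothetical; Cider's actual layout may differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4_to_int8(packed):
    """Expand each byte into two sign-extended values in the INT8 range."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4(weights)
restored = unpack_int4_to_int8(packed)
print(len(weights), "values stored in", len(packed), "bytes")
```

The storage is half that of INT8, but every weight passes through the mask, shift, and sign-extend steps before it can feed an INT8 multiply.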

Explore Similar Projects

fp6_llm by usyd-fsalab

calm by zeux

Lvllm by guqiong96

neural-speed by intel

GPTFast by MDK8888

marlin by IST-DASLab

GPTQModel by ModelCloud

llm-awq by mit-han-lab

intel-extension-for-pytorch by intel

ik_llama.cpp by ikawrakow

PowerInfer by Tiiny-AI

airllm by lyogavin