Mininglamp-AI
LLM inference acceleration for Apple Silicon
Summary
Cider accelerates LLM inference on Apple Silicon by leveraging underutilized INT8 TensorOps. It provides MLX custom primitives and Metal kernels for W8A8 and W4A8 quantization, speeding up LLM prefill by 1.2–1.9× and reducing memory usage on macOS.
How It Works
Built on MLX, Cider implements W8A8 and W4A8 quantization via custom primitives and Metal kernels. During prefill (M>1) it uses Apple's mpp::tensor_ops::matmul2d for INT8×INT8→INT32 matrix multiplication, fused with activation quantization and weight dequantization. Decoding (M=1) uses optimized INT8 matrix-vector kernels. Conditional compilation enables full C++ Metal builds on M5+; M4 and below fall back to a pure-Python path.
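The W8A8 arithmetic described above can be sketched in plain Python. This is only an illustration of the numerics, not Cider's kernels: it assumes symmetric per-token activation scales and per-output-channel weight scales, and the function names `quantize_rows` and `w8a8_matmul` are hypothetical. Integer products are accumulated first (INT32 on hardware) and the two scales are applied afterwards, which is the dequantization step.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization, one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp into the INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(x, w):
    """Approximate x @ w.T with INT8 operands.

    x: M x K activations (one row per token).
    w: N x K weights (one row per output channel).
    """
    xq, x_scales = quantize_rows(x)  # per-token activation quantization
    wq, w_scales = quantize_rows(w)  # per-channel weight quantization
    out = []
    for x_row, sx in zip(xq, x_scales):
        out_row = []
        for w_row, sw in zip(wq, w_scales):
            # Integer dot product; a hardware kernel would hold this in INT32.
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # Dequantize by applying both scales to the integer accumulator.
            out_row.append(acc * sx * sw)
        out.append(out_row)
    return out


x = [[1.0, -2.0, 0.5], [0.25, 0.0, 4.0]]   # M=2 tokens, K=3
w = [[0.5, 0.5, -1.0], [2.0, -1.0, 0.0]]   # N=2 output channels, K=3
approx = w8a8_matmul(x, w)
```

For these inputs the exact product is `[[-1.0, 4.0], [-3.875, 0.5]]`, and the quantized result stays close to it. In a fused kernel the quantize, multiply, and rescale steps run in a single pass rather than as separate loops.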
Quick Start & Requirements
pip install -e .
… is_available() returns False).
Python 3.12+, MLX >= 0.31. nanobind and CMake needed for M5+ C++ builds.
… vlm_service/) and integration notes for mlx_vlm.
Highlighted Details
… vlm_service/) with automatic W8A8 acceleration switching between prefill and decode.
Maintenance & Community
Developed by Mininglamp Technology's Multimodal Team. Issues should be submitted via GitHub. No explicit community channels are listed.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
Full INT8 TensorOps acceleration is exclusive to M5+; M4 and below offer reduced functionality. The M=1 per-channel MV kernel can be slower than MLX W4A16 in isolation, and W4A8 incurs INT4→INT8 unpacking overhead. The experimental ANE+GPU path is M4-focused and needs lazy evaluation integration. VLM quantization requires careful application to avoid accuracy loss.
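To make the INT4→INT8 unpacking cost concrete, here is a minimal sketch of what unpacking involves. The packing layout (two signed 4-bit values per byte, low nibble first) is an assumption chosen for illustration; the summary does not describe Cider's actual layout, and `pack_int4` and `unpack_int4` are hypothetical helpers.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of INT4 range")
        # Keep only the low 4 bits of each value (two's complement nibble).
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Expand each byte into two sign-extended integers in INT8 range."""
    def sign_extend(nibble):
        # Nibbles 8..15 encode the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble

    values = []
    for byte in packed:
        values.append(sign_extend(byte & 0xF))  # low nibble
        values.append(sign_extend(byte >> 4))   # high nibble
    return values


weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4(weights)        # 3 bytes instead of 6
restored = unpack_int4(packed)     # round-trips to the original values
```

Every weight byte needs a mask, a shift, and two sign extensions before it can feed an INT8 multiply. W4A8 pays that extra per-element work in exchange for halving weight storage relative to W8A8.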