tiny-llm by skyzh

LLM inference tutorial for systems engineers on Apple Silicon

Created 5 months ago
3,198 stars

Top 15.0% on SourcePulse

View on GitHub
Project Summary

This project provides a tutorial and companion codebase for serving Large Language Models (LLMs) on Apple Silicon with the MLX framework. It targets systems engineers and researchers who want to understand and optimize LLM inference from the ground up by building serving infrastructure on low-level MLX array APIs rather than high-level libraries.

How It Works

The project implements LLM components such as attention mechanisms, rotary position embeddings (RoPE), and normalization layers directly with MLX array operations. Working at this level enables deep dives into optimization techniques, such as quantized matrix multiplication and efficient KV caching, tailored to Apple Silicon's Metal GPU. The goal is to demystify LLM serving by assembling it from fundamental building blocks.
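To give a flavor of what building from fundamental blocks looks like, here is a minimal sketch of scaled dot-product attention written directly against the mlx.core array API. The function name, shapes, and additive-mask convention are illustrative assumptions, not the project's actual code:

    import math
    import mlx.core as mx

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (..., seq_len, head_dim) arrays; mask, if provided, is an
        # additive array broadcastable to (..., seq_len, seq_len).
        scale = 1.0 / math.sqrt(q.shape[-1])
        scores = (q * scale) @ mx.swapaxes(k, -1, -2)
        if mask is not None:
            scores = scores + mask
        return mx.softmax(scores, axis=-1) @ v

The tutorial builds RoPE, multi-head reshaping, and KV caching on top of primitives like this.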

Quick Start & Requirements

  • Install: MLX is the primary dependency; installation instructions are available in the MLX documentation. A quick sanity check follows this list.
  • Prerequisites: macOS, Apple Silicon (M1/M2/M3 series).
  • Resources: The project is designed for local development on macOS, making it accessible without specialized hardware beyond an Apple Silicon Mac.
  • Docs: The accompanying book is available at https://skyzh.github.io/tiny-llm/.
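Once MLX is installed (pip install mlx, per the MLX documentation), a short smoke test confirms that the array API and the Metal device are usable. The outputs noted in the comments are what one would expect on an Apple Silicon Mac, not project-verified values:

    import mlx.core as mx

    # The default device should be the Apple Silicon GPU, e.g. Device(gpu, 0).
    print(mx.default_device())

    a = mx.arange(8)   # arrays are constructed lazily on the default device
    print(a.sum())     # printing forces evaluation; the sum is 28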

Highlighted Details

  • Focuses on building LLM serving infrastructure from scratch using MLX array APIs.
  • Targets optimization techniques specific to Apple Silicon hardware.
  • Covers fundamental LLM components and advanced serving strategies such as continuous batching and speculative decoding (a toy speculative-decoding sketch follows this list).
  • Aims to provide a comprehensive understanding of LLM inference serving.
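To make the speculative-decoding bullet concrete, below is a toy greedy variant: a small draft model proposes k tokens, and the target model verifies them in a single batched forward pass, keeping the longest agreeing prefix. Both logits callables are hypothetical stand-ins, not interfaces from this project:

    import mlx.core as mx

    def speculative_step(target_logits_fn, draft_logits_fn, tokens, k=4):
        # The draft model proposes k greedy continuations, one token at a time.
        draft = list(tokens)
        proposed = []
        for _ in range(k):
            nxt = mx.argmax(draft_logits_fn(mx.array(draft))[-1]).item()
            draft.append(nxt)
            proposed.append(nxt)
        # The target model scores the whole proposal in one pass;
        # logits[i] predicts the token at position i + 1.
        logits = target_logits_fn(mx.array(draft))
        accepted = []
        for i, tok in enumerate(proposed):
            pred = mx.argmax(logits[len(tokens) + i - 1]).item()
            if pred != tok:
                accepted.append(pred)  # keep the target's token at the first mismatch
                break
            accepted.append(tok)
        return tokens + accepted

Production implementations verify under sampling with an acceptance rule and share KV caches between the draft and target models; this sketch shows only the control flow.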

Maintenance & Community

The project is marked as "WIP" (Work In Progress) with a detailed roadmap indicating ongoing development. A Discord server is available for community engagement and study.

Licensing & Compatibility

The project's license is not explicitly stated in the provided README snippet. Compatibility is limited to macOS on Apple Silicon hardware.

Limitations & Caveats

The project is in a very early stage ("WIP") with many components and features still under development, as indicated by the roadmap's "🚧" markers. The codebase relies exclusively on MLX, limiting its applicability to users within the Apple Silicon ecosystem.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 18
  • Issues (30d): 2

Star History

317 stars in the last 30 days

Explore Similar Projects

Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

mlx-lm by ml-explore

26.1%
2k
Python package for LLM text generation and fine-tuning on Apple silicon
Created 6 months ago
Updated 22 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4%
8k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 1 week ago