cake by evilsocket

Rust framework for distributed LLM/SD inference across diverse hardware

created 1 year ago
2,881 stars

Top 16.9% on sourcepulse

Project Summary

This project provides a distributed inference framework for large language models (LLMs) and Stable Diffusion, targeting consumer hardware across mobile, desktop, and server platforms. It aims to democratize AI by enabling users to leverage heterogeneous clusters of devices, including iOS, Android, macOS, Linux, and Windows, to run models that exceed single-device memory capacities.

How It Works

Cake shards a model's transformer blocks across multiple worker devices, batching contiguous blocks on the same worker to minimize network latency. Distributing the computational and memory load this way lets models that exceed any single device's memory run across a network of devices. The framework is written in Rust on top of the Candle library and supports CPU, CUDA, and Metal acceleration.
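The payoff of contiguous sharding can be illustrated with a small sketch (Python, illustrative only; the function and worker names are not from the cake codebase): when each worker owns a contiguous run of blocks, a forward pass crosses the network once per worker rather than once per block.

```python
# Illustrative sketch of contiguous layer sharding (not cake's actual code).
# Each worker owns a contiguous run of transformer blocks, so a forward
# pass needs one network hop per worker boundary instead of one per block.

def shard_layers(num_layers, workers):
    """Split layer indices 0..num_layers-1 into contiguous runs, one per worker."""
    per_worker, extra = divmod(num_layers, len(workers))
    plan, start = {}, 0
    for i, worker in enumerate(workers):
        count = per_worker + (1 if i < extra else 0)
        plan[worker] = list(range(start, start + count))
        start += count
    return plan

def network_hops(plan, layer_order):
    """Count worker-to-worker transitions when executing layers in order."""
    owner = {layer: w for w, layers in plan.items() for layer in layers}
    return sum(1 for a, b in zip(layer_order, layer_order[1:])
               if owner[a] != owner[b])

plan = shard_layers(32, ["iphone", "macbook", "linux-server"])
print(network_hops(plan, list(range(32))))  # 2 hops for 3 workers
```

With 32 layers scattered non-contiguously across 3 workers the hop count could approach 31; contiguous runs keep it at 2.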

Quick Start & Requirements

  • Installation: Build from source using cargo build --release. Features can be enabled with --features metal or --features cuda. iOS bindings are generated via make ios.
  • Prerequisites: Rust toolchain. CUDA >= 12.2 is required for CUDA acceleration.
  • Running:
    • Worker: cake-cli --model <path> --mode worker --name <name> --topology <file> --address <addr>
    • Master (OpenAI-compatible API): cake-cli --model <path> --api <addr> --topology <file>
  • Resources: Model weights can be split per worker with cake-split-model, so each node only stores the layers it actually serves.
  • Docs: README
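The --topology file referenced above maps worker names to their addresses and the layer ranges they serve. A minimal sketch along the lines of the README's example (hostnames, ports, and ranges here are placeholders; adapt them to your cluster):

```yaml
# topology.yml -- one entry per worker (illustrative values)
linux_server_1:
  host: 'linux_server.host:10001'
  description: 'NVIDIA Titan X Pascal (12GB)'
  layers:
    - 'model.layers.0-5'
iphone:
  host: 'iphone.host:10002'
  description: 'iPhone 15 Pro Max'
  layers:
    - 'model.layers.6-31'
```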

Highlighted Details

  • Supports distributed inference for LLMs (e.g., Llama3) and image models (e.g., Stable Diffusion).
  • Enables sharding of models across heterogeneous hardware, including mobile devices.
  • Provides an OpenAI-compatible REST API for easy integration.
  • Includes a utility (cake-split-model) for optimizing model data per worker.
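Because the master exposes an OpenAI-compatible REST API, any HTTP client can talk to it. A minimal sketch (Python standard library only; the endpoint path and port follow the README's curl example but should be checked against your --api setting):

```python
# Minimal client sketch for cake's OpenAI-compatible master API.
# The /api/v1/chat/completions path and the port are assumptions taken
# from the README's example -- verify them against your deployment.
import json
import urllib.request

def build_chat_payload(messages):
    """Build an OpenAI-style chat completion request body."""
    return {"messages": messages}

def chat(base_url, messages):
    """POST a chat completion request to the master and return the parsed reply."""
    req = urllib.request.Request(
        base_url + "/api/v1/chat/completions",  # assumed path
        data=json.dumps(build_chat_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running cake master):
# reply = chat("http://master-ip:8080",
#              [{"role": "user", "content": "Why is the sky blue?"}])
```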

Maintenance & Community

  • Actively developed and experimental.
  • Community support available via Discord (link in README).

Licensing & Compatibility

  • License: GPL 3.0.
  • Compatibility: GPL 3.0 is a strong copyleft license, potentially restricting integration with closed-source applications.

Limitations & Caveats

The project is explicitly experimental and undergoing rapid development, with potential for bugs and breaking changes. CUDA acceleration on Android is listed as untested.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

gpustack by gpustack — Top 1.6% on sourcepulse, 3k stars

GPU cluster manager for AI model deployment
created 1 year ago, updated 3 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16 — Top 0.4% on sourcepulse, 4k stars

High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago