cake by evilsocket

Rust framework for distributed LLM/SD inference across diverse hardware

created 1 year ago
2,881 stars

Top 16.9% on sourcepulse

Project Summary

This project provides a distributed inference framework for large language models (LLMs) and Stable Diffusion, targeting consumer hardware across mobile, desktop, and server platforms. It aims to democratize AI by enabling users to leverage heterogeneous clusters of devices, including iOS, Android, macOS, Linux, and Windows, to run models that exceed single-device memory capacities.

How It Works

Cake shards a model's transformer blocks across multiple worker devices, batching contiguous blocks on the same worker to minimize network latency. Distributing the computational and memory load this way lets models that exceed any single device's memory run across a network of devices. The framework is written in Rust on top of the Candle library and supports CPU, CUDA, and Metal acceleration.
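The payoff of contiguous sharding can be illustrated with a small sketch (Python, illustrative only; the function and worker names are not from the cake codebase): when each worker owns a contiguous run of blocks, a forward pass crosses the network once per worker rather than once per block.

```python
# Illustrative sketch of contiguous layer sharding (not cake's actual code).
# Each worker owns a contiguous run of transformer blocks, so a forward
# pass needs one network hop per worker boundary instead of one per block.

def shard_layers(num_layers, workers):
    """Split layer indices 0..num_layers-1 into contiguous runs, one per worker."""
    per_worker, extra = divmod(num_layers, len(workers))
    plan, start = {}, 0
    for i, worker in enumerate(workers):
        count = per_worker + (1 if i < extra else 0)
        plan[worker] = list(range(start, start + count))
        start += count
    return plan

def network_hops(plan, layer_order):
    """Count worker-to-worker transitions when executing layers in order."""
    owner = {layer: w for w, layers in plan.items() for layer in layers}
    return sum(1 for a, b in zip(layer_order, layer_order[1:])
               if owner[a] != owner[b])

plan = shard_layers(32, ["iphone", "macbook", "linux-server"])
print(network_hops(plan, list(range(32))))  # 2 hops for 3 workers
```

With 32 layers scattered non-contiguously across 3 workers the hop count could approach 31; contiguous runs keep it at 2.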

Quick Start & Requirements

  • Installation: Build from source using cargo build --release. Features can be enabled with --features metal or --features cuda. iOS bindings are generated via make ios.
  • Prerequisites: Rust toolchain. CUDA >= 12.2 is required for CUDA acceleration.
  • Running:
    • Worker: cake-cli --model <path> --mode worker --name <name> --topology <file> --address <addr>
    • Master (OpenAI-compatible API): cake-cli --model <path> --api <addr> --topology <file>
  • Resources: Model weights can be split per worker with cake-split-model, so each node only stores the layers it actually serves.
  • Docs: README
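The --topology file referenced above maps worker names to their addresses and the layer ranges they serve. A minimal sketch along the lines of the README's example (hostnames, ports, and ranges here are placeholders; adapt them to your cluster):

```yaml
# topology.yml -- one entry per worker (illustrative values)
linux_server_1:
  host: 'linux_server.host:10001'
  description: 'NVIDIA Titan X Pascal (12GB)'
  layers:
    - 'model.layers.0-5'
iphone:
  host: 'iphone.host:10002'
  description: 'iPhone 15 Pro Max'
  layers:
    - 'model.layers.6-31'
```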

Highlighted Details

  • Supports distributed inference for LLMs (e.g., Llama3) and image models (e.g., Stable Diffusion).
  • Enables sharding of models across heterogeneous hardware, including mobile devices.
  • Provides an OpenAI-compatible REST API for easy integration.
  • Includes a utility (cake-split-model) for optimizing model data per worker.
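Because the master exposes an OpenAI-compatible REST API, any HTTP client can talk to it. A minimal sketch (Python standard library only; the endpoint path and port follow the README's curl example but should be checked against your --api setting):

```python
# Minimal client sketch for cake's OpenAI-compatible master API.
# The /api/v1/chat/completions path and the port are assumptions taken
# from the README's example -- verify them against your deployment.
import json
import urllib.request

def build_chat_payload(messages):
    """Build an OpenAI-style chat completion request body."""
    return {"messages": messages}

def chat(base_url, messages):
    """POST a chat completion request to the master and return the parsed reply."""
    req = urllib.request.Request(
        base_url + "/api/v1/chat/completions",  # assumed path
        data=json.dumps(build_chat_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running cake master):
# reply = chat("http://master-ip:8080",
#              [{"role": "user", "content": "Why is the sky blue?"}])
```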

Maintenance & Community

  • Actively developed and experimental.
  • Community support available via Discord (link in README).

Licensing & Compatibility

  • License: GPL 3.0.
  • Compatibility: GPL 3.0 is a strong copyleft license, potentially restricting integration with closed-source applications.

Limitations & Caveats

The project is explicitly experimental and undergoing rapid development, with potential for bugs and breaking changes. CUDA acceleration on Android is listed as untested.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

gpustack by gpustack — Top 1.6% on sourcepulse, 3k stars

GPU cluster manager for AI model deployment
created 1 year ago, updated 3 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16 — Top 0.4% on sourcepulse, 4k stars

High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago