distributed-llama by b4rtaz

CLI tool for distributed LLM inference across networked devices

created 1 year ago
2,232 stars

Top 20.7% on sourcepulse

View on GitHub
Project Summary

This project enables distributed LLM inference by clustering home devices, targeting users who want to pool multiple machines to speed up inference. It leverages tensor parallelism and high-speed synchronization over Ethernet, and supports Linux, macOS, and Windows with optimizations for ARM and x86_64 AVX2 CPUs.

How It Works

The architecture splits functionality between a root node and worker nodes. The root node loads the model and weights, coordinates synchronization, and processes its own slice of the neural network; worker nodes independently process their assigned slices. This design spreads RAM usage across all participating devices, with the root node requiring slightly more memory than the workers.
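As a concrete sketch of this split, a launch on a four-node cluster looks roughly like the following (IP addresses, port, and model/tokenizer file names are placeholders; flag names follow the project's README but may differ between versions, so check ./dllama --help):

    # On each of the three worker machines: start a worker that waits for the root
    ./dllama worker --port 9998 --nthreads 4

    # On the root machine: load the model, keep one slice locally, and
    # distribute the remaining slices to the listed workers over the network
    ./dllama inference \
      --model dllama_model_llama3_1_8b_q40.m \
      --tokenizer dllama_tokenizer_llama3_1.t \
      --buffer-float-type q80 \
      --prompt "Hello world" --steps 32 --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

One root plus three workers makes four nodes, satisfying the 2^n node-count requirement noted below.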

Quick Start & Requirements

  • Install: Clone the repository and compile with make dllama and make dllama-api (see the build sketch after this list).
  • Prerequisites: Python 3, C++ compiler (GCC/MinGW), Git.
  • Hardware: x86_64 CPUs with AVX2, or ARM CPUs. Supports only 2^n nodes (1, 2, 4, 8, …).
  • Models: Requires downloading model and tokenizer files separately.
  • Docs: How to Convert Llama 3.1, How to Convert Hugging Face Model
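Putting the steps above together, a minimal build looks like this (the repository URL is inferred from the project and author names; the make targets are those listed above):

    git clone https://github.com/b4rtaz/distributed-llama.git
    cd distributed-llama
    make dllama       # builds the distributed inference CLI
    make dllama-api   # builds the HTTP API server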

Highlighted Details

  • Supports Llama 3.1 8B, 3.2 1B/3B, 3.3 70B, and DeepSeek R1 Distill Llama 8B models with Q40 quantization.
  • Experimental Vulkan support added March 2025.
  • Recent codebase refactor merged February 2025.
  • Demonstrated Llama 3.3 70B on 4x Mac Mini M4 Pro (24GB RAM).

Maintenance & Community

  • Project maintained by Bartłomiej Tadych.
  • Open to contributions via pull requests; larger changes should first be proposed in an issue.
  • Guidelines emphasize minimal, cross-system compatible changes.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and closed-source linking.

Limitations & Caveats

The system requires a power-of-two number of nodes (2^n) and supports only certain quantization pairings: q40 models with a q80 buffer-float-type, and f32 models with an f32 buffer-float-type. Custom models must be converted before use (see the docs linked above).
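For example, a q40 model must be paired with a q80 buffer, so a valid single-node invocation looks roughly like this (file names are placeholders; flags per the README):

    # q40 weights -> --buffer-float-type q80; f32 weights -> --buffer-float-type f32
    ./dllama inference \
      --model dllama_model_llama3_1_8b_q40.m \
      --tokenizer dllama_tokenizer_llama3_1.t \
      --buffer-float-type q80 \
      --prompt "Hi" --steps 16 --nthreads 4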

Health Check
  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 3

Star History

  • 201 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Julien Chaumond (Cofounder of Hugging Face), and 1 more.

parallelformers by tunib-ai

0%
790 stars
Toolkit for easy model parallelization
created 4 years ago, updated 2 years ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Anton Bukov (Cofounder of 1inch Network), and 16 more.

tinygrad by tinygrad

0.1%
30k stars
Minimalist deep learning framework for education and exploration
created 4 years ago, updated 19 hours ago