distributed-llama by b4rtaz

CLI tool for distributed LLM inference across networked devices

created 1 year ago
2,232 stars

Top 20.7% on sourcepulse

View on GitHub
Project Summary

This project enables distributed LLM inference by clustering home devices, targeting users who want to pool multiple machines to speed up inference. It leverages tensor parallelism and high-speed synchronization over Ethernet, and supports Linux, macOS, and Windows with optimizations for ARM and x86_64 AVX2 CPUs.

How It Works

The architecture splits functionality between a root node and worker nodes. The root node loads the model and weights, coordinates synchronization, and processes its own slice of the neural network; worker nodes independently process their assigned slices. This design spreads RAM usage across all participating devices, with the root node requiring slightly more memory than the workers.
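As a concrete sketch of this split, a launch on a four-node cluster looks roughly like the following (IP addresses, port, and model/tokenizer file names are placeholders; flag names follow the project's README but may differ between versions, so check ./dllama --help):

    # On each of the three worker machines: start a worker that waits for the root
    ./dllama worker --port 9998 --nthreads 4

    # On the root machine: load the model, keep one slice locally, and
    # distribute the remaining slices to the listed workers over the network
    ./dllama inference \
      --model dllama_model_llama3_1_8b_q40.m \
      --tokenizer dllama_tokenizer_llama3_1.t \
      --buffer-float-type q80 \
      --prompt "Hello world" --steps 32 --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

One root plus three workers makes four nodes, satisfying the 2^n node-count requirement noted below.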

Quick Start & Requirements

  • Install: Clone the repository and compile with make dllama and make dllama-api (see the build sketch after this list).
  • Prerequisites: Python 3, C++ compiler (GCC/MinGW), Git.
  • Hardware: x86_64 CPUs with AVX2, or ARM CPUs. Supports only 2^n nodes (1, 2, 4, 8, …).
  • Models: Requires downloading model and tokenizer files separately.
  • Docs: How to Convert Llama 3.1, How to Convert Hugging Face Model
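Putting the steps above together, a minimal build looks like this (the repository URL is inferred from the project and author names; the make targets are those listed above):

    git clone https://github.com/b4rtaz/distributed-llama.git
    cd distributed-llama
    make dllama       # builds the distributed inference CLI
    make dllama-api   # builds the HTTP API server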

Highlighted Details

  • Supports Llama 3.1 8B, 3.2 1B/3B, 3.3 70B, and DeepSeek R1 Distill Llama 8B models with Q40 quantization.
  • Experimental Vulkan support added March 2025.
  • Recent codebase refactor merged February 2025.
  • Demonstrated Llama 3.3 70B on 4x Mac Mini M4 Pro (24GB RAM).

Maintenance & Community

  • Project maintained by Bartłomiej Tadych.
  • Open to contributions via pull requests; larger changes should first be proposed in an issue.
  • Guidelines emphasize minimal, cross-system compatible changes.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and closed-source linking.

Limitations & Caveats

The system requires a power-of-two number of nodes (2^n) and supports only certain quantization pairings: q40 models with a q80 buffer-float-type, and f32 models with an f32 buffer-float-type. Custom models must be converted before use (see the docs linked above).
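For example, a q40 model must be paired with a q80 buffer, so a valid single-node invocation looks roughly like this (file names are placeholders; flags per the README):

    # q40 weights -> --buffer-float-type q80; f32 weights -> --buffer-float-type f32
    ./dllama inference \
      --model dllama_model_llama3_1_8b_q40.m \
      --tokenizer dllama_tokenizer_llama3_1.t \
      --buffer-float-type q80 \
      --prompt "Hi" --steps 16 --nthreads 4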

Health Check
  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 3

Star History

  • 201 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Julien Chaumond (Cofounder of Hugging Face), and 1 more.

parallelformers by tunib-ai

0%
790 stars
Toolkit for easy model parallelization
created 4 years ago, updated 2 years ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Anton Bukov (Cofounder of 1inch Network), and 16 more.

tinygrad by tinygrad

0.1%
30k stars
Minimalist deep learning framework for education and exploration
created 4 years ago, updated 19 hours ago