distributed-llama by b4rtaz

CLI tool for distributed LLM inference across networked devices

Created 1 year ago · 2,639 stars · Top 17.9% on SourcePulse

Project Summary

This project enables distributed LLM inference by clustering home devices, targeting users with multiple machines who want to accelerate inference speed. It leverages tensor parallelism and high-speed synchronization over Ethernet, supporting Linux, macOS, and Windows with optimizations for ARM and x86_64 AVX2 CPUs.

How It Works

The architecture splits functionality between a root node and worker nodes. The root node loads the model and weights, coordinates synchronization, and processes its own slice of the neural network; worker nodes independently process the slices assigned to them. Spreading the computation this way also spreads RAM usage across all participating devices, with the root node requiring slightly more memory than the workers.
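To make the topology concrete, here is a rough launch sketch, with flags paraphrased from the project's README and placeholder file names and addresses (check the repository docs for exact syntax): each worker device starts a worker process first, then the root node is pointed at the model, tokenizer, and worker addresses.

    # On each worker device: listen for the root node (illustrative flags).
    ./dllama worker --port 9998 --nthreads 4

    # On the root device: run inference against the converted model and tokenizer,
    # listing every worker address (total node count must be a power of two).
    ./dllama inference \
      --model dllama_model_llama3_8b_q40.m \
      --tokenizer dllama_tokenizer_llama3.t \
      --buffer-float-type q80 \
      --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 \
      --prompt "Hello"

Here the root plus three workers form a 4-node (2^2) cluster, satisfying the power-of-two requirement noted below.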

Quick Start & Requirements

  • Install: Clone the repository and compile with make dllama and make dllama-api (see the build sketch after this list).
  • Prerequisites: Python 3, C++ compiler (GCC/MinGW), Git.
  • Hardware: x86_64 AVX2 CPUs or ARM CPUs. Supports only 2^n nodes.
  • Models: Requires downloading model and tokenizer files separately.
  • Docs: How to Convert Llama 3.1, How to Convert Hugging Face Model
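A minimal build sketch on Linux or macOS, using the make targets listed above (Windows builds go through MinGW and may differ slightly):

    # Fetch the sources and build the CLI and API server binaries.
    git clone https://github.com/b4rtaz/distributed-llama.git
    cd distributed-llama
    make dllama
    make dllama-api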

Highlighted Details

  • Supports Llama 3.1 8B, 3.2 1B/3B, 3.3 70B, and DeepSeek R1 Distill Llama 8B models with Q40 quantization.
  • Experimental Vulkan support added March 2025.
  • Recent codebase refactor merged February 2025.
  • Demonstrated Llama 3.3 70B on 4x Mac Mini M4 Pro (24GB RAM).

Maintenance & Community

  • Project maintained by Bartłomiej Tadych (b4rtaz).
  • Contributions are welcome via pull requests; larger changes should be discussed in an issue first.
  • Contribution guidelines emphasize minimal changes that stay compatible across supported systems.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and closed-source linking.

Limitations & Caveats

The node count must be a power of two (2^n), and only specific quantization pairings are supported: q40 models with a q80 buffer-float-type, and f32 models with an f32 buffer-float-type. Custom models must be converted to the project's format before use.
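For illustration, the supported pairings correspond to the root node's model and buffer-float-type options roughly as follows (model file names are placeholders, remaining arguments elided):

    # q40 weights pair with a q80 buffer float type
    ./dllama inference --model model_q40.m --buffer-float-type q80 ...

    # f32 weights pair with an f32 buffer float type
    ./dllama inference --model model_f32.m --buffer-float-type f32 ...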

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 4
  • Issues (30d): 2
  • Star history: 353 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

minions by HazyResearch

Communication protocol for cost-efficient LLM collaboration
1.3% · 1k stars · Created 7 months ago · Updated 16 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.

torchchat by pytorch

PyTorch-native SDK for local LLM inference across diverse platforms
0.1% · 4k stars · Created 1 year ago · Updated 1 week ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 11 more.

petals by bigscience-workshop

Run LLMs at home, BitTorrent-style
0.1% · 10k stars · Created 3 years ago · Updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

Inference framework for distributed generative AI model serving
1.0% · 5k stars · Created 6 months ago · Updated 13 hours ago