CLI tool for distributed LLM inference across networked devices
Top 20.7% on sourcepulse
This project enables distributed LLM inference by clustering home devices, targeting users with multiple machines who want to speed up inference. It uses tensor parallelism with high-speed synchronization over Ethernet, and supports Linux, macOS, and Windows with optimizations for ARM and x86_64 AVX2 CPUs.
How It Works
The architecture splits functionality between a root node and worker nodes. The root node loads the model and weights, coordinates synchronization, and processes its own slice of the neural network; worker nodes independently process their assigned slices. This spreads RAM usage across all participating devices, with the root node requiring slightly more memory than the workers.
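As a rough illustration, a cluster is typically brought up by starting each worker first and then pointing the root node at them. The commands below are a minimal sketch assuming the dllama binary's worker and inference modes; the flag names, file names, and addresses are placeholders and may differ from the project's current CLI.

    # On each worker device (repeat per machine, adjusting threads to its CPU):
    ./dllama worker --port 9998 --nthreads 4

    # On the root node: load the model, connect to the workers, and run inference.
    # Each listed worker processes its slice of the network, spreading RAM usage.
    ./dllama inference \
      --model dllama_model_q40.m \
      --tokenizer dllama_tokenizer.t \
      --buffer-float-type q80 \
      --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 \
      --prompt "Hello" --steps 64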
Quick Start & Requirements
Build the core binary with make dllama and the API server with make dllama-api.
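A hedged sketch of a first run: build both targets on every device, then serve an HTTP endpoint from the root node once the workers are up (as in the sketch above). The dllama-api flags shown are assumptions and may not match the current release.

    # Build on every device that will join the cluster:
    make dllama        # CLI for running the root and worker nodes
    make dllama-api    # optional API server for the root node

    # On the root node, expose the cluster over HTTP (flags are assumptions):
    ./dllama-api \
      --model dllama_model_q40.m \
      --tokenizer dllama_tokenizer.t \
      --buffer-float-type q80 \
      --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 \
      --port 9999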
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The number of nodes must be a power of two (2^n), and only certain quantization pairings are supported: q40 models with a q80 buffer float type, or f32 models with an f32 buffer float type. Custom models must first be converted to the project's format.
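To make these constraints concrete, here is a hedged sketch. The converter script name is a placeholder (the project ships its own conversion tooling under its own path), and the flag names are assumptions rather than the documented interface.

    # Node count must be a power of two: e.g. 1 root + 3 workers = 4 nodes.

    # Hypothetical conversion step for a custom model (script name is a placeholder):
    python converter/convert-model.py --input ./my-model --output my_model_q40.m --weights-float-type q40

    # Pair the weight format with the matching buffer format on the root node:
    ./dllama inference --model my_model_q40.m --tokenizer my_tokenizer.t \
      --buffer-float-type q80 --nthreads 4 \
      --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 \
      --prompt "Hello" --steps 32
    # q40 weights pair with q80 buffers; f32 weights would use --buffer-float-type f32.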