shard by leyten

Pipeline-parallel LLM inference across distributed machines

Created 2 weeks ago

399 stars

Top 72.0% on SourcePulse

Project Summary

Shard addresses the challenge of running Large Language Models (LLMs) that exceed single-GPU memory capacity by implementing pipeline parallelism across distributed machines, including over wide-area networks (WANs). It targets researchers and power users seeking to deploy frontier-sized models without requiring a dedicated datacenter or single high-end host. The primary benefit is enabling usable inference speeds for massive models distributed across geographically scattered, potentially heterogeneous GPUs.

How It Works

Shard partitions transformer layers into contiguous blocks, assigning each block to a separate GPU (a "shard"). Activations are streamed sequentially through these shards. To overcome WAN latency, it combines three techniques. Speculative decoding uses a small, CUDA-graphed "draft" model to propose multiple tokens, which the larger distributed model then verifies in a single pipeline pass. Async pipelining overlaps multiple verification passes. A direct-return ring topology minimizes communication hops. Together, these shift the bottleneck from latency to throughput, making WAN inference viable.
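The two core ideas above, contiguous layer blocks and draft-then-verify decoding, can be illustrated with a minimal Python sketch. This is not Shard's code: the function names `partition_layers` and `verify_draft`, the near-equal split, and the greedy acceptance rule are all assumptions made for illustration. A real verifier would score every draft position in one pipeline pass; the sketch calls a hypothetical `target_next` function once per position to keep the logic readable.

```python
from typing import Callable, List, Sequence


def partition_layers(num_layers: int, num_shards: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks."""
    if not 1 <= num_shards <= num_layers:
        raise ValueError("need 1 <= num_shards <= num_layers")
    base, extra = divmod(num_layers, num_shards)
    blocks: List[range] = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards each take one additional layer.
        size = base + (1 if shard < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy speculative verification (illustrative, assumed acceptance rule).

    Draft tokens are kept while they match the target model's greedy choice.
    At the first mismatch the target's token is emitted instead and the rest
    of the draft is discarded. If every draft token matches, the target
    contributes one extra token.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next(list(prefix) + accepted)
        if token != expected:
            accepted.append(expected)  # correction replaces the bad draft token
            return accepted
        accepted.append(token)
    accepted.append(target_next(list(prefix) + accepted))  # bonus token
    return accepted


if __name__ == "__main__":
    # Toy stand-in for the large model: next token is last token + 1.
    def toy_target(tokens: Sequence[int]) -> int:
        return tokens[-1] + 1

    print(partition_layers(10, 3))
    print(verify_draft([1, 2], [3, 4, 9], toy_target))
```

Each verification round emits at least one token and at most one more than the draft length, so a well-matched draft model amortizes one WAN round trip over several tokens.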

Quick Start & Requirements

Primary install/run: Not explicitly detailed; the README implies a "one command" join for participants.
Prerequisites: GPUs (tested with RTX PRO 6000 and RTX 4090), potentially on different machines and networks. No CUDA version is stated, though the NVIDIA GPUs and CUDA-graphed draft model imply a CUDA toolchain.
Documentation: The README links to docs/ARCHITECTURE.md, docs/ROADMAP.md, docs/PROOF.md, and example receipts (docs/receipts/).

Highlighted Details

Achieves ~30 tok/s for a 744B-parameter GLM-5.2 model across seven prosumer GPUs in six US states over WAN.
Serves a 120B-parameter GPT-OSS model at ~40 tok/s across three consumer GPUs in different US states over WAN.
Generates verifiable receipts detailing hardware, network conditions (RTTs), and output hashes for transparency and auditability.
Designed with principles of decentralization (anyone can join), privacy (no single node holds the full model), and uncensored inference.
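As a rough illustration of how a receipt with an output hash could support auditing, the Python sketch below builds and checks one. The actual receipt format lives in docs/receipts/ and is not described here, so every field name (`gpus`, `rtts_ms`, `output_sha256`), the helper names, and the choice of SHA-256 are assumptions.

```python
import hashlib
import json
from typing import Dict, List


def build_receipt(output_text: str, gpus: List[str], rtts_ms: Dict[str, float]) -> dict:
    """Bundle hardware, measured RTTs, and a hash of the generated text."""
    output_hash = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
    return {"gpus": gpus, "rtts_ms": rtts_ms, "output_sha256": output_hash}


def verify_receipt(receipt: dict, output_text: str) -> bool:
    """Check that a claimed output matches the hash recorded in the receipt."""
    actual = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
    return actual == receipt["output_sha256"]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    receipt = build_receipt(
        "Hello from the pipeline.",
        gpus=["RTX 4090", "RTX PRO 6000"],
        rtts_ms={"shard0->shard1": 38.5},
    )
    print(json.dumps(receipt, sort_keys=True, indent=2))
    print(verify_receipt(receipt, "Hello from the pipeline."))
```

A hash only proves that a given text matches the receipt; it does not by itself prove which hardware produced it.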

Maintenance & Community

The project is associated with "c0mpute" infrastructure. Specific community channels (Discord, Slack) or active maintainer lists are not detailed in the README. A roadmap is available at docs/ROADMAP.md.

Licensing & Compatibility

Licensed under the Apache License 2.0, a permissive license that allows commercial use and linking from closed-source software, subject to its attribution and notice requirements.

Limitations & Caveats

While the design aims for privacy, the intermediate activations a participating node processes can still leak partial token information to a malicious node; mitigating this is an open problem. The current WAN transport requires directly reachable open ports. NAT hole-punching with relay fallback is planned for Phase 1, alongside quantization. Phase 3 targets a permissionless swarm and dynamic allocation.

Explore Similar Projects

ntransformer by xaskasdf

eLLM by lucienhuangfu

sarathi-serve by microsoft

prima.cpp by Lizonghang

smg by lightseekorg

parallax by GradientHQ

mesh-llm by Mesh-LLM

minions by HazyResearch

distributed-llama by b4rtaz

spark-vllm-docker by eugr

aibrix by vllm-project

dynamo by ai-dynamo