leyten
Pipeline-parallel LLM inference across distributed machines
Summary
Shard addresses the challenge of running Large Language Models (LLMs) that exceed single-GPU memory capacity by implementing pipeline parallelism across distributed machines, including over wide-area networks (WANs). It targets researchers and power users seeking to deploy frontier-sized models without requiring a dedicated datacenter or single high-end host. The primary benefit is enabling usable inference speeds for massive models distributed across geographically scattered, potentially heterogeneous GPUs.
How It Works
Shard partitions a model's transformer layers into contiguous blocks and assigns each block to a separate GPU (a "shard"); activations stream through the shards in sequence. Three techniques offset WAN latency. First, speculative decoding: a small, CUDA-graphed "draft" model proposes several tokens, which the larger distributed model then verifies in a single pipeline pass. Second, async pipelining overlaps multiple verification passes so the pipeline does not sit idle waiting on one round trip. Third, a direct-return ring topology minimizes communication hops. Together these shift the bottleneck from latency to throughput, making WAN inference viable.
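The partitioning and verification steps described above can be illustrated with a minimal Python sketch. It is not taken from the Shard codebase: the function names `partition_layers` and `accept_draft` are hypothetical, the near-equal split is an assumption (the README summary does not say how block sizes are chosen, and heterogeneous GPUs would likely need uneven blocks), and the greedy exact-match acceptance rule is one common simplified form of speculative verification, not necessarily the rule Shard uses.

```python
from typing import List


def partition_layers(num_layers: int, num_shards: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks.

    Hypothetical helper: assumes equal-capacity shards.
    """
    if not 0 < num_shards <= num_layers:
        raise ValueError("need 1 <= num_shards <= num_layers")
    base, extra = divmod(num_layers, num_shards)
    blocks: List[range] = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional layer each.
        size = base + (1 if shard < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def accept_draft(draft: List[int], target: List[int]) -> List[int]:
    """Greedy speculative verification for one pipeline pass (assumed rule).

    draft:  k token ids proposed by the small draft model.
    target: k + 1 token ids from the large model, where target[i] is its
            choice given the context followed by draft[:i].
    Returns the tokens committed by this pass.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold one more token than draft")
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            # First mismatch: keep the large model's token and stop.
            accepted.append(target[i])
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the extra target token is also valid.
    accepted.append(target[len(draft)])
    return accepted


if __name__ == "__main__":
    print(partition_layers(10, 3))
    print(accept_draft([5, 6, 7], [5, 6, 9, 1]))
```

Under this rule every verification pass commits at least one token and at most k + 1, which is why a single high-latency round trip through the pipeline can yield several tokens instead of one.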
Quick Start & Requirements
docs/ARCHITECTURE.md, docs/ROADMAP.md, docs/PROOF.md, and example receipts (docs/receipts/).
Highlighted Details
Maintenance & Community
The project is associated with "c0mpute" infrastructure. Specific community channels (Discord, Slack) or active maintainer lists are not detailed in the README. A roadmap is available at docs/ROADMAP.md.
Licensing & Compatibility
Licensed under the Apache License 2.0, a permissive license that allows commercial use and linking from closed-source software.
Limitations & Caveats
Although Shard is designed with privacy in mind, the intermediate activations that a participating node processes can still leak partial token information to a malicious node; mitigating this is an open problem. The current WAN transport requires directly reachable open ports. NAT hole-punching and relay fallback are planned for Phase 1, alongside quantization. Phase 3 targets a permissionless swarm and dynamic allocation.