shard by leyten

Pipeline-parallel LLM inference across distributed machines

Created 2 weeks ago

399 stars

Top 72.0% on SourcePulse

Summary

Shard runs Large Language Models (LLMs) that exceed single-GPU memory capacity by applying pipeline parallelism across distributed machines, including over wide-area networks (WANs). It targets researchers and power users who want to deploy frontier-sized models without a dedicated datacenter or a single high-end host. The primary benefit is usable inference speed for very large models spread across geographically scattered, potentially heterogeneous GPUs.

How It Works

Shard partitions a model's transformer layers into contiguous blocks and assigns each block to a separate GPU (a "shard"). Activations stream through the shards in sequence. To overcome WAN latency, it combines three techniques. First, speculative decoding: a small, CUDA-graphed "draft" model proposes multiple tokens, which the larger distributed model then verifies in a single pipeline pass. Second, async pipelining overlaps multiple verification passes. Third, a direct-return ring topology minimizes communication hops. Together these shift the bottleneck from latency to throughput, making WAN inference viable.
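The two core ideas above, contiguous layer blocks and draft-then-verify decoding, can be illustrated with a minimal sketch. It is not taken from the Shard codebase: the function names are hypothetical, the even split ignores GPU heterogeneity, and the greedy accept rule is an assumption, since the summary does not say whether Shard verifies greedily or by sampling.

```python
from typing import List


def partition_layers(num_layers: int, num_shards: int) -> List[range]:
    """Split layer indices into contiguous, near-equal blocks (one per shard).

    Assumption: an even split. A real deployment with heterogeneous GPUs
    would likely weight block sizes by available memory.
    """
    if not 1 <= num_shards <= num_layers:
        raise ValueError("need 1 <= num_shards <= num_layers")
    base, extra = divmod(num_layers, num_shards)
    blocks, start = [], 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional layer each.
        size = base + (1 if shard < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def accept_draft(draft: List[int], target: List[int]) -> List[int]:
    """Greedy speculative verification (assumed rule, see lead-in).

    draft:  k tokens proposed by the small draft model.
    target: k + 1 tokens the large model picks, where target[i] is its
            choice given the prompt plus draft[:i].
    Returns the matching prefix of `draft`, followed by one token from the
    large model: its correction at the first mismatch, or a bonus token
    if every draft token matched.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold one more token than draft")
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            break
        accepted.append(tok)
    accepted.append(target[len(accepted)])
    return accepted
```

Under this rule every verification pass emits at least one token and at most k + 1, so the cost of one WAN round through the pipeline is spread over several tokens whenever the draft model guesses well.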

Quick Start & Requirements

  • Primary install/run: Not explicitly detailed in the README, which implies a "one command" join for participants.
  • Prerequisites: GPUs (tested with RTX PRO 6000 and RTX 4090), potentially on different machines and networks. No specific CUDA version is stated; CUDA support is implied by the GPU requirements.
  • Documentation: The README links to docs/ARCHITECTURE.md, docs/ROADMAP.md, docs/PROOF.md, and example receipts (docs/receipts/).

Highlighted Details

  • Achieves ~30 tok/s for a 744B-parameter GLM-5.2 model across seven prosumer GPUs in six US states over WAN.
  • Serves a 120B-parameter GPT-OSS model at ~40 tok/s across three consumer GPUs in different US states over WAN.
  • Generates verifiable receipts detailing hardware, network conditions (RTTs), and output hashes for transparency and auditability.
  • Designed with principles of decentralization (anyone can join), privacy (no single node holds the full model), and uncensored inference.
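The receipt idea in the list above can be sketched as follows. This is only an illustration of hashing an output alongside run metadata: the field names, the choice of SHA-256, and the JSON layout are all assumptions, and the actual receipt format lives in docs/receipts/ and may differ.

```python
import hashlib
import json
from typing import Dict, List


def build_receipt(gpus: List[str], rtts_ms: Dict[str, float], output_text: str) -> dict:
    """Assemble a hypothetical receipt; every field name here is illustrative."""
    return {
        "gpus": gpus,
        "rtts_ms": rtts_ms,
        # Hash the output rather than storing it, so the receipt stays small
        # while still letting anyone check a claimed output against it.
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }


def receipt_digest(receipt: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the digest."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Anyone holding the generated text can recompute its hash and compare it with the receipt, which is what makes a published receipt auditable without trusting the party that produced it.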

Maintenance & Community

The project is associated with "c0mpute" infrastructure. Specific community channels (Discord, Slack) or active maintainer lists are not detailed in the README. A roadmap is available at docs/ROADMAP.md.

Licensing & Compatibility

Licensed under the Apache License 2.0. This license is generally permissive and compatible with commercial use and closed-source linking.

Limitations & Caveats

Although Shard is designed for privacy, the intermediate activations a participating node processes can still leak partial token information to a malicious node; mitigating this is an open problem. The current WAN transport relies on direct open ports, with NAT hole-punching and relay fallback planned for Phase 1 alongside quantization. Phase 3 covers a permissionless swarm and dynamic allocation.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 21
  • Issues (30d): 1
  • Star history: 400 stars in the last 19 days
