mesh-llm by michaelneale

Scalable distributed LLM inference across machines

Created 2 months ago
720 stars

Top 47.5% on SourcePulse


Summary

Mesh-LLM enables distributed inference for Large Language Models (LLMs) by pooling spare GPU capacity across machines. It runs models too large for a single node by automatically applying pipeline or expert parallelism, and supports agent communication through a decentralized "gossip" protocol. The project targets engineers and researchers who want to scale LLM deployments and make efficient use of distributed compute.

How It Works

Built on a fork of llama.cpp, Mesh-LLM distributes models automatically: dense models use pipeline parallelism, splitting layers across nodes, while Mixture-of-Experts (MoE) models use expert sharding to minimize cross-node traffic. The system prioritizes low-latency connections and streams inference over HTTP to mitigate network overhead. Agents communicate via a decentralized blackboard, forming a "gossip" layer for collaborative workflows.
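The layer-splitting idea behind pipeline parallelism can be sketched as follows. This is a minimal illustration, not mesh-llm's actual code; the layer and node counts are hypothetical:

```python
def split_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Assign contiguous blocks of transformer layers to nodes,
    spreading any remainder across the first few nodes."""
    base, extra = divmod(num_layers, num_nodes)
    assignments, start = [], 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        assignments.append(range(start, start + count))
        start += count
    return assignments

# A 32-layer dense model split across 3 nodes: each node runs its block
# in sequence, forwarding activations to the next node in the pipeline.
print(split_layers(32, 3))  # → [range(0, 11), range(11, 22), range(22, 32)]
```

Each hop between blocks is a network transfer of activations, which is why the source notes that cross-network latency hurts time-to-first-token and per-token throughput.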

Quick Start & Requirements

Installation is via a bash script (curl ... | bash) on macOS, Linux, and Windows (source/zip). Building from source (git clone ... && just build) is an alternative; it requires just, cmake, Rust, Node.js (v24+), and a GPU toolkit (CUDA, ROCm, Vulkan, or Metal; CPU-only operation is also supported). Detailed build instructions are in CONTRIBUTING.md.

Highlighted Details

  • Provides an OpenAI-compatible API (http://localhost:9337/v1).
  • Automatic distribution adapts to model architecture (dense/MoE) and hardware.
  • Decentralized blackboard enables agent knowledge sharing and communication.
  • Supports speculative decoding for improved throughput.
  • Allows multi-model serving across different nodes.
  • Features demand-aware rebalancing for dynamic load adjustment.
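Because the server exposes an OpenAI-compatible API at http://localhost:9337/v1, any OpenAI-style client should work against it. A minimal sketch using only the Python standard library; the model name "my-model" is a hypothetical placeholder for whatever the mesh is serving:

```python
import json
import urllib.request

API_URL = "http://localhost:9337/v1/chat/completions"

# "my-model" is a placeholder; substitute a model actually loaded on the mesh.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello from the mesh!"}],
}
request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(request, timeout=10) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # connection refused when no mesh node is running
    print(f"mesh-llm server not reachable: {exc}")
```

The same endpoint should also accept existing OpenAI SDK clients pointed at the local base URL, which is the usual benefit of an OpenAI-compatible server.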

Maintenance & Community

Community discussion occurs on the #mesh-llm channel on the Goose Discord. Development workflows are detailed in CONTRIBUTING.md.

Licensing & Compatibility

The README omits license information, preventing assessment of commercial use or closed-source linking compatibility.

Limitations & Caveats

Pipeline parallelism significantly reduces inference throughput (e.g., 68 tok/s solo vs. 12-13 tok/s on a 3-node split). Cross-network latency impacts time-to-first-token. Advanced features like mesh-wide rebalancing are planned for "Stage Two." The missing license is a critical adoption blocker.

Health Check
Last Commit

20 hours ago

Responsiveness

Inactive

Pull Requests (30d)
197
Issues (30d)
46
Star History
720 stars in the last 30 days

Explore Similar Projects

Starred by Matthew Johnson (Coauthor of JAX; Research Scientist at Google Brain), Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), and 3 more.

sglang-jax by sgl-project

1.5%
264
High-performance LLM inference engine for JAX/TPU serving
Created 8 months ago
Updated 1 day ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

minions by HazyResearch

0.1%
1k
Communication protocol for cost-efficient LLM collaboration
Created 1 year ago
Updated 1 month ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

petals by bigscience-workshop

0.2%
10k
Run LLMs at home, BitTorrent-style
Created 3 years ago
Updated 1 year ago