mesh-llm by michaelneale

Scalable distributed LLM inference across machines

Created 2 months ago
720 stars

Top 47.5% on SourcePulse


Summary

Mesh-LLM enables distributed inference for Large Language Models (LLMs) by pooling spare GPU capacity across machines. It runs models too large for a single node by automatically applying pipeline or expert parallelism, and supports agent communication through a decentralized "gossip" protocol. The project targets engineers and researchers who want to scale LLM deployments and make efficient use of distributed compute.

How It Works

Built on a fork of llama.cpp, Mesh-LLM distributes models automatically: dense models use pipeline parallelism, splitting layers across nodes, while Mixture-of-Experts (MoE) models use expert sharding to minimize cross-node traffic. The system prioritizes low-latency connections and streams inference over HTTP to mitigate network overhead. Agents communicate via a decentralized blackboard, forming a "gossip" layer for collaborative workflows.
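The layer-splitting idea behind pipeline parallelism can be sketched as follows. This is a minimal illustration, not mesh-llm's actual code; the layer and node counts are hypothetical:

```python
def split_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Assign contiguous blocks of transformer layers to nodes,
    spreading any remainder across the first few nodes."""
    base, extra = divmod(num_layers, num_nodes)
    assignments, start = [], 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        assignments.append(range(start, start + count))
        start += count
    return assignments

# A 32-layer dense model split across 3 nodes: each node runs its block
# in sequence, forwarding activations to the next node in the pipeline.
print(split_layers(32, 3))  # → [range(0, 11), range(11, 22), range(22, 32)]
```

Each hop between blocks is a network transfer of activations, which is why the source notes that cross-network latency hurts time-to-first-token and per-token throughput.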

Quick Start & Requirements

Installation is via a bash script (curl ... | bash) on macOS, Linux, and Windows (source/zip). Building from source (git clone ... && just build) is an alternative; it requires just, cmake, Rust, Node.js (v24+), and a GPU toolkit (CUDA, ROCm, Vulkan, or Metal; CPU-only operation is also supported). Detailed build instructions are in CONTRIBUTING.md.

Highlighted Details

  • Provides an OpenAI-compatible API (http://localhost:9337/v1).
  • Automatic distribution adapts to model architecture (dense/MoE) and hardware.
  • Decentralized blackboard enables agent knowledge sharing and communication.
  • Supports speculative decoding for improved throughput.
  • Allows multi-model serving across different nodes.
  • Features demand-aware rebalancing for dynamic load adjustment.
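Because the server exposes an OpenAI-compatible API at http://localhost:9337/v1, any OpenAI-style client should work against it. A minimal sketch using only the Python standard library; the model name "my-model" is a hypothetical placeholder for whatever the mesh is serving:

```python
import json
import urllib.request

API_URL = "http://localhost:9337/v1/chat/completions"

# "my-model" is a placeholder; substitute a model actually loaded on the mesh.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello from the mesh!"}],
}
request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(request, timeout=10) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # connection refused when no mesh node is running
    print(f"mesh-llm server not reachable: {exc}")
```

The same endpoint should also accept existing OpenAI SDK clients pointed at the local base URL, which is the usual benefit of an OpenAI-compatible server.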

Maintenance & Community

Community discussion occurs on the #mesh-llm channel on the Goose Discord. Development workflows are detailed in CONTRIBUTING.md.

Licensing & Compatibility

The README omits license information, preventing assessment of commercial use or closed-source linking compatibility.

Limitations & Caveats

Pipeline parallelism significantly reduces inference throughput (e.g., 68 tok/s solo vs. 12-13 tok/s on a 3-node split). Cross-network latency impacts time-to-first-token. Advanced features like mesh-wide rebalancing are planned for "Stage Two." The missing license is a critical adoption blocker.

Health Check
Last Commit

20 hours ago

Responsiveness

Inactive

Pull Requests (30d)
197
Issues (30d)
46
Star History
720 stars in the last 30 days

Explore Similar Projects

Starred by Matthew Johnson (Coauthor of JAX; Research Scientist at Google Brain), Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), and 3 more.

sglang-jax by sgl-project

1.5%
264
High-performance LLM inference engine for JAX/TPU serving
Created 8 months ago
Updated 1 day ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

minions by HazyResearch

0.1%
1k
Communication protocol for cost-efficient LLM collaboration
Created 1 year ago
Updated 1 month ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

petals by bigscience-workshop

0.2%
10k
Run LLMs at home, BitTorrent-style
Created 3 years ago
Updated 1 year ago