parallax by GradientHQ

Decentralized AI cluster framework for distributed LLM inference

Created 1 month ago
546 stars

Top 58.4% on SourcePulse

Summary

Parallax is a fully decentralized inference engine designed to build AI clusters anywhere, enabling users to deploy and serve large language models across distributed nodes with varying configurations. It targets developers and power users looking to host LLMs on personal devices or distributed infrastructure, offering a flexible and high-performance solution for model inference. The framework's key benefit is its ability to abstract away hardware heterogeneity and network complexities, allowing for seamless distributed model sharding and efficient request handling.

How It Works

Parallax employs a peer-to-peer (P2P) communication backbone powered by Lattica, integrating a GPU backend via SGLang and a Mac backend using MLX. Its architecture supports pipeline parallel model sharding, allowing large models to be split across multiple nodes. Novel features include dynamic KV cache management and continuous batching specifically for macOS, alongside dynamic request scheduling and routing for optimized throughput and low latency across the distributed cluster.
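To make the sharding idea concrete, here is a minimal Python sketch of pipeline-parallel partitioning. It is a generic illustration of the technique rather than Parallax's actual API; the stage boundary stands in for where a P2P transport such as Lattica would ship activations between nodes.

    # A generic sketch of pipeline-parallel sharding, not Parallax's actual API.
    # Layers are split into contiguous stages (one per node); activations flow
    # stage to stage, where a P2P transport like Lattica would carry them.

    def shard_layers(layers, num_stages):
        """Split layers into num_stages contiguous groups, one per node."""
        per_stage = -(-len(layers) // num_stages)  # ceiling division
        return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

    def pipeline_forward(stages, x):
        """Run input through each stage in order, as if each lived on a node."""
        for stage in stages:
            for layer in stage:
                x = layer(x)
            # in a real cluster, x would be serialized and sent to the next peer
        return x

    layers = [lambda x, i=i: x + i for i in range(8)]   # toy "model" of 8 layers
    stages = shard_layers(layers, 3)                    # shard across 3 nodes
    print(pipeline_forward(stages, 0.0))                # -> 28.0

Contiguous stages keep each node's slice of the model dense, so only one activation tensor crosses the network per stage boundary.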

Quick Start & Requirements

Installation varies by OS and hardware:

  • Linux/WSL with a GPU: clone the repo and run pip install -e '.[gpu]'.
  • macOS with Apple Silicon: create a Python virtual environment and run pip install -e '.[mac]'.
  • Windows: use the provided .exe installer, then run parallax install (estimated at around 30 minutes).
  • Docker: images are available for Linux+GPU setups (e.g., gradientservice/parallax:latest-hopper).

Python 3.11.0 to 3.13.x is required, and specific Ubuntu versions are noted for Blackwell GPUs. Setup involves launching a scheduler via parallax run and connecting nodes using parallax join; a web-based frontend is available at http://localhost:3001. A typical command sequence is sketched below.
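A minimal end-to-end flow using only the commands named above (model selection and networking flags are omitted; consult the README for the exact options):

    # on the machine that will act as the scheduler (Linux/WSL + GPU shown)
    pip install -e '.[gpu]'        # use '.[mac]' on Apple Silicon
    parallax run

    # on each additional node, after installing the same way
    parallax join

    # then open the web frontend in a browser
    # http://localhost:3001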

Highlighted Details

  • Fully decentralized inference engine for distributed AI clusters.
  • Enables hosting local LLMs on personal devices.
  • Supports pipeline parallel model sharding across heterogeneous nodes.
  • Features dynamic KV cache management and continuous batching (Mac-specific); a batching sketch follows this list.
  • Dynamic request scheduling and routing for high performance.
  • Backend technologies include Lattica (P2P), SGLang (GPU), and MLX (Mac).
  • Compatible with a range of models including Meta Llama 3, Qwen, DeepSeek, and others.
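To illustrate the continuous-batching pattern, here is a minimal Python sketch; it shows the generic scheduling idea, not Parallax's implementation, and the request sizes and max_batch value are invented for the example. Each decode step, finished requests leave the batch and queued requests immediately take their slots:

    from collections import deque

    def continuous_batching(requests, max_batch=4):
        """requests: list of (request_id, tokens_to_generate) pairs."""
        queue = deque(requests)
        active = {}  # request_id -> tokens still to generate
        step = 0
        while queue or active:
            # admit queued requests into any free batch slots
            while queue and len(active) < max_batch:
                rid, remaining = queue.popleft()
                active[rid] = remaining
            # one decode step for every active request
            for rid in list(active):
                active[rid] -= 1
                if active[rid] == 0:
                    print(f"step {step}: request {rid} finished")
                    del active[rid]
            step += 1

    continuous_batching([("a", 3), ("b", 1), ("c", 5), ("d", 2), ("e", 2)])

Dynamic KV cache management pairs naturally with this pattern: as requests join and leave the batch, their attention caches are allocated and reclaimed, so memory usage tracks the live batch rather than a fixed worst case.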

Maintenance & Community

Project links include Gradient, Blog, X (Twitter), and Discord. The README mentions a recent release (v0.0.1) in October 2025, suggesting active development.

Licensing & Compatibility

The provided README does not specify a software license. This lack of explicit licensing information may pose compatibility concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

Installation from source is not recommended on Windows, and Linux GPU support for Blackwell GPUs is tied to specific Ubuntu versions. At version 0.0.1, the project is still in early development. Crucially, no software license is stated, which is a significant adoption blocker.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 84
  • Issues (30d): 49
  • Star History: 555 stars in the last 30 days

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

Explore Similar Projects

minions by HazyResearch — Communication protocol for cost-efficient LLM collaboration. Top 0.1% on SourcePulse; 1k stars. Created 9 months ago; updated 15 hours ago.