beowulf-ai-cluster by geerlingguy

AI cluster deployment and benchmarking for distributed LLM inference

Created 3 months ago
256 stars

Top 98.6% on SourcePulse

Project Summary

This project provides an Ansible-based framework for deploying and benchmarking distributed AI clusters on heterogeneous hardware. It targets engineers and researchers needing to test various distributed AI tools, offering a repeatable setup for diverse computing environments.

How It Works

The core approach uses Ansible to automate the deployment of AI software, including llama.cpp and distributed-llama, across a cluster of computers. A primary playbook (main.yml) handles both the setup phase (downloading and compiling the necessary code) and the execution of the AI benchmarks, giving a consistent, repeatable configuration across nodes with varied hardware capabilities.
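
As a rough sketch of that pattern, a playbook combining setup and benchmark tasks might be organized as below. The task names, modules, repository URL, and benchmark flags are illustrative assumptions rather than the project's actual main.yml; only the /opt/llama.cpp path and the llama-bench-cluster tag are taken from this summary.

    ---
    # Hypothetical sketch, not the project's real main.yml.
    - name: Set up and benchmark the AI cluster
      hosts: cluster                      # assumed inventory group name
      become: true
      tasks:
        - name: Clone llama.cpp           # /opt/llama.cpp path from the rebuild note below
          ansible.builtin.git:
            repo: https://github.com/ggml-org/llama.cpp
            dest: /opt/llama.cpp
          tags: [setup]

        - name: Build llama.cpp           # build commands are assumptions
          ansible.builtin.shell: cmake -B build && cmake --build build --config Release
          args:
            chdir: /opt/llama.cpp
            creates: /opt/llama.cpp/build/bin/llama-bench
          tags: [setup]

        - name: Run the cluster-wide llama.cpp benchmark   # flags are placeholders
          ansible.builtin.command: ./build/bin/llama-bench -m /opt/models/example.gguf
          args:
            chdir: /opt/llama.cpp
          tags: [llama-bench-cluster]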

Quick Start & Requirements

Installation requires pip3 install ansible. Ensure SSH access is configured for all cluster nodes, with the ansible_user variable set appropriately. For .local domain resolution on Ubuntu, avahi-daemon may need to be installed (sudo apt-get install avahi-daemon).
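
On the control machine, those prerequisites look roughly like this; the ssh-copy-id line is just one illustrative way to set up key-based SSH access, and the user and hostname are placeholders.

    pip3 install ansible
    sudo apt-get install avahi-daemon      # Ubuntu hosts, for .local name resolution
    ssh-copy-id pi@node1.local             # example only: enable key-based SSH to each node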

  1. Copy example.hosts.ini to hosts.ini and example.config.yml to config.yml.
  2. Edit hosts.ini with the cluster hostnames (which must match the nodes' actual hostnames) and config.yml with the desired settings; a sample inventory sketch follows this list.
  3. Run the main playbook: ansible-playbook main.yml. Individual benchmarks can be selected with tags (e.g., --tags llama-bench-cluster).
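
A minimal inventory might look like the sketch below; the group name, hostnames, and ansible_user value are placeholders and may differ from the project's example.hosts.ini.

    # hosts.ini (hypothetical sketch)
    [cluster]
    node1.local
    node2.local

    [cluster:vars]
    ansible_user=pi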

Highlighted Details

  • Supports automated benchmarking for llama.cpp (individual nodes and full cluster via RPC) and distributed-llama (full cluster).
  • Provides manual benchmarking procedures for llama.cpp RPC, distributed-llama, and Exo.
  • Designed for heterogeneous environments ("random computers with random capabilities").
  • Forcing a rebuild of llama.cpp involves manually removing its directory (/opt/llama.cpp) and re-running the playbook, as shown below.
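
For example, the stale checkout can be removed from the control machine with an ad-hoc Ansible command before re-running the playbook; the cluster group name is the same placeholder used in the inventory sketch above.

    # Remove the existing checkout on every node, then rebuild via the playbook
    ansible cluster -b -m ansible.builtin.file -a "path=/opt/llama.cpp state=absent"
    ansible-playbook main.yml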

Maintenance & Community

The project is authored by Jeff Geerling. Benchmark results are currently stored externally, in the ollama-benchmark project. No community channels (such as Discord or Slack) or explicit roadmap are mentioned in the README.

Licensing & Compatibility

The project is licensed under GPLv3. This strong copyleft license necessitates careful consideration for integration into commercial or closed-source projects, as it may require distributing derivative works under the same license.

Limitations & Caveats

The development status of Exo appears stagnant, limiting its benchmarking support to manual execution. Rebuilding core components like llama.cpp requires manual intervention. Benchmark results are managed externally, and the setup relies heavily on Ansible configuration and SSH connectivity.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

24 stars in the last 30 days

Explore Similar Projects

Starred by Anton Bukov (Cofounder of 1inch Network), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

exo by exo-explore

  0.3% · 32k stars
  AI cluster for running models on diverse devices
  Created 1 year ago · Updated 1 day ago

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Tobi Lutke (Cofounder of Shopify), and 12 more.

modular by modular

  0.1% · 25k stars
  AI toolchain unifying fragmented AI deployment workflows
  Created 2 years ago · Updated 16 hours ago