beowulf-ai-cluster by geerlingguy

AI cluster deployment and benchmarking for distributed LLM inference

Created 3 months ago
256 stars

Top 98.6% on SourcePulse

Project Summary

This project provides an Ansible-based framework for deploying and benchmarking distributed AI clusters on heterogeneous hardware. It targets engineers and researchers needing to test various distributed AI tools, offering a repeatable setup for diverse computing environments.

How It Works

The core approach uses Ansible to automate the deployment of AI software, including llama.cpp and distributed-llama, across a cluster of computers. A primary playbook (main.yml) handles both the setup phase (downloading and compiling the necessary code) and the execution of the AI benchmarks, giving a consistent, repeatable configuration across nodes with varied hardware capabilities.
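
As a rough sketch of that pattern, a playbook combining setup and benchmark tasks might be organized as below. The task names, modules, repository URL, and benchmark flags are illustrative assumptions rather than the project's actual main.yml; only the /opt/llama.cpp path and the llama-bench-cluster tag are taken from this summary.

    ---
    # Hypothetical sketch, not the project's real main.yml.
    - name: Set up and benchmark the AI cluster
      hosts: cluster                      # assumed inventory group name
      become: true
      tasks:
        - name: Clone llama.cpp           # /opt/llama.cpp path from the rebuild note below
          ansible.builtin.git:
            repo: https://github.com/ggml-org/llama.cpp
            dest: /opt/llama.cpp
          tags: [setup]

        - name: Build llama.cpp           # build commands are assumptions
          ansible.builtin.shell: cmake -B build && cmake --build build --config Release
          args:
            chdir: /opt/llama.cpp
            creates: /opt/llama.cpp/build/bin/llama-bench
          tags: [setup]

        - name: Run the cluster-wide llama.cpp benchmark   # flags are placeholders
          ansible.builtin.command: ./build/bin/llama-bench -m /opt/models/example.gguf
          args:
            chdir: /opt/llama.cpp
          tags: [llama-bench-cluster]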

Quick Start & Requirements

Installation requires pip3 install ansible. Ensure SSH access is configured for all cluster nodes, with the ansible_user variable set appropriately. For .local domain resolution on Ubuntu, avahi-daemon may need to be installed (sudo apt-get install avahi-daemon).
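
On the control machine, those prerequisites look roughly like this; the ssh-copy-id line is just one illustrative way to set up key-based SSH access, and the user and hostname are placeholders.

    pip3 install ansible
    sudo apt-get install avahi-daemon      # Ubuntu hosts, for .local name resolution
    ssh-copy-id pi@node1.local             # example only: enable key-based SSH to each node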

  1. Copy example.hosts.ini to hosts.ini and example.config.yml to config.yml.
  2. Edit hosts.ini with the cluster hostnames (which must match the nodes' actual hostnames) and config.yml with the desired settings; a sample inventory sketch follows this list.
  3. Run the main playbook: ansible-playbook main.yml. Individual benchmarks can be selected with tags (e.g., --tags llama-bench-cluster).
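
A minimal inventory might look like the sketch below; the group name, hostnames, and ansible_user value are placeholders and may differ from the project's example.hosts.ini.

    # hosts.ini (hypothetical sketch)
    [cluster]
    node1.local
    node2.local

    [cluster:vars]
    ansible_user=pi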

Highlighted Details

  • Supports automated benchmarking for llama.cpp (individual nodes and full cluster via RPC) and distributed-llama (full cluster).
  • Provides manual benchmarking procedures for llama.cpp RPC, distributed-llama, and Exo.
  • Designed for heterogeneous environments ("random computers with random capabilities").
  • Forcing a rebuild of llama.cpp involves manually removing its directory (/opt/llama.cpp) and re-running the playbook, as shown below.
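
For example, the stale checkout can be removed from the control machine with an ad-hoc Ansible command before re-running the playbook; the cluster group name is the same placeholder used in the inventory sketch above.

    # Remove the existing checkout on every node, then rebuild via the playbook
    ansible cluster -b -m ansible.builtin.file -a "path=/opt/llama.cpp state=absent"
    ansible-playbook main.yml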

Maintenance & Community

The project is authored by Jeff Geerling. Benchmark results are currently stored externally, in the ollama-benchmark project. No community channels (such as Discord or Slack) or explicit roadmap are mentioned in the README.

Licensing & Compatibility

The project is licensed under GPLv3. This strong copyleft license necessitates careful consideration for integration into commercial or closed-source projects, as it may require distributing derivative works under the same license.

Limitations & Caveats

The development status of Exo appears stagnant, limiting its benchmarking support to manual execution. Rebuilding core components like llama.cpp requires manual intervention. Benchmark results are managed externally, and the setup relies heavily on Ansible configuration and SSH connectivity.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

24 stars in the last 30 days

Explore Similar Projects

Starred by Anton Bukov (Cofounder of 1inch Network), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

exo by exo-explore

  0.3% · 32k stars
  AI cluster for running models on diverse devices
  Created 1 year ago · Updated 1 day ago

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Tobi Lutke (Cofounder of Shopify), and 12 more.

modular by modular

  0.1% · 25k stars
  AI toolchain unifying fragmented AI deployment workflows
  Created 2 years ago · Updated 16 hours ago