llm-swarm by huggingface

CLI tool to manage scalable open LLM inference endpoints in Slurm clusters

created 1 year ago
268 stars

Top 96.5% on sourcepulse

View on GitHub
Project Summary

This project provides a Python framework for managing scalable, on-demand LLM inference endpoints within Slurm clusters. It's designed for researchers and engineers who need to efficiently deploy and utilize multiple LLM instances for tasks like synthetic data generation or batch inference, abstracting away the complexities of Slurm job submission and endpoint management.

How It Works

llm-swarm automates the deployment of LLM inference servers (like Text Generation Inference or vLLM) as Slurm jobs. It generates Slurm scripts based on provided templates, submits them, and monitors their startup. Once instances are running, it configures an Nginx load balancer to distribute requests across them, providing a single, scalable endpoint. Upon completion of inference tasks, it automatically terminates the Slurm jobs to prevent resource wastage.
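
To make that lifecycle concrete, here is a minimal sketch loosely based on the repository's hello_world example. The LLMSwarm and LLMSwarmConfig names and their parameters (instances, inference_engine, the template paths) are assumptions taken from that example and may differ from the current API, so treat this as an illustration rather than a definitive usage.

    from llm_swarm import LLMSwarm, LLMSwarmConfig

    config = LLMSwarmConfig(
        instances=2,                # number of Slurm jobs / inference replicas to launch
        inference_engine="tgi",     # or "vllm"
        slurm_template_path="templates/tgi_h100.template.slurm",
        load_balancer_template_path="templates/nginx.template.conf",
    )

    # Entering the context manager submits the Slurm jobs, waits for the
    # inference servers to report ready, and starts the Nginx load balancer.
    with LLMSwarm(config) as llm_swarm:
        print(llm_swarm.endpoint)   # single load-balanced URL fronting all instances

    # Leaving the context manager cancels the Slurm jobs so no GPUs sit idle.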

Quick Start & Requirements

  • Install: clone the repository and run pip install -e .
  • Prerequisites: A Slurm cluster with Docker support. Custom Slurm templates are provided for TGI and vLLM on H100 GPUs.
  • Example: python examples/hello_world.py (a simplified sketch of this flow follows this list)
  • More Info: Hugging Face LLM Swarm
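
As referenced above, the hello_world example amounts to launching a small swarm and fanning prompts out to the load-balanced endpoint. The sketch below is an approximation: AsyncInferenceClient and text_generation are huggingface_hub APIs, while the llm_swarm names mirror the example and any omitted configuration (such as the Slurm template paths) is assumed to fall back to defaults, which may not hold on your cluster.

    import asyncio

    from huggingface_hub import AsyncInferenceClient
    from llm_swarm import LLMSwarm, LLMSwarmConfig

    prompts = ["What is Slurm?", "Explain load balancing in one sentence."]

    with LLMSwarm(LLMSwarmConfig(instances=2, inference_engine="tgi")) as llm_swarm:
        # The endpoint is a single URL; Nginx spreads requests across the instances.
        client = AsyncInferenceClient(model=llm_swarm.endpoint)

        async def generate(prompt):
            return await client.text_generation(prompt, max_new_tokens=100)

        async def main():
            return await asyncio.gather(*(generate(p) for p in prompts))

        for prompt, completion in zip(prompts, asyncio.run(main())):
            print(prompt, "->", completion)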

Highlighted Details

  • Integrates with Hugging Face text-generation-inference and vLLM.
  • Supports dynamic scaling of inference endpoints via Slurm.
  • Includes Nginx for load balancing across multiple inference instances.
  • Provides benchmarking utilities for TGI and vLLM.
  • Offers a development mode for continuous endpoint availability.

Maintenance & Community

The project is maintained by Hugging Face. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The project appears to be under the Apache 2.0 license, which is permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The primary requirement is a functional Slurm cluster with Docker. While Hugging Face Inference Endpoints can be used as a fallback, they are subject to rate limits. The provided Slurm templates are specific to H100 GPUs and may require customization for other hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
