llm-swarm by huggingface

CLI tool to manage scalable open LLM inference endpoints in Slurm clusters

created 1 year ago
268 stars

Top 96.5% on sourcepulse

View on GitHub
Project Summary

This project provides a Python framework for managing scalable, on-demand LLM inference endpoints within Slurm clusters. It's designed for researchers and engineers who need to efficiently deploy and utilize multiple LLM instances for tasks like synthetic data generation or batch inference, abstracting away the complexities of Slurm job submission and endpoint management.

How It Works

llm-swarm automates the deployment of LLM inference servers (like Text Generation Inference or vLLM) as Slurm jobs. It generates Slurm scripts based on provided templates, submits them, and monitors their startup. Once instances are running, it configures an Nginx load balancer to distribute requests across them, providing a single, scalable endpoint. Upon completion of inference tasks, it automatically terminates the Slurm jobs to prevent resource wastage.
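
To make that lifecycle concrete, here is a minimal sketch loosely based on the repository's hello_world example. The LLMSwarm and LLMSwarmConfig names and their parameters (instances, inference_engine, the template paths) are assumptions taken from that example and may differ from the current API, so treat this as an illustration rather than a definitive usage.

    from llm_swarm import LLMSwarm, LLMSwarmConfig

    config = LLMSwarmConfig(
        instances=2,                # number of Slurm jobs / inference replicas to launch
        inference_engine="tgi",     # or "vllm"
        slurm_template_path="templates/tgi_h100.template.slurm",
        load_balancer_template_path="templates/nginx.template.conf",
    )

    # Entering the context manager submits the Slurm jobs, waits for the
    # inference servers to report ready, and starts the Nginx load balancer.
    with LLMSwarm(config) as llm_swarm:
        print(llm_swarm.endpoint)   # single load-balanced URL fronting all instances

    # Leaving the context manager cancels the Slurm jobs so no GPUs sit idle.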

Quick Start & Requirements

  • Install: clone the repository and run pip install -e .
  • Prerequisites: A Slurm cluster with Docker support. Custom Slurm templates are provided for TGI and vLLM on H100 GPUs.
  • Example: python examples/hello_world.py (a simplified sketch of this flow follows this list)
  • More Info: Hugging Face LLM Swarm
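
As referenced above, the hello_world example amounts to launching a small swarm and fanning prompts out to the load-balanced endpoint. The sketch below is an approximation: AsyncInferenceClient and text_generation are huggingface_hub APIs, while the llm_swarm names mirror the example and any omitted configuration (such as the Slurm template paths) is assumed to fall back to defaults, which may not hold on your cluster.

    import asyncio

    from huggingface_hub import AsyncInferenceClient
    from llm_swarm import LLMSwarm, LLMSwarmConfig

    prompts = ["What is Slurm?", "Explain load balancing in one sentence."]

    with LLMSwarm(LLMSwarmConfig(instances=2, inference_engine="tgi")) as llm_swarm:
        # The endpoint is a single URL; Nginx spreads requests across the instances.
        client = AsyncInferenceClient(model=llm_swarm.endpoint)

        async def generate(prompt):
            return await client.text_generation(prompt, max_new_tokens=100)

        async def main():
            return await asyncio.gather(*(generate(p) for p in prompts))

        for prompt, completion in zip(prompts, asyncio.run(main())):
            print(prompt, "->", completion)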

Highlighted Details

  • Integrates with Hugging Face text-generation-inference and vLLM.
  • Supports dynamic scaling of inference endpoints via Slurm.
  • Includes Nginx for load balancing across multiple inference instances.
  • Provides benchmarking utilities for TGI and vLLM.
  • Offers a development mode for continuous endpoint availability.

Maintenance & Community

The project is maintained by Hugging Face. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The project appears to be under the Apache 2.0 license, which is permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The primary requirement is a functional Slurm cluster with Docker. While Hugging Face Inference Endpoints can be used as a fallback, they are subject to rate limits. The provided Slurm templates are specific to H100 GPUs and may require customization for other hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
