CLI tool to manage scalable open LLM inference endpoints in Slurm clusters
This project provides a Python framework for managing scalable, on-demand LLM inference endpoints within Slurm clusters. It's designed for researchers and engineers who need to efficiently deploy and utilize multiple LLM instances for tasks like synthetic data generation or batch inference, abstracting away the complexities of Slurm job submission and endpoint management.
How It Works
llm-swarm automates the deployment of LLM inference servers (such as Text Generation Inference or vLLM) as Slurm jobs. It generates Slurm scripts from the provided templates, submits them, and monitors their startup. Once the instances are running, it configures an Nginx load balancer to distribute requests across them, exposing a single, scalable endpoint. When the inference tasks complete, it automatically terminates the Slurm jobs to avoid wasting resources.
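A minimal sketch of that lifecycle from Python is shown below. The class and field names (LLMSwarm, LLMSwarmConfig, instances, inference_engine, the template paths, and the endpoint attribute) are assumptions based on the project's examples and should be checked against the installed version.

```python
# Sketch of the deployment lifecycle described above; names are assumed
# from the project's examples and may differ in the installed version.
from llm_swarm import LLMSwarm, LLMSwarmConfig

config = LLMSwarmConfig(
    instances=2,                      # number of Slurm jobs / inference servers to launch
    inference_engine="tgi",           # "tgi" or "vllm"
    slurm_template_path="templates/tgi_h100.template.slurm",
    load_balancer_template_path="templates/nginx.template.conf",
)

# Entering the context manager submits the Slurm jobs, waits for the servers
# to come up, and configures the Nginx load balancer; exiting it terminates
# the jobs so no GPUs are left idle.
with LLMSwarm(config) as llm_swarm:
    print(llm_swarm.endpoint)  # single load-balanced URL for all instances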
Quick Start & Requirements
pip install -e .
python examples/hello_world.py
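The hello_world example exercises the load-balanced endpoint. A minimal sketch of that pattern is below; it assumes the LLMSwarm interface from the earlier snippet and huggingface_hub's AsyncInferenceClient, and is not a verbatim copy of examples/hello_world.py.

```python
# Querying the load-balanced endpoint (sketch, not the actual example script).
import asyncio
from huggingface_hub import AsyncInferenceClient
from llm_swarm import LLMSwarm, LLMSwarmConfig

async def main() -> None:
    with LLMSwarm(LLMSwarmConfig(instances=2, inference_engine="tgi")) as llm_swarm:
        client = AsyncInferenceClient(model=llm_swarm.endpoint)
        # Each request is routed by Nginx to one of the running instances.
        completion = await client.text_generation("What is Slurm?", max_new_tokens=64)
        print(completion)

asyncio.run(main())
```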
Highlighted Details
Supports both text-generation-inference and vLLM as inference backends.
Maintenance & Community
The project is maintained by Hugging Face. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The project appears to be under the Apache 2.0 license, which is permissive for commercial use and integration with closed-source projects.
Limitations & Caveats
The primary requirement is a functional Slurm cluster with Docker. While Hugging Face Inference Endpoints can be used as a fallback, they are subject to rate limits. The provided Slurm templates are specific to H100 GPUs and may require customization for other hardware.
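For other hardware, the expected workflow is presumably to copy and edit a template, then point the config at the edited file. The field names below are assumptions carried over from the earlier sketches, and the template filename is a hypothetical edited copy.

```python
# Hypothetical customization for non-H100 hardware: copy a provided template,
# adjust the partition/GPU directives, and reference the edited copy here.
from llm_swarm import LLMSwarmConfig

config = LLMSwarmConfig(
    instances=4,
    inference_engine="vllm",
    slurm_template_path="templates/vllm_a100.template.slurm",  # your edited copy
    load_balancer_template_path="templates/nginx.template.conf",
)
```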