llm-swarm by huggingface

CLI tool to manage scalable open LLM inference endpoints in Slurm clusters

Created 1 year ago
272 stars

Top 94.8% on SourcePulse

Project Summary

This project provides a Python framework for managing scalable, on-demand LLM inference endpoints within Slurm clusters. It's designed for researchers and engineers who need to efficiently deploy and utilize multiple LLM instances for tasks like synthetic data generation or batch inference, abstracting away the complexities of Slurm job submission and endpoint management.

How It Works

llm-swarm automates the deployment of LLM inference servers (like Text Generation Inference or vLLM) as Slurm jobs. It generates Slurm scripts based on provided templates, submits them, and monitors their startup. Once instances are running, it configures an Nginx load balancer to distribute requests across them, providing a single, scalable endpoint. Upon completion of inference tasks, it automatically terminates the Slurm jobs to prevent resource wastage.
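As a rough illustration of this workflow, the sketch below is modeled on the repo's examples/hello_world.py. The LLMSwarm/LLMSwarmConfig names, the template paths, and the huggingface_hub client call are assumptions drawn from the project's examples rather than a verified API, so exact parameter names and defaults may differ.

```python
# Minimal sketch of the llm-swarm workflow, modeled on examples/hello_world.py.
# The LLMSwarm/LLMSwarmConfig parameters and template paths below are assumptions
# taken from the project's examples and may differ from the installed version.
import asyncio
from huggingface_hub import AsyncInferenceClient
from llm_swarm import LLMSwarm, LLMSwarmConfig

config = LLMSwarmConfig(
    instances=2,                                    # number of Slurm-backed inference servers
    inference_engine="tgi",                         # or "vllm"
    slurm_template_path="templates/tgi_h100.template.slurm",
    load_balancer_template_path="templates/nginx.template.conf",
)

# Entering the context submits the Slurm jobs, waits for the servers to report
# ready, and starts the Nginx load balancer; leaving it cancels the jobs.
with LLMSwarm(config) as llm_swarm:
    client = AsyncInferenceClient(model=llm_swarm.endpoint)

    async def main():
        completion = await client.text_generation("What is 2 + 2?", max_new_tokens=32)
        print(completion)

    asyncio.run(main())
```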

Quick Start & Requirements

  • Install: clone the repository, then pip install -e .
  • Prerequisites: A Slurm cluster with Docker support. Custom Slurm templates are provided for TGI and vLLM on H100 GPUs.
  • Example: python examples/hello_world.py
  • More Info: Hugging Face LLM Swarm

Highlighted Details

  • Integrates with Hugging Face text-generation-inference and vLLM.
  • Supports dynamic scaling of inference endpoints via Slurm.
  • Includes Nginx for load balancing across multiple inference instances (see the sketch after this list).
  • Provides benchmarking utilities for TGI and vLLM.
  • Offers a development mode for continuous endpoint availability.
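Because the Nginx layer exposes the whole swarm as a single endpoint, batch workloads such as synthetic data generation can simply fan out concurrent requests against it. The sketch below is a hedged illustration: the endpoint URL is hypothetical, and the client call assumes huggingface_hub's AsyncInferenceClient as in the sketch above.

```python
# Hedged sketch: fanning a batch of prompts across the load-balanced endpoint.
# The endpoint URL would normally come from llm_swarm.endpoint (see the sketch
# above); the URL in the usage comment below is purely hypothetical.
import asyncio
from huggingface_hub import AsyncInferenceClient

async def generate_batch(endpoint: str, prompts: list[str]) -> list[str]:
    # Concurrent requests are spread across the Slurm-backed instances by Nginx.
    client = AsyncInferenceClient(model=endpoint)
    tasks = [client.text_generation(p, max_new_tokens=128) for p in prompts]
    return await asyncio.gather(*tasks)

# Example usage (hypothetical endpoint URL):
# results = asyncio.run(generate_batch("http://localhost:2345", ["prompt A", "prompt B"]))
```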

Maintenance & Community

The project is maintained by Hugging Face. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The project appears to be under the Apache 2.0 license, which is permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The primary requirement is a functional Slurm cluster with Docker. While Hugging Face Inference Endpoints can be used as a fallback, they are subject to rate limits. The provided Slurm templates are specific to H100 GPUs and may require customization for other hardware.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC (0.5%, 4k stars)
Python framework for LLM inference and serving
Created 2 years ago · Updated 12 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB (0.4%, 8k stars)
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 1 week ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo (1.0%, 5k stars)
Inference framework for distributed generative AI model serving
Created 6 months ago · Updated 13 hours ago