ServerlessLLM by ServerlessLLM

Open-source framework for serverless LLM deployment

created 1 year ago
513 stars

Top 61.8% on sourcepulse

View on GitHub
Project Summary

ServerlessLLM provides a serverless framework for deploying and managing Large Language Models (LLMs) efficiently and affordably. It targets AI practitioners and researchers seeking to reduce the cost and complexity of custom LLM deployments, enabling elastic scaling and multi-model sharing on AI hardware.

How It Works

ServerlessLLM employs a full-stack, LLM-centric serverless system design: it optimizes checkpoint formats, inference runtimes (integrating with vLLM and HuggingFace Transformers), storage, and cluster scheduling. This design aims to sharply reduce LLM startup latency and improve GPU utilization by letting multiple models share hardware with minimal switching overhead, and by supporting seamless live migration of running inference; the sketch below illustrates the checkpoint path.
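
As a concrete illustration of the optimized checkpoint path, the following is a minimal sketch based on the project's sllm-store component. The save_model/load_model names, the storage_path layout, and the facebook/opt-1.3b model ID follow the project's documented examples, but treat them as assumptions and verify against your installed version.

```python
# Hedged sketch of ServerlessLLM's fast checkpoint path; API names are
# assumed from the project's docs and may differ between versions.
import time

import torch
from transformers import AutoModelForCausalLM

from sllm_store.transformers import load_model, save_model

# One-time conversion: save a HuggingFace model in the loading-optimized format.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.float16
)
save_model(model, "./models/facebook/opt-1.3b")

# At request time: load the optimized checkpoint straight onto the GPU.
start = time.time()
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
)
print(f"Model loading time: {time.time() - start:.2f}s")
```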

Quick Start & Requirements

  • Installation: pip install serverless-llm on the head node and pip install serverless-llm[worker] on worker nodes. Requires Python 3.10.
  • Prerequisites: NVIDIA and AMD GPUs are supported (AMD/ROCm support is experimental; see Limitations & Caveats). Integration with vLLM is highlighted.
  • Resources: the Quick Start Guide covers local cluster setup; the documentation covers detailed installation and API usage.
  • Links: Documentation, Quick Start Guide, Benchmark

Highlighted Details

  • Loads checkpoints 5-10X faster than Safetensors and the PyTorch checkpoint loader.
  • Achieves 5-100X lower startup latency than Ray Serve and KServe.
  • Supports seamless live migration of in-flight inference and multi-LLM GPU sharing.
  • Integrates with the OpenAI Query API and supports embedding-based RAG + LLM deployment, as sketched below.
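
Because the query interface is OpenAI-compatible, a deployed model can be reached with the standard openai Python client. The base URL, port, and model name below are illustrative assumptions; substitute the values from your own deployment.

```python
# Hedged sketch: querying a ServerlessLLM deployment via its
# OpenAI-compatible endpoint using the standard openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8343/v1",  # assumed local endpoint; adjust to your deployment
    api_key="EMPTY",  # a local server typically needs no real key
)

response = client.chat.completions.create(
    model="facebook/opt-1.3b",  # example model name; use your deployed model
    messages=[{"role": "user", "content": "What is serverless LLM inference?"}],
)
print(response.choices[0].message.content)
```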

Maintenance & Community

  • Maintained by a growing global team of over 10 developers.
  • Community channels available on Discord and WeChat.
  • Presented at NVIDIA HQ and invited to international AI tech forums.

Licensing & Compatibility

  • The README does not explicitly state a license type, and compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Fast checkpoint loading on AMD GPUs (ROCm) is experimental across vLLM, PyTorch, and HuggingFace Accelerate. The project is under active development, with new features driven by community input.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 16
  • Issues (30d): 4
  • Star History: 53 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

LightLLM by ModelTC

0.7% · 3k stars
Python framework for LLM inference and serving
created 2 years ago · updated 16 hours ago

Starred by Lewis Tunstall (researcher at Hugging Face), Robert Nishihara (cofounder of Anyscale; author of Ray), and 4 more.

verl by volcengine

2.4% · 12k stars
RL training library for LLMs
created 9 months ago · updated 15 hours ago