sglang by sgl-project

Fast serving framework for LLMs and vision language models

created 1 year ago
16,437 stars

Top 2.9% on sourcepulse

Project Summary

SGLang is a high-performance serving framework for large language models and vision-language models, designed to make interactions with models faster and more controllable. It targets researchers and developers who need efficient, flexible, and scalable model deployment, pairing significant runtime speedups with an expressive programming interface.

How It Works

SGLang co-designs a fast backend runtime with a flexible frontend language. The backend leverages optimizations like RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, and speculative decoding. The frontend provides an intuitive Pythonic interface for complex LLM programming, including chained generation, control flow, and multi-modal inputs. This integrated approach aims to deliver superior performance and programmability compared to separate runtime and API solutions.
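
To make the frontend concrete, here is a minimal sketch of a chained, multi-turn program using the documented sglang frontend API; the endpoint URL, questions, and token limits are placeholder assumptions:

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # Each `+=` appends to the prompt state; gen() executes on the backend.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

# Assumes an SGLang server is already running locally on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question_1="What is prefix caching?",
    question_2="Why does it speed up multi-turn chat?",
)
print(state["answer_1"])
print(state["answer_2"])
```

Because the second turn shares the first turn's prefix, RadixAttention can serve it from the cached KV state rather than recomputing it.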

Quick Start & Requirements

  • Install: pip install sglang (a launch-and-query sketch follows this list)
  • Prerequisites: Python 3.8+, PyTorch, and CUDA for GPU acceleration; specific models may require additional dependencies.
  • Resources: a GPU is recommended for optimal performance.
  • Docs: Documentation, Quick Start
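
As a concrete first run, the sketch below starts a server and then queries its OpenAI-compatible endpoint; the model path, port, and model name here are illustrative assumptions, not fixed values:

```python
# Start a server in a separate shell first (model path and port are examples):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
#
# SGLang exposes an OpenAI-compatible API, so the standard openai client works.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # the launched model is typically addressable as "default"
    messages=[{"role": "user", "content": "Name three benefits of continuous batching."}],
    temperature=0,
)
print(response.choices[0].message.content)
```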

Highlighted Details

  • Achieves up to 5x faster inference with RadixAttention prefix caching.
  • Enables 3x faster JSON decoding via compressed finite state machines (see the constrained-decoding sketch after this list).
  • Supports a wide array of models, including Llama, Gemma, Mistral, LLaVA, and embedding/reward models.
  • Offers advanced features such as tensor parallelism, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
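
The fast JSON decoding works by constraining generation to a grammar: a compressed finite state machine lets the runtime emit deterministic spans (braces, keys, quotes) in a single step instead of token by token. Below is a minimal sketch using the frontend's regex constraint; the pattern, field names, and endpoint are illustrative assumptions:

```python
import sglang as sgl

# Illustrative regex for a tiny JSON object; real schemas can be much richer.
CITY_JSON = r'\{"name": "[\w ]+", "population": [0-9]+\}'

@sgl.function
def city_info(s, city):
    s += sgl.user(f"Return facts about {city} as JSON.")
    # regex= restricts decoding to strings matching the pattern, so the
    # output is guaranteed to be well-formed JSON of this shape.
    s += sgl.assistant(sgl.gen("json_output", max_tokens=128, regex=CITY_JSON))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = city_info.run(city="Paris")
print(state["json_output"])
```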

Maintenance & Community

The project is actively maintained with frequent releases and has significant industry adoption, powering trillions of tokens daily. It is backed by numerous institutions including AMD, NVIDIA, LMSYS, Stanford, and UC Berkeley. Community engagement is encouraged via Slack and bi-weekly development meetings.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

SGLang's feature set and performance claims are extensive, but some advanced optimizations, such as RadixAttention, are marked as experimental. The project also acknowledges reusing code and design ideas from several other LLM serving frameworks.

Health Check

  • Last commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 683
  • Issues (30d): 408

Star History

2,679 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
Top 1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

LightLLM by ModelTC
Top 0.7% · 3k stars
Python framework for LLM inference and serving
created 2 years ago · updated 11 hours ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
Top 0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
Top 0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago · updated 14 hours ago
Starred by Lewis Tunstall (Researcher at Hugging Face), Robert Nishihara (Cofounder of Anyscale; Author of Ray), and 4 more.

verl by volcengine
Top 2.4% · 12k stars
RL training library for LLMs
created 9 months ago · updated 10 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
Top 0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 10 hours ago