sglang by sgl-project

Fast serving framework for LLMs and vision language models

Created 2 years ago
25,627 stars

Top 1.7% on SourcePulse

View on GitHub
Project Summary

SGLang is a high-performance serving framework for large language and vision-language models, designed to accelerate LLM interactions and enhance control. It targets researchers and developers needing efficient, flexible, and scalable model deployment, offering significant speedups and advanced programming capabilities.

How It Works

SGLang co-designs a fast backend runtime with a flexible frontend language. The backend leverages optimizations like RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, and speculative decoding. The frontend provides an intuitive Pythonic interface for complex LLM programming, including chained generation, control flow, and multi-modal inputs. This integrated approach aims to deliver superior performance and programmability compared to separate runtime and API solutions.
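The prefix-reuse idea behind RadixAttention can be illustrated with a toy radix-tree cache. This is a conceptual sketch only: the class and method names are hypothetical, and the real runtime caches KV tensors on the GPU rather than token lists.

```python
# Conceptual sketch of radix-tree prefix caching (the idea behind
# RadixAttention): requests that share a prompt prefix reuse cached
# entries instead of recomputing them. Hypothetical names throughout.

class RadixCache:
    def __init__(self):
        self.root = {}  # token -> child subtree

    def insert(self, tokens):
        """Record a processed token sequence in the cache."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, hits = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            hits += 1
        return hits

cache = RadixCache()
cache.insert(["You", "are", "a", "helpful", "assistant", "."])
# A new request sharing the system prompt reuses 4 cached positions:
reused = cache.match_prefix(["You", "are", "a", "helpful", "bot"])
print(reused)  # 4
```

In the real system, a cache hit means the attention keys/values for the shared prefix are reused across requests, which is where the reported speedups on prefix-heavy workloads come from.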

Quick Start & Requirements

  • Install: pip install sglang
  • Prerequisites: Python 3.8+, PyTorch, CUDA (for GPU acceleration). Specific model support may require additional dependencies.
  • Resources: GPU recommended for optimal performance.
  • Docs: Documentation, Quick Start

Highlighted Details

  • Achieves up to 5x faster inference with RadixAttention.
  • Enables 3x faster JSON decoding via compressed finite state machines.
  • Supports a wide array of models including Llama, Gemma, Mistral, LLaVA, and embedding/reward models.
  • Offers advanced features like tensor parallelism, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
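The compressed-FSM idea behind fast JSON decoding can be sketched in miniature: structural characters of the output schema admit exactly one legal token, so the decoder can emit whole forced runs in one jump and only consult the model in free-text regions. The sketch below hard-codes a `{"name": "..."}` schema; the names and structure are illustrative, not SGLang's implementation.

```python
# Toy constrained decoder: structural characters are forced (emitted
# in one jump, no model call), and the "model" is only consulted for
# the free-text value. Illustrative sketch, not SGLang internals.

def decode(choose_free_char):
    template = '{"name": "'
    out = template  # forced structural run: emitted in a single jump
    while True:     # free-text state: model picks characters
        c = choose_free_char()
        out += c
        if c == '"':  # closing quote exits the free-text state
            break
    out += "}"      # closing brace is forced again
    return out

# Stub "model" that emits s, g, l, then closes the string:
chars = iter('sgl"')
print(decode(lambda: next(chars)))  # {"name": "sgl"}
```

Skipping model calls on forced runs is what makes the compressed FSM faster than checking the grammar one token at a time.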

Maintenance & Community

The project is actively maintained with frequent releases and has significant industry adoption, powering trillions of tokens daily. It is backed by numerous institutions including AMD, NVIDIA, LMSYS, Stanford, and UC Berkeley. Community engagement is encouraged via Slack and bi-weekly development meetings.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

While SGLang offers an extensive feature set and strong performance claims, some advanced optimizations, such as RadixAttention, are still marked experimental. The project also acknowledges reusing code and design ideas from several other LLM serving frameworks.

Health Check

  • Last Commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1,833
  • Issues (30d): 554
  • Star History: 1,699 stars in the last 30 days

Explore Similar Projects

Starred by Matthew Johnson (Coauthor of JAX; Research Scientist at Google Brain), Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), and 3 more.

sglang-jax by sgl-project

Top 1.5% · 264 stars
High-performance LLM inference engine for JAX/TPU serving
Created 8 months ago · Updated 1 day ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

Top 1.6% · 7k stars
LLM inference engine for blazing fast performance
Created 2 years ago · Updated 1 day ago