sglang by sgl-project

Fast serving framework for LLMs and vision language models

Created 1 year ago
18,006 stars

Top 2.5% on SourcePulse

View on GitHub
Project Summary

SGLang is a high-performance serving framework for large language and vision-language models, designed to accelerate LLM interactions and enhance control. It targets researchers and developers needing efficient, flexible, and scalable model deployment, offering significant speedups and advanced programming capabilities.

How It Works

SGLang co-designs a fast backend runtime with a flexible frontend language. The backend leverages optimizations like RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, and speculative decoding. The frontend provides an intuitive Pythonic interface for complex LLM programming, including chained generation, control flow, and multi-modal inputs. This integrated approach aims to deliver superior performance and programmability compared to separate runtime and API solutions.
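
The sketch below illustrates that frontend style under stated assumptions: the sglang Python package is installed and an SGLang runtime is already serving a model locally (the endpoint URL, token budgets, and prompts are placeholders, not project defaults).

```python
# Minimal sketch of SGLang's Pythonic frontend: chained multi-turn generation
# against a locally running SGLang runtime. Endpoint URL and max_tokens values
# are illustrative assumptions.
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

# Point the frontend at a running SGLang server (port 30000 is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question_1="What is prefix caching?",
    question_2="Why does it help multi-turn chat?",
)
print(state["answer_1"])
print(state["answer_2"])
```

Because both turns extend the same growing prefix, the backend's prefix cache (RadixAttention) can reuse earlier KV state instead of recomputing it.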

Quick Start & Requirements

  • Install: pip install sglang (a minimal launch-and-query sketch follows this list)
  • Prerequisites: Python 3.8+, PyTorch, CUDA (for GPU acceleration). Specific model support may require additional dependencies.
  • Resources: GPU recommended for optimal performance.
  • Docs: Documentation, Quick Start
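
The snippet below is a hedged quick-start sketch: the shell command in the comment launches the server, and the Python code queries its OpenAI-compatible endpoint. The model path, port, and request fields are placeholders; consult the Quick Start docs for current flags.

```python
# Quick-start sketch. Model name, port, and payload fields are illustrative
# assumptions; check the project's Quick Start docs for current options.
#
# 1) Launch the server (run in a shell):
#    python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
#
# 2) Query the OpenAI-compatible chat completions endpoint:
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```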

Highlighted Details

  • Achieves up to 5x faster inference with RadixAttention.
  • Enables 3x faster JSON decoding via compressed finite state machines (see the constrained-decoding sketch after this list).
  • Supports a wide array of models including Llama, Gemma, Mistral, LLaVA, and embedding/reward models.
  • Offers advanced features like tensor parallelism, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
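
As a hedged illustration of the constrained JSON decoding mentioned above, the sketch below passes a regex to sgl.gen so generation is restricted to a JSON shape. The schema, endpoint, and field names are illustrative assumptions, and a running backend is assumed.

```python
# Sketch of regex-constrained generation for JSON-shaped output.
# The regex/schema and endpoint are illustrative assumptions.
import sglang as sgl

# Regex describing the JSON shape we want the model to emit.
city_regex = (
    r'\{\n'
    r'  "name": "[\w ]+",\n'
    r'  "population": [0-9]+\n'
    r'\}'
)

@sgl.function
def city_info(s, city):
    s += sgl.user(f"Return basic facts about {city} as JSON.")
    s += sgl.assistant(sgl.gen("json_output", max_tokens=128, regex=city_regex))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = city_info.run(city="Paris")
print(state["json_output"])
```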

Maintenance & Community

The project is actively maintained with frequent releases and has significant industry adoption, powering trillions of tokens daily. It is backed by numerous institutions including AMD, NVIDIA, LMSYS, Stanford, and UC Berkeley. Community engagement is encouraged via Slack and bi-weekly development meetings.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Despite the extensive feature set and performance claims, some advanced optimizations, such as RadixAttention, are flagged as experimental. The project also acknowledges reusing code and design ideas from several other LLM serving frameworks.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1,037
  • Issues (30d): 513
  • Star History: 1,037 stars in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

Explore Similar Projects

mistral.rs by EricLBuehler
Top 0.3% on SourcePulse · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 22 hours ago