sglang by sgl-project

Fast serving framework for LLMs and vision language models

Created 1 year ago
18,006 stars

Top 2.5% on SourcePulse

View on GitHub
Project Summary

SGLang is a high-performance serving framework for large language and vision-language models, designed to accelerate LLM interactions and enhance control. It targets researchers and developers needing efficient, flexible, and scalable model deployment, offering significant speedups and advanced programming capabilities.

How It Works

SGLang co-designs a fast backend runtime with a flexible frontend language. The backend leverages optimizations like RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, and speculative decoding. The frontend provides an intuitive Pythonic interface for complex LLM programming, including chained generation, control flow, and multi-modal inputs. This integrated approach aims to deliver superior performance and programmability compared to separate runtime and API solutions.
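
The sketch below illustrates that frontend style under stated assumptions: the sglang Python package is installed and an SGLang runtime is already serving a model locally (the endpoint URL, token budgets, and prompts are placeholders, not project defaults).

```python
# Minimal sketch of SGLang's Pythonic frontend: chained multi-turn generation
# against a locally running SGLang runtime. Endpoint URL and max_tokens values
# are illustrative assumptions.
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

# Point the frontend at a running SGLang server (port 30000 is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question_1="What is prefix caching?",
    question_2="Why does it help multi-turn chat?",
)
print(state["answer_1"])
print(state["answer_2"])
```

Because both turns extend the same growing prefix, the backend's prefix cache (RadixAttention) can reuse earlier KV state instead of recomputing it.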

Quick Start & Requirements

  • Install: pip install sglang (a minimal launch-and-query sketch follows this list)
  • Prerequisites: Python 3.8+, PyTorch, CUDA (for GPU acceleration). Specific model support may require additional dependencies.
  • Resources: GPU recommended for optimal performance.
  • Docs: Documentation, Quick Start
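
The snippet below is a hedged quick-start sketch: the shell command in the comment launches the server, and the Python code queries its OpenAI-compatible endpoint. The model path, port, and request fields are placeholders; consult the Quick Start docs for current flags.

```python
# Quick-start sketch. Model name, port, and payload fields are illustrative
# assumptions; check the project's Quick Start docs for current options.
#
# 1) Launch the server (run in a shell):
#    python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
#
# 2) Query the OpenAI-compatible chat completions endpoint:
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```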

Highlighted Details

  • Achieves up to 5x faster inference with RadixAttention.
  • Enables 3x faster JSON decoding via compressed finite state machines (see the constrained-decoding sketch after this list).
  • Supports a wide array of models including Llama, Gemma, Mistral, LLaVA, and embedding/reward models.
  • Offers advanced features like tensor parallelism, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
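
As a hedged illustration of the constrained JSON decoding mentioned above, the sketch below passes a regex to sgl.gen so generation is restricted to a JSON shape. The schema, endpoint, and field names are illustrative assumptions, and a running backend is assumed.

```python
# Sketch of regex-constrained generation for JSON-shaped output.
# The regex/schema and endpoint are illustrative assumptions.
import sglang as sgl

# Regex describing the JSON shape we want the model to emit.
city_regex = (
    r'\{\n'
    r'  "name": "[\w ]+",\n'
    r'  "population": [0-9]+\n'
    r'\}'
)

@sgl.function
def city_info(s, city):
    s += sgl.user(f"Return basic facts about {city} as JSON.")
    s += sgl.assistant(sgl.gen("json_output", max_tokens=128, regex=city_regex))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = city_info.run(city="Paris")
print(state["json_output"])
```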

Maintenance & Community

The project is actively maintained with frequent releases and has significant industry adoption, powering trillions of tokens daily. It is backed by numerous institutions including AMD, NVIDIA, LMSYS, Stanford, and UC Berkeley. Community engagement is encouraged via Slack and bi-weekly development meetings.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Despite the extensive feature set and performance claims, some advanced optimizations, such as RadixAttention, are flagged as experimental. The project also acknowledges reusing code and design ideas from several other LLM serving frameworks.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1,037
  • Issues (30d): 513
  • Star History: 1,037 stars in the last 30 days

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

Explore Similar Projects

mistral.rs by EricLBuehler
Top 0.3% on SourcePulse · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 22 hours ago