sglang-jax by sgl-project

High-performance LLM inference engine for JAX/TPU serving

Created 11 months ago

302 stars

Top 88.1% on SourcePulse

View on GitHub

5 Experts Love This Project

Matthew Johnson

Coauthor of JAX; Research Scientist at Google Brain

Roy Frostig

Coauthor of JAX; Research Scientist at Google DeepMind

Jeff Hammerbacher

Cofounder of Cloudera

Lianmin Zheng

Coauthor of SGLang, vLLM

and 1 more!

Project Summary

Summary

SGL-JAX is a high-performance, JAX-based inference engine for Large Language Models (LLMs), optimized for Google TPUs. It targets demanding LLM serving workloads by delivering exceptional throughput and low latency through state-of-the-art techniques for maximum hardware utilization.

How It Works

The engine employs an OpenAI-compatible HTTP API and a sophisticated scheduler for high-throughput continuous batching. It utilizes an optimized KV cache with a Radix Tree for memory-efficient prefix sharing and integrates FlashAttention kernels. Native tensor parallelism distributes large models across multiple TPU devices for scalable inference.

Quick Start & Requirements

Installation and quick start guides are detailed in the project's documentation. Primary requirements include a JAX/TPU environment. Further setup and usage instructions are available in the docs directory.

Highlighted Details

Features high-throughput continuous batching and an optimized KV cache with Radix Tree.
Integrates FlashAttention kernels and supports native tensor parallelism.
Provides an OpenAI-compatible API.
Offers first-class support for Qwen models (including MoE), Qwen 3 series (best performance), and multimodal models (Text-to-Video, Qwen2.5-VL).
Supports Eagle's Speculative Decoding for MiMo-7B.

Maintenance & Community

Contribution guidelines are provided. Community discussions occur on the SGLang Slack workspace. No specific details on core maintainers, sponsorships, or a public roadmap are present.

Licensing & Compatibility

The README does not specify a software license, requiring further investigation for usage restrictions, especially for commercial applications.

Limitations & Caveats

Performance requires improvement for several supported models, including Qwen, Qwen 2, Qwen 2 MoE, Llama, Bailing MoE, and MiMo-7B, indicating ongoing optimization efforts.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

24 stars in the last 30 days