sglang-jax by sgl-project

High-performance LLM inference engine for JAX/TPU serving

Created 8 months ago
264 stars

Top 96.6% on SourcePulse

Project Summary

SGL-JAX is a high-performance, JAX-based inference engine for Large Language Models (LLMs), optimized for Google TPUs. It targets demanding LLM serving workloads, aiming for high throughput and low latency through techniques that maximize hardware utilization.

How It Works

The engine exposes an OpenAI-compatible HTTP API and a scheduler that performs high-throughput continuous batching. The KV cache is organized as a radix tree for memory-efficient prefix sharing across requests, and attention is computed with FlashAttention kernels. Native tensor parallelism shards large models across multiple TPU devices for scalable inference.
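The prefix-sharing idea behind the radix-tree KV cache can be sketched in plain Python. This is an illustrative toy, not sglang-jax's implementation: the class and method names are invented, and a real cache would attach key/value blocks to each edge rather than just token IDs.

```python
class RadixNode:
    """One node in a toy radix tree keyed by token IDs."""

    def __init__(self):
        self.children = {}  # first token ID -> (token_run, child RadixNode)

    def insert(self, tokens):
        """Insert a token sequence, splitting edges on partial overlap."""
        if not tokens:
            return
        head = tokens[0]
        if head not in self.children:
            self.children[head] = (list(tokens), RadixNode())
            return
        run, child = self.children[head]
        # Length of the shared prefix between the stored edge and the new tokens.
        common = 0
        while common < min(len(run), len(tokens)) and run[common] == tokens[common]:
            common += 1
        if common < len(run):
            # Split the edge: keep the shared part, push the tail down a level.
            mid = RadixNode()
            mid.children[run[common]] = (run[common:], child)
            self.children[head] = (run[:common], mid)
            child = mid
        child.insert(tokens[common:])

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        if not tokens or tokens[0] not in self.children:
            return 0
        run, child = self.children[tokens[0]]
        common = 0
        while common < min(len(run), len(tokens)) and run[common] == tokens[common]:
            common += 1
        if common < len(run):
            return common  # match stopped partway along an edge
        return common + child.match_prefix(tokens[common:])


root = RadixNode()
root.insert([1, 2, 3, 4])      # e.g. a shared system prompt
root.insert([1, 2, 3, 9, 9])   # a second request sharing a prefix
print(root.match_prefix([1, 2, 3, 9, 7]))  # 4 tokens reusable from cache
```

Two requests that share a prompt prefix reuse the cached attention state for that prefix; only the divergent suffix needs fresh prefill work.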

Quick Start & Requirements

Installation and quick start guides are detailed in the project's documentation. Primary requirements include a JAX/TPU environment. Further setup and usage instructions are available in the docs directory.

Highlighted Details

  • Features high-throughput continuous batching and an optimized KV cache with Radix Tree.
  • Integrates FlashAttention kernels and supports native tensor parallelism.
  • Provides an OpenAI-compatible API.
  • Offers first-class support for Qwen models (including MoE), Qwen 3 series (best performance), and multimodal models (Text-to-Video, Qwen2.5-VL).
  • Supports EAGLE speculative decoding for MiMo-7B.
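The scheduling idea behind continuous batching, listed above, can be sketched in plain Python. This toy counts decode steps instead of running a model, and every name in it is invented for illustration; the point is that freed batch slots are back-filled immediately rather than waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop (hypothetical sketch).

    `requests` is a list of (name, n_decode_steps). Each engine step
    decodes one token for every running request; when a request
    finishes, a waiting request is admitted into its slot on the next
    step. Returns the step at which each request completed.
    """
    queue = deque(requests)
    running = {}       # name -> remaining decode steps
    finished_at = {}
    step = 0
    while queue or running:
        # Admit waiting requests into any free batch slots.
        while queue and len(running) < max_batch:
            name, steps = queue.popleft()
            running[name] = steps
        step += 1
        # One decode step for every running request.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished_at[name] = step
    return finished_at


print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
# {'b': 1, 'a': 3, 'c': 3}
```

With static batching, request "c" would wait until both "a" and "b" finished; here it slots in as soon as "b" completes, which is the source of the throughput gain.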

Maintenance & Community

Contribution guidelines are provided. Community discussions occur on the SGLang Slack workspace. No specific details on core maintainers, sponsorships, or a public roadmap are present.

Licensing & Compatibility

The README does not specify a software license, requiring further investigation for usage restrictions, especially for commercial applications.

Limitations & Caveats

The project notes that performance still needs improvement for several supported models (Qwen, Qwen 2, Qwen 2 MoE, Llama, Bailing MoE, and MiMo-7B), indicating optimization work is ongoing.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 24
  • Issues (30d): 18
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Elvis Saravia (Founder of DAIR.AI), and 2 more.

vllm-omni by vllm-project

  • Omni-modality model inference and serving framework
  • 4k stars · 5.2%
  • Created 7 months ago; updated 1 day ago