sglang-jax  by sgl-project

High-performance LLM inference engine for JAX/TPU serving

Created 10 months ago
275 stars

Top 94.0% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

SGL-JAX is a high-performance, JAX-based inference engine for Large Language Models (LLMs), optimized for Google TPUs. It targets demanding LLM serving workloads by delivering exceptional throughput and low latency through state-of-the-art techniques for maximum hardware utilization.

How It Works

The engine employs an OpenAI-compatible HTTP API and a sophisticated scheduler for high-throughput continuous batching. It utilizes an optimized KV cache with a Radix Tree for memory-efficient prefix sharing and integrates FlashAttention kernels. Native tensor parallelism distributes large models across multiple TPU devices for scalable inference.

Quick Start & Requirements

Installation and quick start guides are detailed in the project's documentation. Primary requirements include a JAX/TPU environment. Further setup and usage instructions are available in the docs directory.

Highlighted Details

  • Features high-throughput continuous batching and an optimized KV cache with Radix Tree.
  • Integrates FlashAttention kernels and supports native tensor parallelism.
  • Provides an OpenAI-compatible API.
  • Offers first-class support for Qwen models (including MoE), Qwen 3 series (best performance), and multimodal models (Text-to-Video, Qwen2.5-VL).
  • Supports Eagle's Speculative Decoding for MiMo-7B.

Maintenance & Community

Contribution guidelines are provided. Community discussions occur on the SGLang Slack workspace. No specific details on core maintainers, sponsorships, or a public roadmap are present.

Licensing & Compatibility

The README does not specify a software license, requiring further investigation for usage restrictions, especially for commercial applications.

Limitations & Caveats

Performance requires improvement for several supported models, including Qwen, Qwen 2, Qwen 2 MoE, Llama, Bailing MoE, and MiMo-7B, indicating ongoing optimization efforts.

Health Check
Last Commit

9 hours ago

Responsiveness

Inactive

Pull Requests (30d)
160
Issues (30d)
85
Star History
8 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen Patrick von Platen(Author of Hugging Face Diffusers; Research Engineer at Mistral), Elvis Saravia Elvis Saravia(Founder of DAIR.AI), and
2 more.

vllm-omni by vllm-project

1.4%
5k
Omni-modality model inference and serving framework
Created 8 months ago
Updated 11 hours ago
Feedback? Help us improve.