LLM inference engine optimized for throughput and memory on XLA devices
JetStream is a high-throughput, memory-optimized inference engine for Large Language Models (LLMs) targeting XLA devices, with initial support for TPUs and future expansion to GPUs. It aims to provide efficient LLM serving for researchers and developers working with large models on specialized hardware.
How It Works
JetStream offers two distinct engine implementations: one built on JAX (leveraging MaxText) and another for PyTorch. This dual approach allows users to choose the framework that best suits their existing model development pipeline. The engine is designed for throughput and memory efficiency, crucial for deploying large LLMs.
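The throughput-oriented design described above typically separates a one-time *prefill* step (processing the prompt) from batched *decode* steps that advance many requests at once. The sketch below illustrates that split in plain Python; the class and method names are illustrative assumptions for this summary, not JetStream's actual API.

```python
# Illustrative sketch of the prefill/decode split common to LLM serving
# engines such as JetStream. All names here (ToyEngine, DecodeSlot, etc.)
# are hypothetical and do not reflect JetStream's real interfaces.

from dataclasses import dataclass, field


@dataclass
class DecodeSlot:
    """One active request occupying a batch slot."""
    prompt: str
    tokens: list = field(default_factory=list)


class ToyEngine:
    """Admits requests via prefill(), then advances the whole batch
    one token per generate() call (throughput over latency)."""

    def __init__(self, max_slots: int = 4):
        self.slots: dict[int, DecodeSlot] = {}
        self.max_slots = max_slots
        self._next_id = 0

    def prefill(self, prompt: str) -> int:
        # Admit a new request if a decode slot is free.
        if len(self.slots) >= self.max_slots:
            raise RuntimeError("no free decode slot")
        rid = self._next_id
        self._next_id += 1
        self.slots[rid] = DecodeSlot(prompt)
        return rid

    def generate(self) -> None:
        # One batched decode step: every active request emits a token.
        for slot in self.slots.values():
            slot.tokens.append(f"tok{len(slot.tokens)}")


engine = ToyEngine()
rid = engine.prefill("hello")
engine.generate()
engine.generate()
print(engine.slots[rid].tokens)
```

Batching every active request into a single decode step is what makes this style of engine throughput-efficient on accelerators, at the cost of per-request latency.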
Quick Start & Requirements
```shell
make install-deps
```
(followed by running server/testing commands).

Highlighted Details
Maintenance & Community
The project is hosted by AI-Hypercomputer, a Google entity. Specific community channels or active contributor details are not provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Current support is primarily focused on TPUs, with GPU support listed as a future goal and open to contributions. The setup guide points to Cloud TPU VMs, suggesting that local development without TPU hardware is limited out of the box.