LLM inference engine for high-throughput workloads
Tokasaurus is an LLM inference engine designed for high-throughput workloads, aimed at researchers and engineers who need to serve large language models efficiently. It offers OpenAI-compatible APIs and advanced features like dynamic parallelism and optimized KV caching to maximize throughput and minimize latency.
How It Works
Tokasaurus employs a multi-process architecture with a shared web server that load-balances across data-parallel replicas, each with its own manager and model worker processes. Key innovations include a scheduler that forecasts KV cache availability to onboard sequences aggressively, Hydragen for efficient attention over shared prefixes, and end-to-end torch.compile with dynamic shapes and CUDA graphs. Paged KV caching with prefix caching further optimizes memory usage.
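The shared-prefix idea behind Hydragen is easiest to see in the math: attention over a concatenated KV cache can be computed block by block and recombined exactly using log-sum-exp weights, so the prefix block, shared by many sequences, only needs to be attended to once in a batched pass. Below is a minimal single-head sketch of that rescaling identity; it is illustrative only, not Tokasaurus's kernel, and all names are made up for the example.

```python
import torch

def attend(q, k, v):
    # Attention over one KV block; also return the log-sum-exp of the
    # scores so partial results can be recombined exactly later.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    lse = torch.logsumexp(scores, dim=-1)    # [n_queries]
    out = torch.softmax(scores, dim=-1) @ v  # [n_queries, d]
    return out, lse

def combine(out_a, lse_a, out_b, lse_b):
    # Merge attention computed separately over two KV blocks (e.g. a shared
    # prefix and a per-sequence suffix) into attention over their union.
    lse = torch.logaddexp(lse_a, lse_b)
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)
    return w_a * out_a + w_b * out_b

d = 64
q = torch.randn(5, d)
k_prefix, v_prefix = torch.randn(100, d), torch.randn(100, d)
k_suffix, v_suffix = torch.randn(7, d), torch.randn(7, d)

o_p, l_p = attend(q, k_prefix, v_prefix)  # shared prefix: batchable across sequences
o_s, l_s = attend(q, k_suffix, v_suffix)  # per-sequence suffix
merged = combine(o_p, l_p, o_s, l_s)

# Sanity check: matches attention over the concatenated KV cache.
full, _ = attend(q, torch.cat([k_prefix, k_suffix]), torch.cat([v_prefix, v_suffix]))
assert torch.allclose(merged, full, atol=1e-5)
```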
Quick Start & Requirements
Install with pip install tokasaurus, or from source with pip install -e . Launch a server with toka model=meta-llama/Llama-3.2-1B-Instruct and query it through the OpenAI-compatible API (e.g. client.completions.create(...)). Data, pipeline, and tensor parallelism are set with the dp_size, pp_size, and tp_size options.
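For concreteness, here is a minimal client sketch using the openai Python package against a running server. The base URL, port, and api_key value are assumptions for illustration; check the server's startup logs for the address it actually binds.

```python
from openai import OpenAI

# Point the client at the local Tokasaurus server; the port below is an
# assumption for illustration, not a documented default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Write a haiku about throughput:",
    max_tokens=48,
)
print(resp.choices[0].text)
```

Scaling out presumably follows the same key=value style as the launch command, e.g. toka model=... dp_size=2 tp_size=2, extrapolating from the options above.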
Highlighted Details
End-to-end torch.compile with dynamic shapes and CUDA graphs.
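As a rough illustration of what this highlight refers to (not the engine's actual code), the standard PyTorch recipe looks like the sketch below: dynamic=True compiles with symbolic shapes so varying batch sizes avoid full recompiles, and "reduce-overhead" mode uses CUDA graphs to cut per-launch CPU overhead.

```python
import torch

model = torch.nn.Linear(4096, 4096).half().cuda()

# dynamic=True traces with symbolic shapes; "reduce-overhead" mode
# captures and replays CUDA graphs to reduce kernel-launch overhead.
compiled = torch.compile(model, dynamic=True, mode="reduce-overhead")

with torch.inference_mode():
    for batch in (1, 8, 64):  # varying shapes exercise the dynamic path
        x = torch.randn(batch, 4096, device="cuda", dtype=torch.half)
        y = compiled(x)
```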
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
torch.compile can increase server startup time.