ScalingIntelligence/tokasaurus: LLM inference engine for high-throughput workloads
Top 66.0% on SourcePulse
Tokasaurus is an LLM inference engine designed for high-throughput workloads, aimed at researchers and engineers who need to serve large language models efficiently. It offers OpenAI-compatible APIs and advanced features like dynamic parallelism and optimized KV caching to maximize throughput and minimize latency.
How It Works
Tokasaurus employs a multi-process architecture with a shared web server for load balancing across data-parallel replicas, each with its own manager and model worker processes. Key innovations include a sophisticated scheduler that forecasts KV cache availability for aggressive sequence onboarding, Hydragen for efficient attention over shared prefixes, and end-to-end torch.compile with dynamic shapes and CUDA graphs for performance. Paged KV caching with prefix caching further optimizes memory usage.
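To make the prefix-caching idea concrete, here is a minimal, hypothetical sketch (not Tokasaurus's actual code) of a paged KV cache in which sequences that share a prompt prefix point at the same physical pages; `BLOCK_SIZE` and the `BlockTable` structure are illustrative assumptions.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative choice)

@dataclass
class BlockTable:
    free_pages: list[int]                             # physical pages not in use
    prefix_index: dict = field(default_factory=dict)  # prefix key -> physical page
    ref_counts: dict = field(default_factory=dict)    # physical page -> users

    def allocate(self, tokens: list[int]) -> list[int]:
        """Map a token sequence to physical pages, sharing pages for common prefixes."""
        pages, prev_key = [], ()
        for start in range(0, len(tokens), BLOCK_SIZE):
            block = tuple(tokens[start:start + BLOCK_SIZE])
            key = (prev_key, block)                  # identity depends on the whole prefix
            if len(block) == BLOCK_SIZE and key in self.prefix_index:
                page = self.prefix_index[key]        # hit: reuse the cached page
            else:
                page = self.free_pages.pop()         # miss: claim a fresh page
                if len(block) == BLOCK_SIZE:         # only full blocks are shareable
                    self.prefix_index[key] = page
            self.ref_counts[page] = self.ref_counts.get(page, 0) + 1
            pages.append(page)
            prev_key = key
        return pages

# Two requests that share a 64-token system prompt reuse the same four pages.
table = BlockTable(free_pages=list(range(1024)))
shared = list(range(64))
a = table.allocate(shared + [101, 102])
b = table.allocate(shared + [201, 202])
assert a[:4] == b[:4] and a[4] != b[4]
```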
Quick Start & Requirements
- Install with pip install tokasaurus, or pip install -e . from source.
- Launch a server with toka model=meta-llama/Llama-3.2-1B-Instruct.
- Query it with any OpenAI-compatible client (e.g., client.completions.create(...)), as shown in the sketch below.
- Configure data, pipeline, and tensor parallelism via dp_size, pp_size, and tp_size.
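A minimal client-side sketch, assuming the server launched above is listening locally; the port below is a placeholder, so use the address Tokasaurus prints at startup:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Tokasaurus server.
# base_url is an assumption; check the address printed when `toka` starts.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Write a haiku about KV caches.",
    max_tokens=64,
)
print(response.choices[0].text)
```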
Highlighted Details
- End-to-end torch.compile with dynamic shapes and CUDA graphs (see the sketch below).
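For context, this is roughly the stock-PyTorch combination involved (a sketch, not Tokasaurus's actual integration): dynamic=True compiles with symbolic shapes so varying batch sizes avoid recompilation, and mode="reduce-overhead" enables CUDA-graph capture to cut per-launch CPU overhead.

```python
import torch

# Stand-in module; Tokasaurus compiles its model workers end to end.
layer = torch.nn.Linear(4096, 4096).cuda().half()

# dynamic=True: trace with a symbolic batch dimension (no recompile per shape);
# mode="reduce-overhead": capture CUDA graphs to reduce kernel-launch overhead.
compiled = torch.compile(layer, mode="reduce-overhead", dynamic=True)

with torch.no_grad():
    for batch in (1, 8, 32):  # different batch sizes reuse the compiled artifact
        x = torch.randn(batch, 4096, device="cuda", dtype=torch.half)
        y = compiled(x)
        print(batch, y.shape)
```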
Maintenance & Community

Licensing & Compatibility
Limitations & Caveats
- torch.compile can increase server startup time.