tokasaurus by ScalingIntelligence

LLM inference engine for high-throughput workloads

Created 3 months ago
418 stars

Top 70.3% on SourcePulse

View on GitHub
Project Summary

Tokasaurus is an LLM inference engine designed for high-throughput workloads, targeting researchers and engineers who need to serve large language models efficiently. It offers OpenAI-compatible APIs and advanced features like dynamic parallelism and optimized KV caching to maximize throughput and minimize latency.

How It Works

Tokasaurus employs a multi-process architecture with a shared web server for load balancing across data-parallel replicas, each with its own manager and model worker processes. Key innovations include a sophisticated scheduler that forecasts KV cache availability for aggressive sequence onboarding, Hydragen for efficient attention over shared prefixes, and end-to-end torch.compile with dynamic shapes and CUDA graphs for performance. Paged KV caching with prefix caching further optimizes memory usage.
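For intuition, here is a minimal, hypothetical PyTorch sketch of the shared-prefix decomposition that Hydragen-style attention relies on: attention is computed once over the shared prefix and separately over each sequence's unique suffix, and the two partial results are merged exactly via log-sum-exp rescaling. The single-head layout, tensor shapes, and function names are illustrative assumptions, not tokasaurus's actual kernels.

    import torch

    def partial_attention(q, k, v):
        # q: [n_queries, d]; k, v: [n_keys, d] (single head, illustrative shapes)
        scores = (q @ k.T) / (q.shape[-1] ** 0.5)
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)  # per-query normalizer
        out = torch.softmax(scores, dim=-1) @ v
        return out, lse

    def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
        # Attention over the shared prefix; in a real batched kernel this matmul
        # is shared across every sequence that reuses the same prefix.
        out_p, lse_p = partial_attention(q, k_prefix, v_prefix)
        # Attention over this sequence's unique suffix tokens.
        out_s, lse_s = partial_attention(q, k_suffix, v_suffix)
        # Merge the two partial softmaxes exactly using their log-sum-exp weights.
        m = torch.maximum(lse_p, lse_s)
        w_p, w_s = torch.exp(lse_p - m), torch.exp(lse_s - m)
        return (w_p * out_p + w_s * out_s) / (w_p + w_s)

Because the merge is exact up to floating-point rounding, the main practical effect (noted under Limitations & Caveats) is slight numerical differences from bfloat16 aggregation.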

Quick Start & Requirements

  • Install via pip: pip install tokasaurus or pip install -e . from source.
  • Launch engine: toka model=meta-llama/Llama-3.2-1B-Instruct
  • Ping engine: use the OpenAI client (e.g., client.completions.create(...)); see the example after this list.
  • Supports Python >= 3.10.
  • Requires NVIDIA GPUs and CUDA. Multi-GPU support requires specifying dp_size, pp_size, tp_size.
  • Official blog post: https://scalingintelligence.stanford.edu/blogs/tokasaurus/
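A minimal, hypothetical end-to-end example combining the steps above. The port, base URL, and API key are assumptions; only the toka launch command, the model name, and the dp_size/pp_size/tp_size flags come from the list above, so check the README for exact options.

    # Assumes the server was started with, e.g.:
    #   toka model=meta-llama/Llama-3.2-1B-Instruct
    # (add dp_size / pp_size / tp_size for multi-GPU runs, per the README)
    from openai import OpenAI

    # Base URL and API key are placeholders; point the client at wherever toka is serving.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    response = client.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",
        prompt="Say hello from tokasaurus!",
        max_tokens=32,
    )
    print(response.choices[0].text)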

Highlighted Details

  • Supports OpenAI chat, completions, and batch APIs.
  • Implements data, pipeline, and tensor parallelism, including AsyncTP.
  • Features Paged KV caching with prefix caching and Hydragen for efficient attention.
  • Offers end-to-end torch.compile with dynamic shapes and CUDA graphs.
  • Includes a scheduler for aggressive sequence onboarding and KV cache management.

Maintenance & Community

  • Project authors include researchers from Stanford.
  • Citation available for research use.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • Described as a new project with potentially rough edges.
  • torch.compile can increase server startup time.
  • Hydragen may introduce slight numerical differences due to bfloat16 aggregation.
Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 21 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.2%
889
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago