tokasaurus by ScalingIntelligence

LLM inference engine for high-throughput workloads

created 1 month ago
386 stars

Top 75.3% on sourcepulse

Project Summary

Tokasaurus is an LLM inference engine designed for high-throughput workloads, aimed at researchers and engineers who need to serve large language models efficiently. It offers OpenAI-compatible APIs and advanced features such as dynamic parallelism and optimized KV caching to maximize throughput while keeping latency low.

How It Works

Tokasaurus employs a multi-process architecture with a shared web server for load balancing across data-parallel replicas, each with its own manager and model worker processes. Key innovations include a sophisticated scheduler that forecasts KV cache availability for aggressive sequence onboarding, Hydragen for efficient attention over shared prefixes, and end-to-end torch.compile with dynamic shapes and CUDA graphs for performance. Paged KV caching with prefix caching further optimizes memory usage.
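The Hydragen technique mentioned above rests on a general property of softmax attention: attention over a key/value sequence split into blocks (for example, a shared prefix and a per-sequence suffix) can be computed per block and recombined exactly using each block's log-sum-exp. A toy pure-Python sketch of that merge rule (not tokasaurus code; the vectors and split are illustrative):

```python
import math

def attend(q, keys, values):
    """Attention of one query over a key/value block.
    Returns (output, log-sum-exp of the scores) so blocks can be merged."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = [sum(e * v[d] for e, v in zip(exps, values)) / denom
           for d in range(len(values[0]))]
    return out, m + math.log(denom)

def merge(block_results):
    """Recombine per-block outputs, weighting each block by exp(lse_b)."""
    m = max(lse for _, lse in block_results)
    weights = [math.exp(lse - m) for _, lse in block_results]
    z = sum(weights)
    dim = len(block_results[0][0])
    merged = [0.0] * dim
    for (out, _), w in zip(block_results, weights):
        for d in range(dim):
            merged[d] += out[d] * w / z
    return merged

# Splitting keys/values into a "prefix" and "suffix" block gives the
# same result as attending over everything at once:
q = [1.0, 0.5]
prefix_k, prefix_v = [[0.2, 0.1], [0.9, -0.3]], [[1.0, 0.0], [0.0, 1.0]]
suffix_k, suffix_v = [[0.4, 0.8]], [[0.5, 0.5]]

full_out, _ = attend(q, prefix_k + suffix_k, prefix_v + suffix_v)
merged = merge([attend(q, prefix_k, prefix_v), attend(q, suffix_k, suffix_v)])
assert all(abs(a - b) < 1e-9 for a, b in zip(full_out, merged))
```

Because the prefix block is identical across sequences, its attention work can be batched once and reused, which is the efficiency Hydragen exploits for shared prefixes.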

Quick Start & Requirements

  • Install via pip: pip install tokasaurus or pip install -e . from source.
  • Launch engine: toka model=meta-llama/Llama-3.2-1B-Instruct
  • Ping engine: Use OpenAI client (e.g., client.completions.create(...)).
  • Supports Python >= 3.10.
  • Requires NVIDIA GPUs and CUDA. Multi-GPU support requires specifying dp_size, pp_size, tp_size.
  • Official blog post: https://scalingintelligence.stanford.edu/blogs/tokasaurus/
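The "ping engine" step above can be sketched with only the standard library. The port (10210) and endpoint path are assumptions — check the engine's startup output for the actual address; the official OpenAI client works equivalently with base_url pointed at the server:

```python
import json
import urllib.request

BASE_URL = "http://localhost:10210/v1"  # assumed default; adjust to your server

def build_completion_request(prompt: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build a POST request against the OpenAI-compatible /completions route."""
    body = json.dumps({
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("The capital of France is")
# To actually send it (requires a running tokasaurus server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same payload shape works for the chat endpoint (messages instead of prompt), since the server exposes OpenAI-compatible routes.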

Highlighted Details

  • Supports OpenAI chat, completions, and batch APIs.
  • Implements data, pipeline, and tensor parallelism, including AsyncTP.
  • Features Paged KV caching with prefix caching and Hydragen for efficient attention.
  • Offers end-to-end torch.compile with dynamic shapes and CUDA graphs.
  • Includes a scheduler for aggressive sequence onboarding and KV cache management.
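As a concrete illustration of combining these parallelism modes, a hypothetical multi-GPU launch might look like the following. The flag names (dp_size, pp_size, tp_size) come from the README; the values and GPU count are illustrative, not a recommended configuration:

```shell
# Illustrative: 2-way data parallelism x 2-way tensor parallelism (4 GPUs total).
toka model=meta-llama/Llama-3.2-1B-Instruct dp_size=2 tp_size=2 pp_size=1
```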

Maintenance & Community

  • Project authors include researchers from Stanford.
  • Citation available for research use.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • Described as a new project with potentially rough edges.
  • torch.compile can increase server startup time.
  • Hydragen may introduce slight numerical differences due to bfloat16 aggregation.
Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 390 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0% · 402 stars · Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

S-LoRA by S-LoRA

0.1% · 2k stars · System for scalable LoRA adapter serving
created 1 year ago · updated 1 year ago