tokasaurus by ScalingIntelligence

LLM inference engine for high-throughput workloads

Created 3 months ago
418 stars

Top 70.3% on SourcePulse

View on GitHub
Project Summary

Tokasaurus is an LLM inference engine designed for high-throughput workloads, targeting researchers and engineers who need to serve large language models efficiently. It offers OpenAI-compatible APIs and advanced features like dynamic parallelism and optimized KV caching to maximize throughput and minimize latency.

How It Works

Tokasaurus employs a multi-process architecture with a shared web server for load balancing across data-parallel replicas, each with its own manager and model worker processes. Key innovations include a sophisticated scheduler that forecasts KV cache availability for aggressive sequence onboarding, Hydragen for efficient attention over shared prefixes, and end-to-end torch.compile with dynamic shapes and CUDA graphs for performance. Paged KV caching with prefix caching further optimizes memory usage.
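For intuition, here is a minimal, hypothetical PyTorch sketch of the shared-prefix decomposition that Hydragen-style attention relies on: attention is computed once over the shared prefix and separately over each sequence's unique suffix, and the two partial results are merged exactly via log-sum-exp rescaling. The single-head layout, tensor shapes, and function names are illustrative assumptions, not tokasaurus's actual kernels.

    import torch

    def partial_attention(q, k, v):
        # q: [n_queries, d]; k, v: [n_keys, d] (single head, illustrative shapes)
        scores = (q @ k.T) / (q.shape[-1] ** 0.5)
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)  # per-query normalizer
        out = torch.softmax(scores, dim=-1) @ v
        return out, lse

    def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
        # Attention over the shared prefix; in a real batched kernel this matmul
        # is shared across every sequence that reuses the same prefix.
        out_p, lse_p = partial_attention(q, k_prefix, v_prefix)
        # Attention over this sequence's unique suffix tokens.
        out_s, lse_s = partial_attention(q, k_suffix, v_suffix)
        # Merge the two partial softmaxes exactly using their log-sum-exp weights.
        m = torch.maximum(lse_p, lse_s)
        w_p, w_s = torch.exp(lse_p - m), torch.exp(lse_s - m)
        return (w_p * out_p + w_s * out_s) / (w_p + w_s)

Because the merge is exact up to floating-point rounding, the main practical effect (noted under Limitations & Caveats) is slight numerical differences from bfloat16 aggregation.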

Quick Start & Requirements

  • Install via pip: pip install tokasaurus or pip install -e . from source.
  • Launch engine: toka model=meta-llama/Llama-3.2-1B-Instruct
  • Ping engine: use the OpenAI client (e.g., client.completions.create(...)); see the example after this list.
  • Supports Python >= 3.10.
  • Requires NVIDIA GPUs and CUDA. Multi-GPU support requires specifying dp_size, pp_size, tp_size.
  • Official blog post: https://scalingintelligence.stanford.edu/blogs/tokasaurus/
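A minimal, hypothetical end-to-end example combining the steps above. The port, base URL, and API key are assumptions; only the toka launch command, the model name, and the dp_size/pp_size/tp_size flags come from the list above, so check the README for exact options.

    # Assumes the server was started with, e.g.:
    #   toka model=meta-llama/Llama-3.2-1B-Instruct
    # (add dp_size / pp_size / tp_size for multi-GPU runs, per the README)
    from openai import OpenAI

    # Base URL and API key are placeholders; point the client at wherever toka is serving.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    response = client.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",
        prompt="Say hello from tokasaurus!",
        max_tokens=32,
    )
    print(response.choices[0].text)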

Highlighted Details

  • Supports OpenAI chat, completions, and batch APIs.
  • Implements data, pipeline, and tensor parallelism, including AsyncTP.
  • Features Paged KV caching with prefix caching and Hydragen for efficient attention.
  • Offers end-to-end torch.compile with dynamic shapes and CUDA graphs.
  • Includes a scheduler for aggressive sequence onboarding and KV cache management.

Maintenance & Community

  • Project authors include researchers from Stanford.
  • Citation available for research use.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • Described as a new project with potentially rough edges.
  • torch.compile can increase server startup time.
  • Hydragen may introduce slight numerical differences due to bfloat16 aggregation.
Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

  • 21 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.2%
889
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago