tokenspeed by lightseekorg

Speed-of-light LLM inference engine

Created 3 days ago

863 stars

Top 41.1% on SourcePulse

View on GitHub
Summary

TokenSpeed is an LLM inference engine engineered for high-performance agentic workloads, aiming to match TensorRT-LLM's speed with vLLM's ease of use. It targets users and organizations requiring efficient, production-ready inference for complex AI agents, offering a significant performance boost through its specialized design.

How It Works

TokenSpeed employs a unique local-SPMD design within its modeling layer, utilizing a static compiler to automatically generate collective communication patterns from module-boundary annotations, eliminating the need for manual parallelism configuration. The scheduler features a C++ control plane and Python execution plane, managing request lifecycles and KV cache ownership via a finite-state machine, with compile-time type system enforcement for safe KV resource reuse. Its pluggable kernel system includes optimized implementations like fast Multi-head Latent Attention (MLA) on Blackwell hardware, and an SMG-integrated AsyncLLM entrypoint ensures low-overhead CPU-side request handling.
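The annotation-driven approach can be pictured with a toy sketch: declare the tensor layout at each module boundary, and let a compiler pass derive which collective belongs there instead of placing it by hand. All names here (`Annotation`, `infer_collectives`, the layout strings) are illustrative assumptions, not TokenSpeed's real API.

```python
# Hypothetical sketch: derive collectives from module-boundary layout
# annotations. Not TokenSpeed's actual compiler, just the idea.
from dataclasses import dataclass

@dataclass
class Annotation:
    name: str
    in_layout: str   # layout produced by the module: "replicated" | "sharded"
    out_layout: str  # layout the next module expects

def infer_collectives(annotations):
    """Map each boundary to the collective its layout change implies."""
    plan = {}
    for a in annotations:
        if a.in_layout == "sharded" and a.out_layout == "replicated":
            plan[a.name] = "all_reduce"  # partial results must be combined
        elif a.in_layout == "replicated" and a.out_layout == "sharded":
            plan[a.name] = "scatter"     # replicated tensor must be split
        else:
            plan[a.name] = "none"        # layouts agree, no communication
    return plan

# A tensor-parallel MLP: up-projection shards, down-projection re-replicates.
mlp = [
    Annotation("up_proj", "replicated", "sharded"),
    Annotation("down_proj", "sharded", "replicated"),
]
print(infer_collectives(mlp))
# → {'up_proj': 'scatter', 'down_proj': 'all_reduce'}
```

The point of the design is that the user only states *what* layout each module produces and consumes; the static compiler, not the user, decides *where* the all-reduces go.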

Quick Start & Requirements

This is a preview release under heavy development. Specific installation commands, Python versions, or explicit dependency lists (beyond implied GPU requirements like B200, Blackwell, Hopper, MI350) are not detailed in the provided README excerpt. Links to "Docs Index", "Getting Started", and "Launching a Server" are mentioned.

Highlighted Details

  • Achieves TensorRT-LLM-level performance and vLLM-level usability for agentic workloads.
  • Features a static compiler for automatic parallelism generation, simplifying user experience.
  • Includes a robust scheduler with compile-time safety checks for KV cache management.
  • Offers optimized kernels, including a fast MLA implementation for agentic tasks on Blackwell GPUs.
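The scheduler's finite-state machine tying request lifecycle to KV cache ownership can be sketched as follows. The states, transition table, and pool interface are assumptions for illustration (TokenSpeed enforces the equivalent invariants at compile time in its C++ control plane); this Python version checks them at runtime.

```python
# Hypothetical sketch of a request-lifecycle FSM that owns KV cache blocks
# only between allocation and completion. States and names are illustrative.
from enum import Enum, auto

class State(Enum):
    QUEUED = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()

# Legal lifecycle transitions; anything else is a scheduling bug.
TRANSITIONS = {
    State.QUEUED:   {State.PREFILL},
    State.PREFILL:  {State.DECODE, State.FINISHED},
    State.DECODE:   {State.DECODE, State.FINISHED},
    State.FINISHED: set(),
}

class Request:
    def __init__(self, rid):
        self.rid = rid
        self.state = State.QUEUED
        self.kv_block = None  # owned only from PREFILL until FINISHED

    def advance(self, new_state, kv_pool):
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        if new_state is State.PREFILL:
            self.kv_block = kv_pool.pop()    # take exclusive ownership
        if new_state is State.FINISHED and self.kv_block is not None:
            kv_pool.append(self.kv_block)    # release: now safe to reuse
            self.kv_block = None
        self.state = new_state

pool = [0, 1, 2]  # free KV block ids
req = Request("r1")
req.advance(State.PREFILL, pool)
req.advance(State.DECODE, pool)
req.advance(State.FINISHED, pool)
print(len(pool))  # → 3: every block returned before reuse
```

Encoding ownership in the state machine means a block can never be handed to a new request while a finished-but-uncleaned request still points at it, which is the reuse hazard the compile-time checks rule out.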

Maintenance & Community

The project is currently under heavy development, with several major pull requests in progress and planned merges over the coming weeks. Specific details on contributors, community channels (like Discord/Slack), or roadmaps are not provided in the excerpt.

Licensing & Compatibility

The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README content.

Limitations & Caveats

This release is explicitly marked as a preview and is not intended for production deployments. Key features like broader model coverage (Qwen, DeepSeek, MiniMax), advanced runtime functionalities (PD, EPLB, VLM), and optimizations for specific platforms (Hopper, MI350) are still under development and will be merged incrementally.

Health Check

  • Last Commit: 3 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 54
  • Issues (30d): 2
  • Star History: 868 stars in the last 3 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

  • 4k stars
  • Python framework for LLM inference and serving
  • Created 2 years ago · Updated 6 hours ago