lightseekorg: Speed-of-light LLM inference engine
Summary
TokenSpeed is an LLM inference engine engineered for high-performance agentic workloads, aiming to combine TensorRT-LLM-class speed with vLLM-style ease of use. It targets users and organizations that need efficient, production-ready inference for complex AI agents.
How It Works
TokenSpeed employs a unique local-SPMD design within its modeling layer, utilizing a static compiler to automatically generate collective communication patterns from module-boundary annotations, eliminating the need for manual parallelism configuration. The scheduler features a C++ control plane and Python execution plane, managing request lifecycles and KV cache ownership via a finite-state machine, with compile-time type system enforcement for safe KV resource reuse. Its pluggable kernel system includes optimized implementations like fast Multi-head Latent Attention (MLA) on Blackwell hardware, and an SMG-integrated AsyncLLM entrypoint ensures low-overhead CPU-side request handling.
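The scheduler's approach of tying KV-cache ownership to a request-lifecycle finite-state machine can be sketched as follows. TokenSpeed enforces this at compile time in its C++ control plane; the Python below is only an illustrative runtime-checked analogue, and all names (`Scheduler`, `ReqState`, `KVBlock`) are hypothetical, not the project's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ReqState(Enum):
    QUEUED = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()

# Legal lifecycle transitions; anything else is rejected as a scheduler bug.
TRANSITIONS = {
    ReqState.QUEUED: {ReqState.PREFILL},
    ReqState.PREFILL: {ReqState.DECODE, ReqState.FINISHED},
    ReqState.DECODE: {ReqState.DECODE, ReqState.FINISHED},
    ReqState.FINISHED: set(),
}

@dataclass
class KVBlock:
    """A KV-cache block with single-owner semantics."""
    block_id: int
    owner: Optional[int] = None  # owning request id, or None if free

class Scheduler:
    def __init__(self, num_blocks: int):
        self.free = [KVBlock(i) for i in range(num_blocks)]
        self.owned: dict[int, list[KVBlock]] = {}
        self.state: dict[int, ReqState] = {}

    def admit(self, req_id: int) -> None:
        self.state[req_id] = ReqState.QUEUED
        self.owned[req_id] = []

    def transition(self, req_id: int, new: ReqState) -> None:
        cur = self.state[req_id]
        if new not in TRANSITIONS[cur]:
            raise ValueError(f"illegal transition {cur.name} -> {new.name}")
        self.state[req_id] = new
        if new is ReqState.FINISHED:
            self._release(req_id)  # KV blocks return to the pool exactly once

    def alloc_block(self, req_id: int) -> KVBlock:
        if self.state[req_id] not in (ReqState.PREFILL, ReqState.DECODE):
            raise ValueError("only running requests may hold KV blocks")
        block = self.free.pop()
        block.owner = req_id
        self.owned[req_id].append(block)
        return block

    def _release(self, req_id: int) -> None:
        for block in self.owned.pop(req_id):
            block.owner = None
            self.free.append(block)
```

The point of the design is that KV blocks are released exactly when a request enters its terminal state, so reuse of freed cache is safe by construction rather than by convention.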
Quick Start & Requirements
This is a preview release under heavy development. No installation commands, supported Python versions, or explicit dependency lists are given in the README excerpt beyond the implied GPU targets (B200/Blackwell, Hopper, MI350). The README links to a "Docs Index", "Getting Started", and "Launching a Server" guide.
Highlighted Details
Maintenance & Community
The project is currently under heavy development, with several major pull requests in progress and planned merges over the coming weeks. Specific details on contributors, community channels (like Discord/Slack), or roadmaps are not provided in the excerpt.
Licensing & Compatibility
The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README content.
Limitations & Caveats
This release is explicitly marked as a preview and is not intended for production deployments. Key features like broader model coverage (Qwen, DeepSeek, MiniMax), advanced runtime functionalities (PD, EPLB, VLM), and optimizations for specific platforms (Hopper, MI350) are still under development and will be merged incrementally.