TensorRT implementation of StreamingLLM
SwiftInfer provides a production-grade implementation of StreamingLLM, a technique for efficient inference and serving of large language models over effectively infinite input lengths. It targets researchers and engineers deploying LLMs in real-time applications, offering better performance and stability than PyTorch-based solutions.
How It Works
SwiftInfer builds on the TensorRT-LLM project to implement the StreamingLLM technique, which uses "attention sinks" to prevent model collapse on long, streaming inputs: the key/value states of the first few tokens are kept in the cache permanently, alongside a rolling window of the most recent tokens, so attention stays stable as older tokens are evicted. This TensorRT-based approach offers significant performance advantages for production deployment compared to the original PyTorch implementation.
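To make the eviction policy concrete, here is a minimal, self-contained Python sketch of a sink-plus-window KV cache. It is illustrative only: the class and parameter names (SinkCache, num_sinks, window) are invented here and are not SwiftInfer's or TensorRT-LLM's actual API.

```python
from collections import deque


class SinkCache:
    """Toy StreamingLLM-style cache: keep the first `num_sinks` entries
    (the attention sinks) forever, plus a rolling window of the most
    recent `window` entries; everything in between is evicted."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry falls off automatically

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def cache(self):
        # Attention is computed over sinks + recent window; StreamingLLM
        # re-indexes positions within the cache rather than using
        # absolute offsets into the full stream.
        return self.sinks + list(self.recent)


# Stream 10,000 tokens through a 4-sink, 8-token-window cache.
c = SinkCache(num_sinks=4, window=8)
for t in range(10_000):
    c.append(t)
print(c.cache())  # [0, 1, 2, 3, 9992, 9993, ..., 9999]
```

The fixed cache size is what keeps memory use and per-token latency constant no matter how long the stream runs.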
Quick Start & Requirements
pip install .
Requires a pre-built TensorRT-LLM v0.6.0 (commit 42af740db51d6f11442fd5509ef745a4c043ce51).
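Since SwiftInfer is pinned to that exact TensorRT-LLM release, a quick pre-install check can save a failed build. Below is a minimal sketch using only the Python standard library; it assumes TensorRT-LLM is installed under the distribution name tensorrt-llm.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED = "0.6.0"  # TensorRT-LLM release SwiftInfer expects

try:
    installed = version("tensorrt-llm")  # distribution name is an assumption
except PackageNotFoundError:
    raise SystemExit("TensorRT-LLM not found; build v0.6.0 before installing SwiftInfer.")

if not installed.startswith(PINNED):
    raise SystemExit(f"Found TensorRT-LLM {installed}; SwiftInfer expects {PINNED}.")

print(f"TensorRT-LLM {installed} matches the pinned version.")
```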
Highlighted Details
Maintenance & Community
The project is maintained by hpcaitech; the repository's last activity was about a year ago, and it is currently inactive. The README mentions ongoing work to adapt to newer TensorRT-LLM APIs (v0.7.1) and notes that TensorRT-LLM has since integrated its own StreamingLLM examples.
Licensing & Compatibility
The repository itself is not explicitly licensed in the README. However, it is built upon TensorRT-LLM, which NVIDIA distributes under the Apache-2.0 license, permitting commercial use. Users should still verify the license of the specific TensorRT-LLM version they build against.
Limitations & Caveats
SwiftInfer is tightly coupled to a specific, older commit of TensorRT-LLM (v0.6.0), which may require manual effort to update as TensorRT-LLM evolves. The project acknowledges that TensorRT-LLM's own StreamingLLM examples are more suited for single text generation than multi-round conversations.