SwiftInfer  by hpcaitech

TensorRT implementation for StreamingLLM

Created 2 years ago
480 stars

Top 63.3% on SourcePulse

Project Summary

SwiftInfer provides a production-grade implementation of StreamingLLM, a technique for serving large language models over streaming input of effectively unlimited length. It targets researchers and engineers who deploy LLMs in real-time applications, and it aims for better performance and stability than the original PyTorch implementation.

How It Works

SwiftInfer builds on the TensorRT-LLM project to implement StreamingLLM. The technique keeps the key/value states of a few initial tokens ("attention sinks") in the cache alongside a sliding window of recent tokens, which prevents the model from collapsing on long, streaming inputs. Running this on TensorRT rather than the original PyTorch implementation offers significant performance advantages for production deployment.
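
The cache policy behind attention sinks can be illustrated with a small Python sketch. It assumes the usual description of StreamingLLM: the first few tokens are kept permanently as sinks, and the remaining slots hold a sliding window of the most recent tokens. This is a toy model, not SwiftInfer or TensorRT-LLM code; the class name SinkKVCache, its parameters, and the use of token ids in place of real key/value tensors are hypothetical choices made for the example.

```python
from collections import deque


class SinkKVCache:
    """Toy cache: token ids stand in for per-token key/value entries (assumption)."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # first tokens, never evicted
        self.recent = deque(maxlen=window)  # rolling window of latest tokens

    def append(self, token):
        """Add one token; the oldest non-sink token is evicted when the window is full."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)
        else:
            self.recent.append(token)  # deque drops its oldest entry automatically

    def contents(self):
        """Tokens currently visible to attention: sinks first, then the recent window."""
        return self.sinks + list(self.recent)

    def positions(self):
        """Positions are assigned by slot in the cache, not by index in the original stream."""
        return list(range(len(self.sinks) + len(self.recent)))


cache = SinkKVCache(num_sinks=2, window=3)
for token in range(10):
    cache.append(token)

print(cache.contents())   # [0, 1, 7, 8, 9]
print(cache.positions())  # [0, 1, 2, 3, 4]
```

The cache size stays bounded at num_sinks + window however long the stream runs, which is what lets inference continue over unbounded input.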

Quick Start & Requirements

  • Installation: Clone the repository and run pip install . from the repository root. Requires a pre-built TensorRT-LLM v0.6.0 (commit 42af740db51d6f11442fd5509ef745a4c043ce51).
  • Prerequisites: Python, build essentials, CUDA toolkit (>= 12.2), cuDNN, NCCL, TensorRT (>= 9.1.0), PyTorch.
  • Setup: Manual installation of TensorRT-LLM is required if not using Docker. The README provides scripts for TensorRT installation.
  • Example: Detailed steps for running a Llama example, including engine building and inference, are provided.
  • Documentation: TensorRT-LLM Documentation
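
The minimum versions listed above (CUDA toolkit >= 12.2, TensorRT >= 9.1.0) can be checked with a short Python helper before building. The sketch below is an assumption-laden illustration rather than part of SwiftInfer: the function names are hypothetical, the installed version strings must be supplied by the caller (the code does not query CUDA or TensorRT), and it assumes purely numeric dotted versions such as 12.2 or 9.1.0.

```python
# Minimum versions taken from the prerequisites list.
MINIMUMS = {"cuda": "12.2", "tensorrt": "9.1.0"}


def parse_version(text):
    """Turn '9.1.0' into (9, 1, 0); assumes numeric dotted components only."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '9.1' compares equal to '9.1.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_prerequisites(installed):
    """Map each required component to True/False; a missing component counts as version 0."""
    return {
        name: meets_minimum(installed.get(name, "0"), minimum)
        for name, minimum in MINIMUMS.items()
    }


# Hypothetical installed versions, supplied by hand for illustration.
print(check_prerequisites({"cuda": "12.4", "tensorrt": "9.0.1"}))
```

A False entry indicates a component that would need upgrading before the TensorRT-LLM v0.6.0 build is attempted.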

Highlighted Details

  • Production-grade TensorRT implementation of Streaming-LLM.
  • Built upon TensorRT-LLM v0.6.0.
  • Includes examples for multi-round conversation with Llama models.
  • Benchmarks show performance improvements over PyTorch.

Maintenance & Community

The project is associated with hpcaitech. The README mentions ongoing work to adapt to newer TensorRT-LLM APIs (v0.7.1) and notes that TensorRT-LLM has integrated StreamingLLM examples.

Licensing & Compatibility

The README does not state a license for the repository itself. The project builds on TensorRT-LLM, which is distributed under the Apache License 2.0, a permissive license that allows commercial use. Users should verify the license terms of the TensorRT-LLM version they use.

Limitations & Caveats

SwiftInfer is pinned to a specific, older TensorRT-LLM commit (v0.6.0), so keeping it current as TensorRT-LLM evolves may require manual effort. The README also notes that TensorRT-LLM's own StreamingLLM examples are better suited to single text generation than to multi-round conversations.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
