SwiftInfer by hpcaitech

TensorRT implementation of StreamingLLM

Created 1 year ago
478 stars

Top 64.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

SwiftInfer provides a production-grade TensorRT implementation of StreamingLLM, a technique for efficient inference and serving of large language models over effectively unbounded input lengths. It targets researchers and engineers deploying LLMs in real-time applications, offering better performance and stability than PyTorch-based implementations.

How It Works

SwiftInfer builds on the TensorRT-LLM project to implement StreamingLLM, which designates the first few tokens of a sequence as "attention sinks": their key/value cache entries are always retained, while the rest of the cache holds only a sliding window of recent tokens. Because attention scores concentrate heavily on those initial tokens, evicting them causes generation quality to collapse on long streaming inputs; pinning them keeps attention stable no matter how long the stream runs. The TensorRT-based implementation offers significant performance advantages for production deployment over the original PyTorch implementation.
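A minimal sketch of that eviction policy in plain Python, assuming a toy list-based cache; SwiftInfer's real implementation manages the KV cache inside TensorRT kernels and also re-assigns rotary positions within the cache, which is omitted here. Class and parameter names are illustrative, not SwiftInfer's API:

```python
class SinkCache:
    """Toy KV cache: always keep the first `num_sinks` tokens (the
    attention sinks) plus a sliding window of the `window` most recent
    tokens."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.window = window
        self.keys: list = []    # one placeholder entry per cached token
        self.values: list = []

    def append(self, k, v) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Once over budget, evict the oldest *non-sink* token: the cache
        # stays at num_sinks + window entries however long the stream runs.
        if len(self.keys) > self.num_sinks + self.window:
            del self.keys[self.num_sinks]
            del self.values[self.num_sinks]
```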

Quick Start & Requirements

  • Installation: Clone the repository and install with pip install . from the repository root. Requires a pre-built TensorRT-LLM v0.6.0 (commit 42af740db51d6f11442fd5509ef745a4c043ce51).
  • Prerequisites: Python, build essentials, CUDA toolkit (>= 12.2), cuDNN, NCCL, TensorRT (>= 9.1.0), PyTorch.
  • Setup: Manual installation of TensorRT-LLM is required if not using Docker. The README provides scripts for TensorRT installation.
  • Example: Detailed steps for running a multi-round Llama conversation example, including engine building and inference, are provided; a toy illustration of the cache behavior that example relies on follows this list.
  • Documentation: TensorRT-LLM Documentation
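To make the bounded-memory property concrete, here is a runnable toy loop reusing the SinkCache sketch above. Token IDs stand in for real key/value tensors, and the budget of 4 sinks plus a 1020-token window is illustrative, not necessarily the configuration used in SwiftInfer's example scripts:

```python
# Simulate 100 conversation rounds of 256 tokens each and confirm the
# cache never grows past its budget (4 sinks + 1020-token window).
cache = SinkCache(num_sinks=4, window=1020)
token_id = 0
for _ in range(100):                      # conversation rounds
    for _ in range(256):                  # tokens streamed this round
        cache.append(token_id, token_id)  # placeholder (key, value) pair
        token_id += 1
print(f"tokens seen: {token_id}, tokens cached: {len(cache.keys)}")
# -> tokens seen: 25600, tokens cached: 1024
```

This constant cache size is what lets a multi-round conversation run indefinitely without exhausting GPU memory.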

Highlighted Details

  • Production-grade TensorRT implementation of StreamingLLM.
  • Built upon TensorRT-LLM v0.6.0.
  • Includes examples for multi-round conversation with Llama models.
  • Benchmarks in the README show throughput gains over the original PyTorch implementation of StreamingLLM.

Maintenance & Community

The project is associated with hpcaitech. The README mentions ongoing work to adapt to newer TensorRT-LLM APIs (v0.7.1) and notes that TensorRT-LLM has integrated StreamingLLM examples.

Licensing & Compatibility

The repository itself is not explicitly licensed in the README. It is built on TensorRT-LLM, which NVIDIA distributes under the Apache-2.0 license, permitting commercial use; users should still verify the license of the specific TensorRT-LLM version they deploy.

Limitations & Caveats

SwiftInfer is pinned to a specific, older commit of TensorRT-LLM (v0.6.0), so tracking newer TensorRT-LLM releases may require manual porting effort. The project also notes that TensorRT-LLM's own StreamingLLM examples are better suited to generating a single long text than to multi-round conversation.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 5 more.

streaming-llm by mit-han-lab

Top 0.1% on SourcePulse · 7k stars
Framework for efficient LLM streaming
Created 2 years ago · Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lei Zhang (Director of Engineering, AI at AMD), and 23 more.

gpt-fast by meta-pytorch

Top 0.2% on SourcePulse · 6k stars
PyTorch text generation for efficient transformer inference
Created 2 years ago · Updated 1 month ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

Top 0.4% on SourcePulse · 12k stars
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago · Updated 16 hours ago