SwiftInfer by hpcaitech

TensorRT implementation for StreamingLLM

created 1 year ago
472 stars

Top 64.3% on SourcePulse

Project Summary

SwiftInfer provides a production-grade implementation of StreamingLLM, a technique for efficient inference and serving of large language models over effectively unbounded input lengths. It targets researchers and engineers deploying LLMs in real-time applications, offering better performance and stability than the original PyTorch implementation.

How It Works

SwiftInfer builds on NVIDIA's TensorRT-LLM project to implement the StreamingLLM technique, which keeps a few initial "attention sink" tokens in the KV cache (alongside a sliding window of recent tokens) to prevent model collapse on long, streaming inputs. This TensorRT-based approach offers significant performance advantages for production deployment over the original PyTorch implementation.
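The cache-eviction idea behind attention sinks can be sketched in a few lines of Python. This is a minimal illustration of the technique, not SwiftInfer's actual TensorRT kernels; the four-sink default and window size are assumptions based on the StreamingLLM paper's reported defaults.

```python
def streaming_kv_indices(seq_len: int, n_sinks: int = 4, window: int = 1020) -> list[int]:
    """Return the KV-cache positions StreamingLLM retains: the first
    `n_sinks` "attention sink" tokens plus a rolling window of the most
    recent `window` tokens. Everything in between is evicted."""
    if seq_len <= n_sinks + window:
        # The whole sequence still fits; nothing needs to be evicted.
        return list(range(seq_len))
    recent_start = seq_len - window
    return list(range(n_sinks)) + list(range(recent_start, seq_len))
```

With the defaults above, the cache is capped at 1024 entries regardless of how long the stream grows, which is why perplexity stays stable instead of collapsing once the sequence exceeds the pretraining context length.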

Quick Start & Requirements

  • Installation: Clone the repository and run pip install . from the repository root. Requires a pre-built TensorRT-LLM v0.6.0 (commit 42af740db51d6f11442fd5509ef745a4c043ce51).
  • Prerequisites: Python, build essentials, CUDA toolkit (>= 12.2), cuDNN, NCCL, TensorRT (>= 9.1.0), PyTorch.
  • Setup: Manual installation of TensorRT-LLM is required if not using Docker. The README provides scripts for TensorRT installation.
  • Example: Detailed steps for running a Llama example, including engine building and inference, are provided.
  • Documentation: TensorRT-LLM Documentation
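The installation steps above can be sketched as follows. This assumes the repository URL from the project name and that a compatible TensorRT-LLM v0.6.0 build (plus CUDA >= 12.2, cuDNN, NCCL, and TensorRT >= 9.1.0) is already installed, as the prerequisites list requires.

```shell
# Assumption: TensorRT-LLM v0.6.0 (commit 42af740d...) is already built
# and importable in the current Python environment.
git clone https://github.com/hpcaitech/SwiftInfer.git
cd SwiftInfer
pip install .
```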

Highlighted Details

  • Production-grade TensorRT implementation of Streaming-LLM.
  • Built upon TensorRT-LLM v0.6.0.
  • Includes examples for multi-round conversation with Llama models.
  • Benchmarks show performance improvements over PyTorch.

Maintenance & Community

The project is associated with hpcaitech. The README mentions ongoing work to adapt to newer TensorRT-LLM APIs (v0.7.1) and notes that TensorRT-LLM has integrated StreamingLLM examples.

Licensing & Compatibility

The README does not declare a license for the repository itself. It builds on TensorRT-LLM, which NVIDIA distributes under the permissive Apache-2.0 license; users should still verify the license of the specific TensorRT-LLM version they deploy.

Limitations & Caveats

SwiftInfer is tightly coupled to a specific, older commit of TensorRT-LLM (v0.6.0), which may require manual effort to update as TensorRT-LLM evolves. The project acknowledges that TensorRT-LLM's own StreamingLLM examples are more suited for single text generation than multi-round conversations.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 7 more.

torchtitan by pytorch
4k stars (1.2%)
PyTorch platform for generative AI model training research
created 1 year ago, updated 1 day ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 10 more.

FasterTransformer by NVIDIA
6k stars (0.1%)
Optimized transformer library for inference
created 4 years ago, updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Shizhe Diao (Research Scientist at NVIDIA; Author of LMFlow), and 13 more.

TensorRT-LLM by NVIDIA
11k stars (0.5%)
LLM inference optimization SDK for NVIDIA GPUs
created 2 years ago, updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Eric Zhang (Founding Engineer at Modal), and 9 more.

flux by black-forest-labs
24k stars (0.4%)
Inference code for FLUX image generation & editing models
created 1 year ago, updated 2 weeks ago