TensorRT implementation of StreamingLLM
SwiftInfer provides a production-grade implementation of StreamingLLM, a technique for efficient inference and serving of large language models over effectively infinite input lengths. It targets researchers and engineers deploying LLMs in real-time applications, offering better performance and stability than PyTorch-based solutions.
How It Works
SwiftInfer builds on the TensorRT-LLM project to implement the StreamingLLM technique, which uses "attention sinks" to prevent model collapse on long, streaming inputs: the key/value states of the first few tokens are kept in the cache permanently, alongside a rolling window of the most recent tokens, so attention stays stable as older tokens are evicted. This TensorRT-based approach offers significant performance advantages for production deployment compared to the original PyTorch implementation.
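To make the eviction policy concrete, here is a minimal, self-contained Python sketch of a sink-plus-window KV cache. It is illustrative only: the class and parameter names (SinkCache, num_sinks, window) are invented here and are not SwiftInfer's or TensorRT-LLM's actual API.

```python
from collections import deque


class SinkCache:
    """Toy StreamingLLM-style cache: keep the first `num_sinks` entries
    (the attention sinks) forever, plus a rolling window of the most
    recent `window` entries; everything in between is evicted."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry falls off automatically

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def cache(self):
        # Attention is computed over sinks + recent window; StreamingLLM
        # re-indexes positions within the cache rather than using
        # absolute offsets into the full stream.
        return self.sinks + list(self.recent)


# Stream 10,000 tokens through a 4-sink, 8-token-window cache.
c = SinkCache(num_sinks=4, window=8)
for t in range(10_000):
    c.append(t)
print(c.cache())  # [0, 1, 2, 3, 9992, 9993, ..., 9999]
```

The fixed cache size is what keeps memory use and per-token latency constant no matter how long the stream runs.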
Quick Start & Requirements
pip install .
Requires a pre-built TensorRT-LLM v0.6.0 (commit 42af740db51d6f11442fd5509ef745a4c043ce51).
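Since SwiftInfer is pinned to that exact TensorRT-LLM release, a quick pre-install check can save a failed build. Below is a minimal sketch using only the Python standard library; it assumes TensorRT-LLM is installed under the distribution name tensorrt-llm.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED = "0.6.0"  # TensorRT-LLM release SwiftInfer expects

try:
    installed = version("tensorrt-llm")  # distribution name is an assumption
except PackageNotFoundError:
    raise SystemExit("TensorRT-LLM not found; build v0.6.0 before installing SwiftInfer.")

if not installed.startswith(PINNED):
    raise SystemExit(f"Found TensorRT-LLM {installed}; SwiftInfer expects {PINNED}.")

print(f"TensorRT-LLM {installed} matches the pinned version.")
```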
Highlighted Details
Maintenance & Community
The project is maintained by hpcaitech; the repository's last activity was about a year ago, and it is currently inactive. The README mentions ongoing work to adapt to newer TensorRT-LLM APIs (v0.7.1) and notes that TensorRT-LLM has since integrated its own StreamingLLM examples.
Licensing & Compatibility
The repository itself is not explicitly licensed in the README. However, it is built upon TensorRT-LLM, which NVIDIA distributes under the Apache-2.0 license, permitting commercial use. Users should still verify the license of the specific TensorRT-LLM version they build against.
Limitations & Caveats
SwiftInfer is tightly coupled to a specific, older commit of TensorRT-LLM (v0.6.0), which may require manual effort to update as TensorRT-LLM evolves. The project acknowledges that TensorRT-LLM's own StreamingLLM examples are more suited for single text generation than multi-round conversations.