TensorRT-LLM acceleration for Qwen models
This repository provides optimized inference for Qwen large language models using NVIDIA TensorRT-LLM. It targets developers and researchers needing high-performance LLM deployment on NVIDIA GPUs, offering significant speedups and reduced memory footprints through various quantization techniques.
How It Works
The project leverages TensorRT-LLM's optimized kernels and graph optimizations to accelerate Qwen models. It supports multiple quantization schemes, including FP16, BF16 (experimental), INT8 (weight-only and SmoothQuant), and INT4 (weight-only, AWQ, and GPTQ). It also includes KV cache quantization, tensor parallelism for multi-GPU setups, and integration with Triton Inference Server for high-throughput deployment.
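As a concrete illustration of choosing a quantization scheme at build time, the sketch below follows the usual TensorRT-LLM example workflow. The script names, paths, and flags (build.py, run.py, --hf_model_dir, --use_weight_only, --weight_only_precision) are assumptions based on typical TensorRT-LLM 0.7.x example layouts and may differ from this repository's actual scripts.

```bash
# Hypothetical sketch: build an INT8 weight-only Qwen engine, then run it.
# Script names and flags mirror typical TensorRT-LLM 0.7.x examples and
# should be checked against this repository's own scripts.
python3 build.py \
    --hf_model_dir ./Qwen-7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./trt_engines/qwen-7b-chat-int8

python3 run.py \
    --engine_dir ./trt_engines/qwen-7b-chat-int8 \
    --tokenizer_dir ./Qwen-7B-Chat \
    --input_text "Hello, how are you?"
```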
Quick Start & Requirements
Running the project requires an NVIDIA GPU with a compatible CUDA and TensorRT-LLM installation. For Triton Inference Server deployment, the README references the NGC container nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 (see the launch sketch below).
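A minimal way to start that container with standard Docker flags is sketched below; the mounted engine directory is purely an illustrative placeholder.

```bash
# Launch the Triton + TensorRT-LLM container referenced above.
# The mounted engine directory is an illustrative placeholder.
docker run --gpus all --rm -it \
    -v "$(pwd)/trt_engines:/engines" \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
```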
Highlighted Details
Maintenance & Community
The project notes that the official TensorRT-LLM main branch now supports Qwen/Qwen2, and this repository will not receive major updates.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the underlying TensorRT-LLM license and the specific Qwen model licenses.
Limitations & Caveats
Windows support is experimental. The project primarily targets TensorRT-LLM releases up to 0.7.0; for newer models such as Qwen2, the actively maintained official TensorRT-LLM main branch is recommended. Some advanced features, such as INT8 KV cache calibration, require sufficient GPU memory to load the full model.