Qwen-TensorRT-LLM by Tlntin

TensorRT-LLM acceleration for Qwen models

Created 1 year ago · 616 stars · Top 54.3% on sourcepulse
Project Summary

This repository provides optimized inference for Qwen large language models using NVIDIA TensorRT-LLM. It targets developers and researchers needing high-performance LLM deployment on NVIDIA GPUs, offering significant speedups and reduced memory footprints through various quantization techniques.

How It Works

The project leverages TensorRT-LLM's optimized kernels and graph optimizations to accelerate Qwen models. It supports multiple quantization schemes, including FP16, BF16 (experimental), INT8 (Weight-Only and SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ). It also provides INT8 KV cache quantization, Tensor Parallelism for multi-GPU inference, and integration with Triton Inference Server for high-throughput deployment.
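
For orientation, the sketch below shows what inference against a prebuilt engine typically looks like with TensorRT-LLM's Python runtime. The engine directory, tokenizer name, and exact ModelRunner usage are assumptions (the API has shifted across TensorRT-LLM releases), so treat it as illustrative rather than as this repository's exact entry point.

```python
# Minimal sketch of running a prebuilt Qwen engine with TensorRT-LLM's
# Python runtime. The engine directory and tokenizer name are assumptions;
# the ModelRunner API shown matches recent TensorRT-LLM releases but may
# differ in the version this repository pins.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat", trust_remote_code=True
)
runner = ModelRunner.from_dir(engine_dir="./trt_engines/qwen_7b_chat_int4")

input_ids = tokenizer(
    "Tell me about TensorRT-LLM.", return_tensors="pt"
).input_ids
outputs = runner.generate(
    [input_ids[0].int()],  # batch of 1-D int32 token-id tensors
    max_new_tokens=128,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
# outputs has shape (batch, beams, seq_len); decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```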

Quick Start & Requirements

  • Install: Use the provided Docker image (nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3).
  • Prerequisites: an NVIDIA GPU with sufficient VRAM (roughly 4 GB for a 1.8B model in INT4, 21 GB for a 7B model in FP16), CUDA, Docker, and nvidia-docker. Linux is recommended.
  • Setup: Clone the repo, pull the Docker image, and run a container. Install the Python dependencies inside the container; a minimal environment check is sketched after this list.
  • Docs: Bilibili Tutorial, Blog Post
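
Once inside the container, a short sanity check like the following confirms that the GPU is visible and the TensorRT-LLM wheel imports cleanly (a minimal sketch; installed versions depend on the image tag):

```python
# Sanity check inside the container: confirm the GPU is visible and the
# TensorRT-LLM wheel imports. These are standard torch/tensorrt_llm
# attributes, but the installed versions depend on the Docker image.
import torch
import tensorrt_llm

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")
```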

Highlighted Details

  • Supports a wide range of Qwen models including Qwen2, Qwen, Qwen-VL, and CodeQwen.
  • Offers multiple quantization methods: FP16, INT8 (Weight-Only, SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ).
  • Enables Tensor Parallelism for multi-GPU inference.
  • Provides deployment options via Triton Inference Server and a FastAPI server compatible with the OpenAI API (see the client sketch below).
  • Includes a Gradio-based web demo for interactive use.
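
Because the FastAPI server mimics the OpenAI API, any standard OpenAI client can talk to it. The port, path, and model name below are assumptions; check the repository's server script for the actual values.

```python
# Calling the repo's OpenAI-compatible FastAPI server with the standard
# openai (>=1.0) client. The base_url, api_key, and model name below are
# assumptions -- adjust them to match the server's actual configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(resp.choices[0].message.content)
```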

Maintenance & Community

The project notes that the official TensorRT-LLM main branch now supports Qwen/Qwen2, and this repository will not receive major updates.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the underlying TensorRT-LLM license and the specific Qwen model licenses.

Limitations & Caveats

Windows support is experimental. The project primarily targets TensorRT-LLM releases up to 0.7.0, with Qwen2 being the recommended and actively maintained model. Some advanced features, such as INT8 KV cache calibration, require enough GPU memory to load the full model.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16
High-performance C++ LLM inference library
Top 0.4% · 4k stars · created 2 years ago · updated 2 weeks ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ
LLM quantization package using GPTQ algorithm
Top 0.1% · 5k stars · created 2 years ago · updated 3 months ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
LLM inference optimization SDK for NVIDIA GPUs
Top 0.6% · 11k stars · created 1 year ago · updated 8 hours ago