Qwen-TensorRT-LLM by Tlntin

TensorRT-LLM acceleration for Qwen models

Created 2 years ago
620 stars

Top 53.2% on SourcePulse

Project Summary

This repository provides optimized inference for Qwen large language models using NVIDIA TensorRT-LLM. It targets developers and researchers needing high-performance LLM deployment on NVIDIA GPUs, offering significant speedups and reduced memory footprints through various quantization techniques.

How It Works

The project leverages TensorRT-LLM's optimized kernels and graph optimizations to accelerate Qwen models. It supports multiple precision and quantization options, including FP16, BF16 (experimental), INT8 (Weight-Only and SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ). It also includes KV cache quantization, Tensor Parallelism for multi-GPU inference, and integration with Triton Inference Server for high-throughput deployment.
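
As an illustration, building an INT4 Weight-Only engine with the repository's build script looks roughly like the sketch below. The script path, flag names, and model directories are assumptions based on TensorRT-LLM 0.7.x-era example scripts, not commands quoted from this repo; check its README for the exact invocation.

```bash
# Illustrative only: flag names follow TensorRT-LLM 0.7.x-era Qwen example
# scripts and may differ in this repository.
cd examples/qwen

# Build a TensorRT engine with INT4 Weight-Only quantization.
python3 build.py \
    --hf_model_dir /models/Qwen-7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --output_dir /engines/qwen-7b-int4

# Smoke-test the engine (run.py arguments are likewise assumptions).
python3 run.py \
    --engine_dir /engines/qwen-7b-int4 \
    --tokenizer_dir /models/Qwen-7B-Chat
```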

Quick Start & Requirements

  • Install: Use the provided Docker image (nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3).
  • Prerequisites: NVIDIA GPU with sufficient VRAM (e.g., 4 GB for a 1.8B model in INT4, 21 GB for a 7B model in FP16), CUDA, Docker, and nvidia-docker. Linux is recommended.
  • Setup: Clone the repo, pull the Docker image, and run a container, then install the Python dependencies inside it (a minimal command sketch follows this list).
  • Docs: Bilibili Tutorial, Blog Post
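
A minimal command sketch of that setup, assuming the Docker image tag from the Install step; the repository URL is inferred from the project name, and the mount paths and requirements file are placeholders:

```bash
# Pull the Triton + TensorRT-LLM image named in the Install step.
docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

# Clone the repository (URL inferred from the project name).
git clone https://github.com/Tlntin/Qwen-TensorRT-LLM.git

# Start a GPU-enabled container with the repo mounted inside.
docker run --gpus all -it --rm \
    -v "$(pwd)/Qwen-TensorRT-LLM:/workspace/Qwen-TensorRT-LLM" \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 bash

# Inside the container: install the Python dependencies
# (requirements file name/location is an assumption; see the repo docs).
cd /workspace/Qwen-TensorRT-LLM && pip install -r requirements.txt
```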

Highlighted Details

  • Supports a wide range of Qwen models including Qwen2, Qwen, Qwen-VL, and CodeQwen.
  • Offers multiple precision and quantization options: FP16, BF16 (experimental), INT8 (Weight-Only, SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ).
  • Enables Tensor Parallelism for multi-GPU inference.
  • Provides deployment options via Triton Inference Server and a FastAPI server compatible with the OpenAI API (see the request sketch after this list).
  • Includes a Gradio-based web demo for interactive use.
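
As a quick illustration of the OpenAI-compatible server, a chat completion request would look roughly as follows; the host, port, and model name are placeholders, and the /v1/chat/completions path is simply the standard OpenAI-style route rather than one confirmed by this summary:

```bash
# Placeholder host/port/model; adjust to however the FastAPI server is launched.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "qwen",
          "messages": [
            {"role": "user", "content": "Briefly introduce TensorRT-LLM."}
          ]
        }'
```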

Maintenance & Community

The project notes that the official TensorRT-LLM main branch now supports Qwen/Qwen2, and this repository will not receive major updates.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the underlying TensorRT-LLM license and the specific Qwen model licenses.

Limitations & Caveats

Windows support is experimental. The project primarily targets older TensorRT-LLM releases (up to roughly 0.7.0), with Qwen2 being the recommended and actively maintained model family. Some advanced features, such as INT8 KV cache calibration, require enough GPU memory to load the full model.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

4 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0%
307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.2%
889 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2%
6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago
Updated 3 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k stars
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 17 hours ago