Qwen-TensorRT-LLM by Tlntin

TensorRT-LLM acceleration for Qwen models

Created 2 years ago
622 stars

Top 53.1% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized inference for Qwen large language models using NVIDIA TensorRT-LLM. It targets developers and researchers needing high-performance LLM deployment on NVIDIA GPUs, offering significant speedups and reduced memory footprints through various quantization techniques.

How It Works

The project leverages TensorRT-LLM's optimized kernels and graph optimizations to accelerate Qwen models. It supports multiple quantization schemes including FP16, BF16 (experimental), INT8 (Weight-Only and SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ). Features like KV cache quantization, Tensor Parallelism for multi-GPU setups, and integration with Triton Inference Server for high-throughput deployment are included.
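
To make the inference path concrete, here is a minimal sketch of running a pre-built Qwen engine through TensorRT-LLM's Python runtime. It assumes a TensorRT-LLM build that ships the `ModelRunner` API (roughly 0.8+, the era bundled in the Docker image listed under Quick Start); the engine directory and tokenizer id are placeholders, and the repo's own run scripts wrap similar logic plus Qwen's chat template.

```python
# Minimal sketch: run a pre-built Qwen engine with TensorRT-LLM's runtime.
# Assumes a TensorRT-LLM build that provides ModelRunner (~0.8+); the path
# and model id below are placeholders, not values from this repo.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/workspace/qwen_trt_engines"   # hypothetical engine output dir
tokenizer_id = "Qwen/Qwen1.5-7B-Chat"        # hypothetical tokenizer source

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir=engine_dir)

# Real chat use should apply Qwen's chat template; a bare prompt keeps this short.
input_ids = tokenizer("Briefly explain tensor parallelism.",
                      return_tensors="pt").input_ids.int()

# generate() takes a list of 1-D token tensors, one per batch entry, and
# returns (batch, beams, seq_len) ids that include the prompt tokens.
output_ids = runner.generate(
    batch_input_ids=[input_ids[0]],
    max_new_tokens=128,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    temperature=0.7,
)

new_tokens = output_ids[0][0][input_ids.shape[1]:]  # drop the echoed prompt
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```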

Quick Start & Requirements

  • Install: Use the provided Docker image (nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3).
  • Prerequisites: NVIDIA GPU with sufficient VRAM (e.g., 4GB for INT4 1.8B, 21GB for FP16 7B; see the sanity-check sketch after this list), CUDA, Docker, and nvidia-docker. Linux is recommended.
  • Setup: Clone the repo, pull the Docker image, and run a container. Install Python dependencies within the container.
  • Docs: Bilibili Tutorial, Blog Post
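
Before pulling the image, it can help to sanity-check free GPU memory against the rough figures above. Below is a minimal sketch using PyTorch; the figures are this summary's estimates, not measured values, and they leave no headroom for KV cache or activations.

```python
# Sanity-check free VRAM against the rough per-model figures quoted above.
# Assumption: the figures cover engine weights; KV cache needs extra headroom.
import torch

assert torch.cuda.is_available(), "no CUDA device visible"

REQUIRED_GB = {
    "Qwen 1.8B INT4": 4,
    "Qwen 7B FP16": 21,
}

free_bytes, _total_bytes = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3

for name, need in REQUIRED_GB.items():
    verdict = "ok" if free_gb >= need else "insufficient"
    print(f"{name}: needs ~{need} GB, {free_gb:.1f} GB free -> {verdict}")
```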

Highlighted Details

  • Supports a wide range of Qwen models including Qwen2, Qwen, Qwen-VL, and CodeQwen.
  • Offers various quantization methods: FP16, BF16 (experimental), INT8 (Weight-Only, SmoothQuant), and INT4 (Weight-Only, AWQ, GPTQ).
  • Enables Tensor Parallelism for multi-GPU inference.
  • Provides deployment options via Triton Inference Server and a FastAPI server compatible with the OpenAI API (see the client sketch after this list).
  • Includes a Gradio-based web demo for interactive use.
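
Since the FastAPI server mirrors the OpenAI API, any OpenAI-compatible client should be able to talk to it. Here is a minimal client sketch; the base URL, port, and model name are assumptions, so check the repo's server docs for the actual values.

```python
# Minimal client sketch for an OpenAI-compatible local server.
# The base_url, api_key handling, and model name are assumptions.
from openai import OpenAI  # pip install openai>=1.0

client = OpenAI(
    base_url="http://localhost:8000/v1",   # hypothetical local endpoint
    api_key="not-needed-for-local-server",
)

resp = client.chat.completions.create(
    model="qwen",  # hypothetical name exposed by the server
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```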

Maintenance & Community

The project notes that the official TensorRT-LLM main branch now supports Qwen/Qwen2, and this repository will not receive major updates.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the underlying TensorRT-LLM license and the specific Qwen model licenses.

Limitations & Caveats

Windows support is experimental. The project primarily targets TensorRT-LLM releases up to 0.7.0, with Qwen2 being the recommended and actively maintained model family. Some advanced features, such as INT8 KV cache calibration, require enough GPU memory to load the full model.
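
To put the calibration caveat in numbers, here is a back-of-the-envelope estimate of the memory needed just to hold the weights; the parameter counts are approximate, and real calibration adds activations and buffers on top.

```python
# Back-of-the-envelope weight-memory estimate (approximate parameter counts;
# actual calibration also needs activations and calibration buffers).
PARAMS = {"Qwen 1.8B": 1.8e9, "Qwen 7B": 7.7e9, "Qwen 14B": 14.2e9}
BYTES_PER_PARAM = {"FP16": 2, "INT8": 1, "INT4": 0.5}

for model, n in PARAMS.items():
    line = ", ".join(
        f"{dtype}: ~{n * b / 1024**3:.1f} GB" for dtype, b in BYTES_PER_PARAM.items()
    )
    print(f"{model} weights -> {line}")
```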

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0%
316 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.1%
912 stars
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.7%
995 stars
LLM inference engine for diverse applications
Created 2 years ago
Updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.1%
6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago
Updated 7 months ago