rtp-llm by alibaba

LLM inference engine for diverse applications

Created 1 year ago · 819 stars · Top 44.2% on sourcepulse

Project Summary

RTP-LLM is an open-source LLM inference acceleration engine developed by Alibaba for high-performance, production-ready deployment across diverse applications. It targets developers and researchers who need to serve large language models efficiently, offering significant speedups along with the flexibility demanded by Alibaba's many business units and AI platforms.

How It Works

RTP-LLM leverages high-performance CUDA kernels like PagedAttention and FlashAttention, combined with advanced techniques such as WeightOnly INT8/INT4 Quantization and adaptive KVCache Quantization. Its architecture is optimized for dynamic batching and specifically tuned for V100 GPUs, with ongoing efforts to support multiple hardware backends including AMD ROCm, Intel CPU, and ARM CPU.
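
To make the weight-only quantization idea concrete (independent of RTP-LLM's fused CUDA kernels, which dequantize inside the matmul), here is a minimal NumPy sketch of per-channel symmetric INT8 weight quantization; all names are illustrative:

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization of a weight matrix.

    Returns int8 weights plus one float scale per row, so the original
    weight is approximately w_q * scale.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def matmul_weight_only_int8(x: np.ndarray, w_q: np.ndarray, scale: np.ndarray):
    """Weight-only matmul: activations stay in float.

    A fused GPU kernel would dequantize inside the matmul instead of
    materializing the float weights, which is the point of the technique.
    """
    return x @ (w_q.astype(np.float32) * scale).T

# Round-trip check on random data: the error should be small.
w = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(1, 4096).astype(np.float32)
w_q, scale = quantize_weights_int8(w)
print(np.abs(x @ w.T - matmul_weight_only_int8(x, w_q, scale)).max())
```

Because only the weights are quantized while activations remain in floating point, accuracy loss is limited to the weights and weight memory traffic roughly halves versus FP16.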

Quick Start & Requirements

  • Installation: via Docker image or pip install with the provided .whl files (a hypothetical client call is sketched after this list).
  • Requirements: Linux, Python 3.10+, and an NVIDIA GPU with Compute Capability 7.0+; Docker images and dependencies are provided for both CUDA 11 and CUDA 12.
  • Resources: an NVIDIA GPU with sufficient VRAM for the chosen model.
  • Docs: Documentation, Docker Serving Example, Python Library Example.
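
As a sketch of what a request against a running serving container might look like, the snippet below uses plain HTTP; the port and JSON field names are assumptions for illustration only, so defer to the Docker Serving Example for the real endpoint and payload schema:

```python
import requests

# Hypothetical endpoint and payload: the port and field names below are
# assumptions for illustration; the Docker Serving Example documents the
# actual interface.
ENDPOINT = "http://localhost:8088"  # assumed serving port

response = requests.post(
    ENDPOINT,
    json={
        "prompt": "Write a haiku about inference engines.",
        "generate_config": {"max_new_tokens": 128},  # assumed field names
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```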

Highlighted Details

  • Production-proven within Alibaba across multiple business units (Taobao, Tmall, etc.).
  • Supports advanced acceleration techniques including Speculative Decoding (sketched after this list), Medusa, and Contextual Prefix Cache.
  • Handles multimodal inputs and deploys multiple LoRA services from a single model instance.
  • Offers multi-machine/multi-GPU tensor parallelism and P-tuning support.
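
For readers unfamiliar with speculative decoding, the toy sketch below illustrates the general technique rather than RTP-LLM's implementation: a cheap draft model proposes a few tokens, and the target model verifies them (in a real engine, one batched forward pass), keeping the longest agreeing prefix plus one corrected token.

```python
from typing import Callable, List

Token = int

def speculative_decode_step(
    draft_next: Callable[[List[Token]], Token],
    target_next: Callable[[List[Token]], Token],
    context: List[Token],
    k: int = 4,
) -> List[Token]:
    """One round of greedy speculative decoding (toy version).

    The draft model proposes k tokens autoregressively; the target model
    checks each position and we keep the longest prefix where both agree,
    plus one corrected token from the target model.
    """
    # Draft k tokens cheaply.
    draft = list(context)
    proposed = []
    for _ in range(k):
        t = draft_next(draft)
        proposed.append(t)
        draft.append(t)

    # Verify: accept while the target model would emit the same token.
    accepted = []
    verify = list(context)
    for t in proposed:
        expected = target_next(verify)
        if expected != t:
            accepted.append(expected)  # target's correction ends the round
            break
        accepted.append(t)
        verify.append(t)
    return accepted

# Toy usage: a "draft" that always proposes 1 and a "target" that emits 1
# until the sequence reaches length 3, then 2.
out = speculative_decode_step(
    draft_next=lambda ctx: 1,
    target_next=lambda ctx: 1 if len(ctx) < 3 else 2,
    context=[0, 0],
    k=4,
)
print(out)  # [1, 2]: one accepted token, then the target's correction
```

The win comes from the verification pass scoring all k drafted positions at once, so several tokens can be emitted per expensive target-model forward pass.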

Maintenance & Community

The project is actively developed by Alibaba's Foundation Model Inference Team. Community engagement channels include DingTalk and WeChat groups.

Licensing & Compatibility

The project is Apache 2.0 licensed, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

The project is primarily optimized for NVIDIA GPUs, with support for other hardware backends still under development. Some advanced features like Medusa may require specific configurations or hardware.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 108 stars in the last 90 days
