rtp-llm by alibaba

LLM inference engine for diverse applications

Created 1 year ago · 819 stars · Top 44.2% on sourcepulse

Project Summary

RTP-LLM is an open-source LLM inference acceleration engine developed by Alibaba for high-performance, production-ready deployment across diverse applications. It targets developers and researchers who need to serve large language models efficiently, offering significant speedups along with the flexibility demanded by Alibaba's many business units and AI platforms.

How It Works

RTP-LLM leverages high-performance CUDA kernels like PagedAttention and FlashAttention, combined with advanced techniques such as WeightOnly INT8/INT4 Quantization and adaptive KVCache Quantization. Its architecture is optimized for dynamic batching and specifically tuned for V100 GPUs, with ongoing efforts to support multiple hardware backends including AMD ROCm, Intel CPU, and ARM CPU.
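
To make the weight-only quantization idea concrete (independent of RTP-LLM's fused CUDA kernels, which dequantize inside the matmul), here is a minimal NumPy sketch of per-channel symmetric INT8 weight quantization; all names are illustrative:

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization of a weight matrix.

    Returns int8 weights plus one float scale per row, so the original
    weight is approximately w_q * scale.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def matmul_weight_only_int8(x: np.ndarray, w_q: np.ndarray, scale: np.ndarray):
    """Weight-only matmul: activations stay in float.

    A fused GPU kernel would dequantize inside the matmul instead of
    materializing the float weights, which is the point of the technique.
    """
    return x @ (w_q.astype(np.float32) * scale).T

# Round-trip check on random data: the error should be small.
w = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(1, 4096).astype(np.float32)
w_q, scale = quantize_weights_int8(w)
print(np.abs(x @ w.T - matmul_weight_only_int8(x, w_q, scale)).max())
```

Because only the weights are quantized while activations remain in floating point, accuracy loss is limited to the weights and weight memory traffic roughly halves versus FP16.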

Quick Start & Requirements

  • Installation: via Docker image or pip install with the provided .whl files (a hypothetical client call is sketched after this list).
  • Requirements: Linux, Python 3.10+, and an NVIDIA GPU with Compute Capability 7.0+; Docker images and dependencies are provided for both CUDA 11 and CUDA 12.
  • Resources: an NVIDIA GPU with sufficient VRAM for the chosen model.
  • Docs: Documentation, Docker Serving Example, Python Library Example.
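
As a sketch of what a request against a running serving container might look like, the snippet below uses plain HTTP; the port and JSON field names are assumptions for illustration only, so defer to the Docker Serving Example for the real endpoint and payload schema:

```python
import requests

# Hypothetical endpoint and payload: the port and field names below are
# assumptions for illustration; the Docker Serving Example documents the
# actual interface.
ENDPOINT = "http://localhost:8088"  # assumed serving port

response = requests.post(
    ENDPOINT,
    json={
        "prompt": "Write a haiku about inference engines.",
        "generate_config": {"max_new_tokens": 128},  # assumed field names
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```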

Highlighted Details

  • Production-proven within Alibaba across multiple business units (Taobao, Tmall, etc.).
  • Supports advanced acceleration techniques including Speculative Decoding (sketched after this list), Medusa, and Contextual Prefix Cache.
  • Handles multimodal inputs and deploys multiple LoRA services from a single model instance.
  • Offers multi-machine/multi-GPU tensor parallelism and P-tuning support.
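
For readers unfamiliar with speculative decoding, the toy sketch below illustrates the general technique rather than RTP-LLM's implementation: a cheap draft model proposes a few tokens, and the target model verifies them (in a real engine, one batched forward pass), keeping the longest agreeing prefix plus one corrected token.

```python
from typing import Callable, List

Token = int

def speculative_decode_step(
    draft_next: Callable[[List[Token]], Token],
    target_next: Callable[[List[Token]], Token],
    context: List[Token],
    k: int = 4,
) -> List[Token]:
    """One round of greedy speculative decoding (toy version).

    The draft model proposes k tokens autoregressively; the target model
    checks each position and we keep the longest prefix where both agree,
    plus one corrected token from the target model.
    """
    # Draft k tokens cheaply.
    draft = list(context)
    proposed = []
    for _ in range(k):
        t = draft_next(draft)
        proposed.append(t)
        draft.append(t)

    # Verify: accept while the target model would emit the same token.
    accepted = []
    verify = list(context)
    for t in proposed:
        expected = target_next(verify)
        if expected != t:
            accepted.append(expected)  # target's correction ends the round
            break
        accepted.append(t)
        verify.append(t)
    return accepted

# Toy usage: a "draft" that always proposes 1 and a "target" that emits 1
# until the sequence reaches length 3, then 2.
out = speculative_decode_step(
    draft_next=lambda ctx: 1,
    target_next=lambda ctx: 1 if len(ctx) < 3 else 2,
    context=[0, 0],
    k=4,
)
print(out)  # [1, 2]: one accepted token, then the target's correction
```

The win comes from the verification pass scoring all k drafted positions at once, so several tokens can be emitted per expensive target-model forward pass.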

Maintenance & Community

The project is actively developed by Alibaba's Foundation Model Inference Team. Community engagement channels include DingTalk and WeChat groups.

Licensing & Compatibility

The project is Apache 2.0 licensed, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

The project is primarily optimized for NVIDIA GPUs, with support for other hardware backends still under development. Some advanced features like Medusa may require specific configurations or hardware.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 108 stars in the last 90 days
