Toolkit for LLM compression, deployment, and serving
Top 7.6% on sourcepulse
LMDeploy is a comprehensive toolkit designed for efficient compression, deployment, and serving of Large Language Models (LLMs) and Vision-Language Models (VLMs). It targets developers and researchers seeking to optimize LLM inference performance and simplify deployment across various hardware, offering significant speedups and advanced features.
How It Works
LMDeploy provides two distinct inference engines: TurboMind for maximum performance optimization and PyTorch for developer accessibility and rapid experimentation. TurboMind leverages techniques like persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels to achieve superior throughput. The PyTorch engine, written entirely in Python, lowers the barrier to entry for new features and model integration.
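As a rough sketch of how an engine is chosen in practice, the pipeline API accepts a backend config object; the model name and parameter values below are illustrative, not prescriptive.

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# TurboMind backend: tp sets tensor parallelism, cache_max_entry_count caps
# the fraction of free GPU memory reserved for the blocked KV cache.
backend_config = TurbomindEngineConfig(tp=1, cache_max_entry_count=0.8)
# Passing a PytorchEngineConfig here instead selects the pure-Python PyTorch engine.
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=backend_config)

gen_config = GenerationConfig(max_new_tokens=128, temperature=0.8)
responses = pipe(["Explain blocked KV cache in one sentence."], gen_config=gen_config)
print(responses[0].text)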
Quick Start & Requirements
Install with pip install lmdeploy (recommended in a conda environment with Python 3.8-3.12).
Models can be downloaded from ModelScope (export LMDEPLOY_USE_MODELSCOPE=True) or openMind Hub (export LMDEPLOY_USE_OPENMIND_HUB=True) instead of the default Hugging Face Hub.
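As a minimal smoke test after installation, the pipeline API can run offline inference in a few lines; the model name below is only an example, and any chat model supported by LMDeploy works the same way.

from lmdeploy import pipeline

# Build an offline-inference pipeline; the model is fetched from the configured
# hub (Hugging Face by default, or ModelScope / openMind Hub when the
# corresponding environment variable is set).
pipe = pipeline("internlm/internlm2_5-7b-chat")
print(pipe(["Hi, please introduce yourself."]))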
Highlighted Details
Maintenance & Community
Active development with frequent updates, including recent support for Huawei Ascend, CUDA graphs, and new model architectures. Community channels include WeChat, Twitter, and Discord.
Licensing & Compatibility
Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The default prebuilt package requires CUDA 12+, with specific instructions needed for older CUDA versions or building from source. While the PyTorch engine aims for accessibility, TurboMind's optimizations may require specific hardware configurations for maximum benefit.