lmdeploy by InternLM

Toolkit for LLM compression, deployment, and serving

created 2 years ago
6,804 stars

Top 7.6% on sourcepulse

Project Summary

LMDeploy is a comprehensive toolkit designed for efficient compression, deployment, and serving of Large Language Models (LLMs) and Vision-Language Models (VLMs). It targets developers and researchers seeking to optimize LLM inference performance and simplify deployment across various hardware, offering significant speedups and advanced features.

How It Works

LMDeploy provides two distinct inference engines: TurboMind for maximum performance optimization and PyTorch for developer accessibility and rapid experimentation. TurboMind leverages techniques like persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels to achieve superior throughput. The PyTorch engine, written entirely in Python, lowers the barrier to entry for new features and model integration.
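
As a concrete illustration, both engines are reachable through the same pipeline API; the snippet below is a minimal sketch, with the model id and tensor-parallel degree as placeholder values to adjust for your setup.

    # Minimal sketch: selecting an inference engine via the pipeline API.
    # The model id and tp value are placeholders, not recommendations.
    from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

    # TurboMind backend: persistent batching, blocked KV cache, CUDA kernels.
    backend = TurbomindEngineConfig(tp=1)  # tensor-parallel degree
    # Swap in the pure-Python engine instead for easier work on new models:
    # backend = PytorchEngineConfig(tp=1)

    pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=backend)
    print(pipe(["Introduce LMDeploy in one sentence."]))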

Quick Start & Requirements

  • Installation: pip install lmdeploy (recommended in a conda environment with Python 3.8-3.12); a first-run sketch follows this list.
  • Prerequisites: CUDA 12+ is required for default prebuilt packages. CUDA 11 support and building from source are detailed in the installation guide.
  • Model Sources: Supports HuggingFace Hub by default; can be configured for ModelScope (export LMDEPLOY_USE_MODELSCOPE=True) or openMind Hub (export LMDEPLOY_USE_OPENMIND_HUB=True).
  • Documentation: https://lmdeploy.readthedocs.io/en/latest/
  • Quick Start: https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html
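
Putting the items above together, a first run might look like the following minimal sketch, assuming the pip package is installed and a supported GPU is available; the model id is a placeholder and the ModelScope switch is optional.

    # Minimal sketch of a first run after `pip install lmdeploy`.
    import os

    # Optional: resolve model ids against ModelScope instead of the HuggingFace
    # Hub (equivalent to the `export LMDEPLOY_USE_MODELSCOPE=True` shown above).
    os.environ["LMDEPLOY_USE_MODELSCOPE"] = "True"

    from lmdeploy import pipeline

    pipe = pipeline("internlm/internlm2_5-7b-chat")  # placeholder model id
    responses = pipe(["Hi, please introduce yourself."])
    print(responses[0].text)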

Highlighted Details

  • Achieves up to 1.8x higher request throughput than vLLM.
  • Supports 4-bit weight-only and KV cache quantization, with 4-bit inference running 2.4x faster than FP16; a loading sketch follows this list.
  • Offers seamless integration for multi-model, multi-machine, and multi-card inference services.
  • Provides dedicated support for a wide range of LLMs and VLMs, including Llama, Qwen, InternLM, Mistral, Mixtral, Gemma, and various multimodal models.
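
To make the quantization point concrete, the sketch below loads a pre-quantized 4-bit (AWQ-format) checkpoint with a quantized KV cache; the model id and policy value are illustrative, not prescriptive.

    # Minimal sketch: 4-bit weights plus KV cache quantization with TurboMind.
    # The model id is a placeholder for any AWQ-format 4-bit checkpoint.
    from lmdeploy import pipeline, TurbomindEngineConfig

    pipe = pipeline(
        "internlm/internlm2_5-7b-chat-4bit",
        backend_config=TurbomindEngineConfig(
            model_format="awq",   # weights are stored in 4-bit AWQ format
            quant_policy=8,       # quantize the KV cache to int8 (4 for int4)
        ),
    )
    print(pipe(["Summarize the benefits of 4-bit inference."]))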

Maintenance & Community

Active development with frequent updates, including recent support for Huawei Ascend, CUDA graphs, and new model architectures. Community channels include WeChat, Twitter, and Discord.

Licensing & Compatibility

Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The default prebuilt package requires CUDA 12+, with specific instructions needed for older CUDA versions or building from source. While the PyTorch engine aims for accessibility, TurboMind's optimizations may require specific hardware configurations for maximum benefit.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 79
  • Issues (30d): 57
  • Star History: 552 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

Explore Similar Projects

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6%
11k
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 18 hours ago