lmdeploy by InternLM

Toolkit for LLM compression, deployment, and serving

created 2 years ago
6,804 stars

Top 7.6% on sourcepulse

Project Summary

LMDeploy is a comprehensive toolkit designed for efficient compression, deployment, and serving of Large Language Models (LLMs) and Vision-Language Models (VLMs). It targets developers and researchers seeking to optimize LLM inference performance and simplify deployment across various hardware, offering significant speedups and advanced features.

How It Works

LMDeploy provides two distinct inference engines: TurboMind for maximum performance optimization and PyTorch for developer accessibility and rapid experimentation. TurboMind leverages techniques like persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels to achieve superior throughput. The PyTorch engine, written entirely in Python, lowers the barrier to entry for new features and model integration.
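
As a concrete illustration, both engines are reachable through the same pipeline API; the snippet below is a minimal sketch, with the model id and tensor-parallel degree as placeholder values to adjust for your setup.

    # Minimal sketch: selecting an inference engine via the pipeline API.
    # The model id and tp value are placeholders, not recommendations.
    from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

    # TurboMind backend: persistent batching, blocked KV cache, CUDA kernels.
    backend = TurbomindEngineConfig(tp=1)  # tensor-parallel degree
    # Swap in the pure-Python engine instead for easier work on new models:
    # backend = PytorchEngineConfig(tp=1)

    pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=backend)
    print(pipe(["Introduce LMDeploy in one sentence."]))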

Quick Start & Requirements

  • Installation: pip install lmdeploy (recommended in a conda environment with Python 3.8-3.12); a first-run sketch follows this list.
  • Prerequisites: CUDA 12+ is required for default prebuilt packages. CUDA 11 support and building from source are detailed in the installation guide.
  • Model Sources: Supports HuggingFace Hub by default; can be configured for ModelScope (export LMDEPLOY_USE_MODELSCOPE=True) or openMind Hub (export LMDEPLOY_USE_OPENMIND_HUB=True).
  • Documentation: https://lmdeploy.readthedocs.io/en/latest/
  • Quick Start: https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html
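
Putting the items above together, a first run might look like the following minimal sketch, assuming the pip package is installed and a supported GPU is available; the model id is a placeholder and the ModelScope switch is optional.

    # Minimal sketch of a first run after `pip install lmdeploy`.
    import os

    # Optional: resolve model ids against ModelScope instead of the HuggingFace
    # Hub (equivalent to the `export LMDEPLOY_USE_MODELSCOPE=True` shown above).
    os.environ["LMDEPLOY_USE_MODELSCOPE"] = "True"

    from lmdeploy import pipeline

    pipe = pipeline("internlm/internlm2_5-7b-chat")  # placeholder model id
    responses = pipe(["Hi, please introduce yourself."])
    print(responses[0].text)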

Highlighted Details

  • Achieves up to 1.8x higher request throughput than vLLM.
  • Supports 4-bit weight-only and KV cache quantization, with 4-bit inference running 2.4x faster than FP16; a loading sketch follows this list.
  • Offers seamless integration for multi-model, multi-machine, and multi-card inference services.
  • Provides dedicated support for a wide range of LLMs and VLMs, including Llama, Qwen, InternLM, Mistral, Mixtral, Gemma, and various multimodal models.
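
To make the quantization point concrete, the sketch below loads a pre-quantized 4-bit (AWQ-format) checkpoint with a quantized KV cache; the model id and policy value are illustrative, not prescriptive.

    # Minimal sketch: 4-bit weights plus KV cache quantization with TurboMind.
    # The model id is a placeholder for any AWQ-format 4-bit checkpoint.
    from lmdeploy import pipeline, TurbomindEngineConfig

    pipe = pipeline(
        "internlm/internlm2_5-7b-chat-4bit",
        backend_config=TurbomindEngineConfig(
            model_format="awq",   # weights are stored in 4-bit AWQ format
            quant_policy=8,       # quantize the KV cache to int8 (4 for int4)
        ),
    )
    print(pipe(["Summarize the benefits of 4-bit inference."]))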

Maintenance & Community

Active development with frequent updates, including recent support for Huawei Ascend, CUDA graphs, and new model architectures. Community channels include WeChat, Twitter, and Discord.

Licensing & Compatibility

Released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The default prebuilt package requires CUDA 12+, with specific instructions needed for older CUDA versions or building from source. While the PyTorch engine aims for accessibility, TurboMind's optimizations may require specific hardware configurations for maximum benefit.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 79
  • Issues (30d): 57
  • Star History: 552 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

Explore Similar Projects

llm-awq by mit-han-lab

0.4%
3k
Weight quantization research paper for LLM compression/acceleration
created 2 years ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6%
11k
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 18 hours ago