fastllm is a high-performance large-model inference library written in C++ with no third-party dependencies apart from CUDA. It supports tensor parallelism for dense models and hybrid CPU+GPU inference for Mixture of Experts (MoE) models, enabling efficient deployment on consumer-grade hardware.
How It Works
The library is implemented in pure C++ with a backend-frontend separation design, facilitating cross-platform portability and easier integration of new compute devices. It leverages CUDA for GPU acceleration and offers support for ROCm and domestic Chinese GPUs (Tian, MuXi, Suiyuan). Key features include dynamic batching, streaming output, multi-NUMA node acceleration, and mixed CPU+GPU deployment.
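To give a feel for the frontend, the sketch below shows a chat call with streaming output through the pip-installed ftllm package. It is a minimal sketch, assuming ftllm keeps the fastllm_pytools-style Python interface (llm.model / stream_response) and accepts a Hugging Face model ID; consult the project documentation for the exact names and arguments.

```python
from ftllm import llm  # assumed module path; earlier releases shipped fastllm_pytools

# Load a model by Hugging Face ID or local path (assumed to download/convert on first use).
model = llm.model("Qwen/Qwen2-0.5B-Instruct")

# Streaming output: text segments are yielded as they are generated.
for segment in model.stream_response("Briefly introduce yourself."):
    print(segment, end="", flush=True)
print()
```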
Quick Start & Requirements
- Installation:
  - Nvidia GPUs (via pip):
    - Windows: install the dependency wheel first, then the package:
      pip install https://hf-mirror.com/fastllm/fastllmdepend-windows/resolve/main/ftllmdepend-0.0.0.1-py3-none-win_amd64.whl
      pip install ftllm
    - Linux: ensure CUDA 12+ is installed, then:
      pip install ftllm
  - Building from source is recommended for other GPUs (ROCm, domestic Chinese GPUs) or custom requirements. It needs GCC (>= 9.4), Make, CMake (>= 3.23), and the CUDA toolkit for GPU builds.
- Prerequisites: CUDA 12+ for Nvidia GPU acceleration. ROCm support is available.
- Demo (command-line chat):
  ftllm run Qwen/Qwen2-0.5B-Instruct
- Documentation: a "fastllm English Document" is referenced, but the README itself is primarily in Chinese with an English title; the English documentation may be incomplete or hosted elsewhere.
Highlighted Details
- Achieves 20+ tokens per second (TPS) on DeepSeek R1 671B INT4 with a dual-socket server and a single RTX 4090 GPU.
- Supports MOE model inference and multi-NUMA node acceleration.
- Enables deployment of large models like DeepSeek R1 671B INT4 on consumer-grade GPUs (e.g., 24GB VRAM).
- Offers model export functionality for pre-quantized weights to speed up loading.
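The export path lets you quantize a model once, save the result, and reload it quickly afterwards. The following is a hypothetical sketch assuming the fastllm_pytools-style conversion helpers (llm.from_hf / model.save) carried over into ftllm; the exact entry points and supported dtypes should be verified against the upstream docs.

```python
# Hypothetical sketch: convert a Hugging Face checkpoint to a pre-quantized
# fastllm file, assuming fastllm_pytools-style helpers (from_hf / save).
from transformers import AutoModelForCausalLM, AutoTokenizer
from ftllm import llm  # earlier releases: from fastllm_pytools import llm

path = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

# Quantize to INT4 during conversion and save a .flm file that loads much
# faster than re-quantizing the weights on every startup.
model = llm.from_hf(hf_model, tokenizer, dtype="int4")
model.save("qwen2-0.5b-int4.flm")
```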
Maintenance & Community
- QQ Group: 831641348
- WeChat Group: Not specified.
Licensing & Compatibility
- License: Not explicitly stated in the provided text. Compatibility for commercial use is not detailed.
Limitations & Caveats
- PIP installation is currently limited to Nvidia GPUs; other platforms require source compilation.
- GGUF model format is not supported for loading.
- The README is predominantly in Chinese, which may hinder non-Chinese speakers.