fastllm is a high-performance large-model inference library written in C++ with no third-party dependencies apart from CUDA. It supports tensor parallelism for dense models and hybrid CPU+GPU inference for Mixture of Experts (MoE) models, enabling efficient deployment on consumer-grade hardware.
How It Works
The library is implemented in pure C++ with a backend-frontend separation design, facilitating cross-platform portability and easier integration of new compute devices. It leverages CUDA for GPU acceleration and offers support for ROCm and domestic Chinese GPUs (Tian, MuXi, Suiyuan). Key features include dynamic batching, streaming output, multi-NUMA node acceleration, and mixed CPU+GPU deployment.
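To give a feel for the frontend, the sketch below shows a chat call with streaming output through the pip-installed ftllm package. It is a minimal sketch, assuming ftllm keeps the fastllm_pytools-style Python interface (llm.model / stream_response) and accepts a Hugging Face model ID; consult the project documentation for the exact names and arguments.

```python
from ftllm import llm  # assumed module path; earlier releases shipped fastllm_pytools

# Load a model by Hugging Face ID or local path (assumed to download/convert on first use).
model = llm.model("Qwen/Qwen2-0.5B-Instruct")

# Streaming output: text segments are yielded as they are generated.
for segment in model.stream_response("Briefly introduce yourself."):
    print(segment, end="", flush=True)
print()
```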
Quick Start & Requirements
- Installation:
  - Nvidia GPUs (via pip):
    - Windows: install the dependency wheel first, then the package:
      pip install https://hf-mirror.com/fastllm/fastllmdepend-windows/resolve/main/ftllmdepend-0.0.0.1-py3-none-win_amd64.whl
      pip install ftllm
    - Linux: ensure CUDA 12+ is installed, then:
      pip install ftllm
  - Building from source is recommended for other GPUs (ROCm, domestic Chinese GPUs) or custom requirements. It needs GCC (>= 9.4), Make, CMake (>= 3.23), and the CUDA toolkit for GPU builds.
- Prerequisites: CUDA 12+ for Nvidia GPU acceleration. ROCm support is available.
- Demo (command-line chat):
  ftllm run Qwen/Qwen2-0.5B-Instruct
- Documentation: a "fastllm English Document" is referenced, but the README itself is primarily in Chinese with an English title; the English documentation may be incomplete or hosted elsewhere.
Highlighted Details
- Achieves 20+ tokens per second (TPS) on DeepSeek R1 671B INT4 with a dual-socket server and a single RTX 4090 GPU.
- Supports MOE model inference and multi-NUMA node acceleration.
- Enables deployment of large models like DeepSeek R1 671B INT4 on consumer-grade GPUs (e.g., 24GB VRAM).
- Offers model export functionality for pre-quantized weights to speed up loading.
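The export path lets you quantize a model once, save the result, and reload it quickly afterwards. The following is a hypothetical sketch assuming the fastllm_pytools-style conversion helpers (llm.from_hf / model.save) carried over into ftllm; the exact entry points and supported dtypes should be verified against the upstream docs.

```python
# Hypothetical sketch: convert a Hugging Face checkpoint to a pre-quantized
# fastllm file, assuming fastllm_pytools-style helpers (from_hf / save).
from transformers import AutoModelForCausalLM, AutoTokenizer
from ftllm import llm  # earlier releases: from fastllm_pytools import llm

path = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

# Quantize to INT4 during conversion and save a .flm file that loads much
# faster than re-quantizing the weights on every startup.
model = llm.from_hf(hf_model, tokenizer, dtype="int4")
model.save("qwen2-0.5b-int4.flm")
```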
Maintenance & Community
- QQ Group: 831641348
- WeChat Group: Not specified.
Licensing & Compatibility
- License: Not explicitly stated in the provided text. Compatibility for commercial use is not detailed.
Limitations & Caveats
- PIP installation is currently limited to Nvidia GPUs; other platforms require source compilation.
- GGUF model format is not supported for loading.
- The README is predominantly in Chinese, which may hinder non-Chinese speakers.