fastllm by ztxz16

High-performance C++ LLM inference library

created 2 years ago
3,828 stars

Top 13.0% on sourcepulse

Project Summary

fastllm is a high-performance large language model inference library written in C++ with no third-party dependencies (apart from CUDA for GPU support). It supports tensor parallelism for dense models and mixed-mode inference for Mixture of Experts (MoE) models, enabling efficient deployment on consumer-grade hardware.

How It Works

The library is implemented in pure C++ with a separation between frontend and compute backends, which simplifies cross-platform porting and the integration of new compute devices. It uses CUDA for GPU acceleration and also supports ROCm and domestic Chinese GPUs (Tian, MuXi, Suiyuan). Key features include dynamic batching, streaming output, multi-NUMA-node acceleration, and mixed CPU+GPU deployment.

Quick Start & Requirements

  • Installation:
    • Nvidia GPUs (via pip):
      • Windows: pip install https://hf-mirror.com/fastllm/fastllmdepend-windows/resolve/main/ftllmdepend-0.0.0.1-py3-none-win_amd64.whl then pip install ftllm
      • Linux: Ensure CUDA 12+ is installed, then pip install ftllm
    • Source installation is recommended for other GPUs (ROCm, domestic) or specific needs. Requires GCC (>=9.4), Make, CMake (>=3.23), and CUDA toolkit for GPU compilation.
  • Prerequisites: CUDA 12+ for Nvidia GPU acceleration. ROCm support is available.
  • Demo: ftllm run Qwen/Qwen2-0.5B-Instruct (command-line chat; a Python sketch of the equivalent library call follows this list)
  • Documentation: fastllm English Document (note: the README is primarily in Chinese; the English documentation may be incomplete or hosted elsewhere).
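
After installing, the same model can be driven from Python through the ftllm bindings. The sketch below is illustrative only: it assumes the ftllm package exposes an llm.model() loader and a stream_response() generator, as in fastllm's earlier Python examples; verify the exact names against the project documentation.

    # Minimal chat sketch, assuming ftllm exposes llm.model() and stream_response()
    # (as in fastllm's earlier Python bindings); check the project docs for the exact API.
    from ftllm import llm

    # Same model as the command-line demo above; weights are fetched on first use.
    model = llm.model("Qwen/Qwen2-0.5B-Instruct")

    # Streaming output: pieces of the reply are yielded as they are generated.
    for piece in model.stream_response("Briefly introduce yourself."):
        print(piece, end="", flush=True)
    print()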

Highlighted Details

  • Achieves 20+ TPS on DeepSeek R1 671B INT4 with a dual-socket server and single 4090 GPU.
  • Supports MoE model inference and multi-NUMA-node acceleration.
  • Enables deployment of large models like DeepSeek R1 671B INT4 on consumer-grade GPUs (e.g., 24GB VRAM).
  • Offers model export functionality for pre-quantized weights to speed up loading (see the loading sketch after this list).
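
Loading such an export could look like the following, again assuming the same hypothetical llm.model() / response() interface; the local directory name is a placeholder for weights previously written out by fastllm's export step.

    # Hypothetical sketch: load pre-quantized weights that fastllm's export step
    # wrote to a local directory, so startup skips on-the-fly quantization.
    # "./qwen2-0.5b-int4/" is a placeholder path; llm.model() is assumed to accept
    # a local export directory as well as a Hugging Face repo id.
    from ftllm import llm

    model = llm.model("./qwen2-0.5b-int4/")
    print(model.response("Hello"))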

Maintenance & Community

  • QQ Group: 831641348
  • WeChat Group: Not specified.

Licensing & Compatibility

  • License: Not explicitly stated in the provided text. Compatibility for commercial use is not detailed.

Limitations & Caveats

  • pip installation is currently limited to Nvidia GPUs; other platforms require compilation from source.
  • GGUF model format is not supported for loading.
  • The README is predominantly in Chinese, which may hinder non-Chinese speakers.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 18

Star History

317 stars in the last 90 days
