High-performance C++ LLM inference library
fastllm is a high-performance large-model inference library written in C++, with no dependencies beyond CUDA. It supports tensor parallelism for dense models and mixed-mode (CPU+GPU) inference for Mixture of Experts (MoE) models, enabling efficient deployment on consumer-grade hardware.
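As a concrete illustration of that deployment path, here is a minimal sketch using the ftllm Python bindings; the llm.model and response names follow the project README, but treat the exact signatures as assumptions to verify:

# Minimal sketch: load a model and generate a reply via the ftllm bindings.
# Assumes `llm.model` accepts a Hugging Face model ID and `response` takes a
# prompt string; check the fastllm README for the authoritative API.
from ftllm import llm

model = llm.model("Qwen/Qwen2-0.5B-Instruct")  # downloads/loads the weights
print(model.response("Briefly explain tensor parallelism."))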
How It Works
The library is implemented in pure C++ with a backend/frontend separation that eases cross-platform porting and the integration of new compute devices. It uses CUDA for GPU acceleration and also supports ROCm and domestic Chinese GPUs (Tianshu, MuXi, Suiyuan). Key features include dynamic batching, streaming output, multi-NUMA-node acceleration, and mixed CPU+GPU deployment.
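A short sketch of the streaming-output feature mentioned above, assuming the bindings expose a generator-style stream_response (name taken from the project README; an assumption to verify before relying on it):

# Sketch of streaming output: tokens are printed as they arrive rather than
# after the full completion. `stream_response` is assumed to be a generator
# yielding text chunks; verify against the fastllm documentation.
from ftllm import llm

model = llm.model("Qwen/Qwen2-0.5B-Instruct")
for chunk in model.stream_response("Write a haiku about GPUs."):
    print(chunk, end="", flush=True)  # partial text, emitted incrementally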
Quick Start & Requirements
On Windows, first install the dependency wheel, then the package:
pip install https://hf-mirror.com/fastllm/fastllmdepend-windows/resolve/main/ftllmdepend-0.0.0.1-py3-none-win_amd64.whl
pip install ftllm
On other platforms, pip install ftllm is sufficient.
Launch a command-line chat with a model:
ftllm run Qwen/Qwen2-0.5B-Instruct
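Beyond the chat command, the package can also serve models over an OpenAI-compatible REST API. The sketch below queries such a server; the ftllm server command, the port, and the /v1/chat/completions route are all assumptions to check against the docs:

# Assumes a server was started with: ftllm server Qwen/Qwen2-0.5B-Instruct
# (command name is an assumption). Port and route below are assumptions too;
# adjust them to match your actual setup.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])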
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats