hpc-ops  by Tencent

Boosts LLM inference speed with production-grade operators

Created 1 month ago
734 stars

Top 47.1% on SourcePulse

Project Summary

HPC-Ops is a production-grade operator library designed to accelerate Large Language Model (LLM) inference. Developed by Tencent's Hunyuan AI Infra team, it targets engineers and researchers seeking to enhance inference performance and simplify integration into existing frameworks. The library offers state-of-the-art (SOTA) performance, particularly on NVIDIA H20 GPUs, and provides a clean API for seamless adoption.

How It Works

The core of HPC-Ops lies in its deeply optimized kernels tailored for specific hardware, notably NVIDIA H20 GPUs, achieving significant speedups. It supports multiple data types, including BF16 and FP8 with various quantization schemes, enabling a balance between performance and memory efficiency. The library is designed for easy integration, offering a clean API compatible with popular inference frameworks like vLLM and SGLang. Kernel development leverages modern CUDA tools such as CuTe and CUTLASS, allowing for rapid implementation and optimization.
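Paged attention, mentioned above, stores the KV cache in fixed-size blocks and uses a per-sequence block table to map logical token positions to physical blocks, avoiding large contiguous allocations. The following is a minimal, framework-agnostic sketch of that bookkeeping; it is not HPC-Ops code, and the block size and function names are illustrative assumptions.

```python
# Hypothetical illustration of paged-attention bookkeeping -- NOT HPC-Ops code.
# A paged KV cache stores keys/values in fixed-size blocks; a per-sequence
# block table maps logical token positions to physical blocks, so sequences
# of different lengths can share one memory pool.

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real kernels often use 16 or 32)

def lookup(block_table, token_pos):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block = token_pos // BLOCK_SIZE
    offset = token_pos % BLOCK_SIZE
    return block_table[logical_block], offset

# A 10-token sequence spread over non-contiguous physical blocks 7, 2, 9.
block_table = [7, 2, 9]
print(lookup(block_table, 0))  # (7, 0) -- first token lands in block 7
print(lookup(block_table, 5))  # (2, 1) -- second logical block, offset 1
print(lookup(block_table, 9))  # (9, 1) -- third logical block, offset 1
```

An attention kernel performs this translation for every cached token it reads, which is why kernels specialized for the access pattern (as HPC-Ops provides for prefill and decode) matter for throughput.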

Quick Start & Requirements

  • Primary install / run command: Install from source by cloning the repository, building a wheel package, and then installing via pip:
    git clone https://github.com/Tencent/hpc-ops.git
    cd hpc-ops
    make wheel
    python3 -m pip install dist/*.whl
    
  • Non-default prerequisites and dependencies:
    • NVIDIA SM90 architecture GPU (e.g., H20)
    • Python 3.8 or higher
    • Compilers with C++17 support
    • CUDA Toolkit: 12.3 or higher
    • Environment setup can be managed using requirements-dev.txt.
  • Links: Usage examples are available in the tests/ directory.

Highlighted Details

  • Achieves SOTA performance with up to 2.22x speedup on NVIDIA H20 GPUs compared to baselines like FlashInfer, FA2, FA3, and TensorRT-LLM.
  • Optimized kernels include Attention (for prefill and decode phases, supporting paged attention), Grouped GEMM (with FP8 weights and quantization), and Fused MoE (with FP8 expert weights and quantization).
  • Supports FP8 weights with block-wise or per-tensor scaling for quantized operations.
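To illustrate the difference between the two scaling schemes: FP8 (E4M3) can represent magnitudes only up to 448, so weights are divided by a scale factor before casting. Per-tensor scaling uses one scale for the whole tensor; block-wise scaling uses one scale per block, which tracks local dynamic range better when outliers are present. This is a conceptual sketch in plain Python, not HPC-Ops code; the block size and function names are illustrative assumptions.

```python
# Hypothetical sketch of per-tensor vs. block-wise FP8 scaling -- NOT HPC-Ops code.
FP8_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def per_tensor_scale(weights):
    """One scale for the entire tensor."""
    return max(abs(w) for w in weights) / FP8_MAX

def block_wise_scales(weights, block=4):
    """One scale per contiguous block of `block` weights (illustrative size)."""
    return [
        max(abs(w) for w in weights[i:i + block]) / FP8_MAX
        for i in range(0, len(weights), block)
    ]

# A tensor with one block of small values and one with large values:
# per-tensor scaling squashes the small block's resolution, while
# block-wise scaling gives each block a scale matched to its range.
weights = [0.01, -0.02, 0.03, 0.01,   # block 0: small magnitudes
           120.0, -80.0, 64.0, 32.0]  # block 1: large magnitudes
print(per_tensor_scale(weights))   # single coarse scale driven by 120.0
print(block_wise_scales(weights))  # [small scale for block 0, large for block 1]
```

This is why block-wise scaling generally preserves more accuracy for quantized GEMM and MoE weights, at the cost of storing and applying more scale factors inside the kernel.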

Maintenance & Community

The roadmap includes sparse attention kernels for long-context LLMs, extended quantization support (e.g., 4-bit/8-bit mixed precision), and kernels that overlap computation with communication for distributed inference. The project welcomes targeted contributions and is actively refining the toolkit for production use. No community channels (such as Discord or Slack) or sponsorship details are listed in the README.

Licensing & Compatibility

The provided README does not explicitly state the license type or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The library's optimizations primarily target NVIDIA H20 GPUs (SM90), so the reported speedups may not transfer to other hardware, and performance can vary substantially across inference scenarios and configurations.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
15
Star History
360 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 17 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.3%
9k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 2 years ago
Updated 2 weeks ago