hpc-ops  by Tencent

Boosts LLM inference speed with production-grade operators

Created 1 month ago
734 stars

Top 47.1% on SourcePulse

Project Summary

HPC-Ops is a production-grade operator library designed to accelerate Large Language Model (LLM) inference. Developed by Tencent's Hunyuan AI Infra team, it targets engineers and researchers seeking to enhance inference performance and simplify integration into existing frameworks. The library offers state-of-the-art (SOTA) performance, particularly on NVIDIA H20 GPUs, and provides a clean API for seamless adoption.

How It Works

The core of HPC-Ops lies in its deeply optimized kernels tailored for specific hardware, notably NVIDIA H20 GPUs, achieving significant speedups. It supports multiple data types, including BF16 and FP8 with various quantization schemes, enabling a balance between performance and memory efficiency. The library is designed for easy integration, offering a clean API compatible with popular inference frameworks like vLLM and SGLang. Kernel development leverages modern CUDA tools such as CuTe and CUTLASS, allowing for rapid implementation and optimization.
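Paged attention, mentioned above, stores the KV cache in fixed-size blocks and uses a per-sequence block table to map logical token positions to physical blocks, avoiding large contiguous allocations. The following is a minimal, framework-agnostic sketch of that bookkeeping; it is not HPC-Ops code, and the block size and function names are illustrative assumptions.

```python
# Hypothetical illustration of paged-attention bookkeeping -- NOT HPC-Ops code.
# A paged KV cache stores keys/values in fixed-size blocks; a per-sequence
# block table maps logical token positions to physical blocks, so sequences
# of different lengths can share one memory pool.

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real kernels often use 16 or 32)

def lookup(block_table, token_pos):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block = token_pos // BLOCK_SIZE
    offset = token_pos % BLOCK_SIZE
    return block_table[logical_block], offset

# A 10-token sequence spread over non-contiguous physical blocks 7, 2, 9.
block_table = [7, 2, 9]
print(lookup(block_table, 0))  # (7, 0) -- first token lands in block 7
print(lookup(block_table, 5))  # (2, 1) -- second logical block, offset 1
print(lookup(block_table, 9))  # (9, 1) -- third logical block, offset 1
```

An attention kernel performs this translation for every cached token it reads, which is why kernels specialized for the access pattern (as HPC-Ops provides for prefill and decode) matter for throughput.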

Quick Start & Requirements

  • Primary install / run command: Install from source by cloning the repository, building a wheel package, and then installing via pip:
    git clone https://github.com/Tencent/hpc-ops.git
    cd hpc-ops
    make wheel
    python3 -m pip install dist/*.whl
    
  • Non-default prerequisites and dependencies:
    • NVIDIA SM90 architecture GPU (e.g., H20)
    • Python 3.8 or higher
    • Compilers with C++17 support
    • CUDA Toolkit: 12.3 or higher
    • Environment setup can be managed using requirements-dev.txt.
  • Links: Usage examples are available in the tests/ directory.

Highlighted Details

  • Achieves SOTA performance with up to 2.22x speedup on NVIDIA H20 GPUs compared to baselines like FlashInfer, FA2, FA3, and TensorRT-LLM.
  • Optimized kernels include Attention (for prefill and decode phases, supporting paged attention), Grouped GEMM (with FP8 weights and quantization), and Fused MoE (with FP8 expert weights and quantization).
  • Supports FP8 weights with block-wise or per-tensor scaling for quantized operations.
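To illustrate the difference between the two scaling schemes: FP8 (E4M3) can represent magnitudes only up to 448, so weights are divided by a scale factor before casting. Per-tensor scaling uses one scale for the whole tensor; block-wise scaling uses one scale per block, which tracks local dynamic range better when outliers are present. This is a conceptual sketch in plain Python, not HPC-Ops code; the block size and function names are illustrative assumptions.

```python
# Hypothetical sketch of per-tensor vs. block-wise FP8 scaling -- NOT HPC-Ops code.
FP8_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def per_tensor_scale(weights):
    """One scale for the entire tensor."""
    return max(abs(w) for w in weights) / FP8_MAX

def block_wise_scales(weights, block=4):
    """One scale per contiguous block of `block` weights (illustrative size)."""
    return [
        max(abs(w) for w in weights[i:i + block]) / FP8_MAX
        for i in range(0, len(weights), block)
    ]

# A tensor with one block of small values and one with large values:
# per-tensor scaling squashes the small block's resolution, while
# block-wise scaling gives each block a scale matched to its range.
weights = [0.01, -0.02, 0.03, 0.01,   # block 0: small magnitudes
           120.0, -80.0, 64.0, 32.0]  # block 1: large magnitudes
print(per_tensor_scale(weights))   # single coarse scale driven by 120.0
print(block_wise_scales(weights))  # [small scale for block 0, large for block 1]
```

This is why block-wise scaling generally preserves more accuracy for quantized GEMM and MoE weights, at the cost of storing and applying more scale factors inside the kernel.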

Maintenance & Community

The roadmap includes sparse attention kernels for long-context LLMs, extended quantization support (e.g., 4-bit/8-bit mixed precision), and kernels that overlap computation with communication for distributed inference. The project welcomes targeted contributions and is actively refining the toolkit for production use. No community channels (such as Discord or Slack) or sponsorship details are listed in the README.

Licensing & Compatibility

The provided README does not explicitly state the license type or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The library's optimizations primarily target NVIDIA H20 GPUs (SM90), so the reported speedups may not transfer to other hardware, and performance can vary substantially across inference scenarios and configurations.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
15
Star History
360 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 17 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.3%
9k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 2 years ago
Updated 2 weeks ago