mHC.cu by AndreSlavescu

Accelerating deep learning with CUDA mHC kernels

Created 1 month ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This repository provides an unofficial CUDA implementation of DeepSeek-AI's Manifold-Constrained Hyper-Connections (mHC) layer. It targets researchers and engineers seeking to accelerate deep learning model training and inference on NVIDIA GPUs by offering highly optimized kernels. The primary benefit is substantial performance gains over standard PyTorch implementations.

How It Works

The project implements mHC kernels directly in CUDA, enabling native GPU acceleration. It supports two modes: the default "Dynamic H Path," where H values are computed per batch via learned projections, and a "Static H Path" optimized for faster inference by sharing a single H across the batch. Bypassing PyTorch's framework overhead with native CUDA kernels is what yields the significant speedups.
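To make the two modes concrete, here is a minimal NumPy sketch of the distinction described above. This is illustrative only: the function names, shapes, and the tanh projection are assumptions for exposition, not the repository's actual API or the paper's exact parameterization.

```python
import numpy as np

def dynamic_h_path(x, w_proj):
    """Dynamic H path (illustrative): a mixing matrix H is computed
    per batch element from the input via a learned projection."""
    b, n, d = x.shape                        # batch, residual streams, width
    # pool each stream, then project pooled features to an n x n matrix
    h = np.tanh(x.mean(axis=2) @ w_proj)     # (b, n*n) mixing logits
    h = h.reshape(b, n, n)
    return np.einsum("bij,bjd->bid", h, x)   # mix streams with per-sample H

def static_h_path(x, h):
    """Static H path (illustrative): one H shared across the whole
    batch, which is cheaper and inference-friendly."""
    return np.einsum("ij,bjd->bid", h, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))           # 2 samples, 4 streams, width 8
w = rng.standard_normal((4, 16)) * 0.1       # projection to 4x4 H entries
out = dynamic_h_path(x, w)
print(out.shape)                             # (2, 4, 8)
```

Note the trade-off this sketch exposes: the dynamic path must materialize a separate H per sample, while the static path applies one fixed matrix, which is why the static variant benchmarks faster for inference.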

Quick Start & Requirements

  • Installation: Install the PyTorch extension with make install; for development, use make install-dev.
  • Build: Compile the C++/CUDA sources with make for all architectures, or with make CUDA_ARCH=90 for a specific NVIDIA architecture (90 targets H100).
  • Testing: Run C++/CUDA tests with make test and Python tests with make test-python.
  • Benchmarking: Execute C++/CUDA benchmarks via make bench and Python benchmarks via make bench-python.
  • Prerequisites: Requires CUDA-enabled NVIDIA GPUs (e.g., H100, B200). PyTorch is necessary for the extension.
  • Modal Usage: The runmodal.py script facilitates testing and benchmarking on cloud GPUs (e.g., modal run runmodal.py --gpu h100 --mode bench).
  • Documentation: The original mHC paper is available at https://arxiv.org/abs/2512.24880.

Highlighted Details

  • Achieves significant speedups compared to naive PyTorch mHC implementations, with benchmarks showing up to 13.6x faster forward and 11.1x faster backward passes for the static H path.
  • The dynamic H path offers up to an 11.0x backward-pass speedup while matching the paper's architecture.
  • Kernels are optimized for NVIDIA GPUs, specifically tested on H100 SXM5.
  • Provides both inference-optimized static H path and a dynamic H path matching the original paper's approach.
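The per-pass factors above can be folded into an expected whole-step speedup. The helper below is a simple harmonic-mean (Amdahl-style) calculation; the 13.6x/11.1x figures come from the static H path benchmarks quoted above, while the assumption that the forward pass accounts for one third of a training step is purely illustrative.

```python
def combined_speedup(fwd_frac, s_fwd, s_bwd):
    """Overall step speedup when the forward pass takes fwd_frac of the
    baseline step time and each pass is accelerated by the given factor."""
    bwd_frac = 1.0 - fwd_frac
    return 1.0 / (fwd_frac / s_fwd + bwd_frac / s_bwd)

# Static H path figures from the benchmarks (13.6x fwd, 11.1x bwd);
# the 1/3 forward share is an assumed split, not a measured one.
print(round(combined_speedup(1 / 3, 13.6, 11.1), 1))  # → 11.8
```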

Maintenance & Community

Contribution guidelines are detailed in CONTRIBUTING.md. The project is an unofficial implementation of work by DeepSeek-AI, the authors of the original mHC paper, and is not affiliated with them. No specific community channels (like Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type is not explicitly stated in the provided README snippet. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The implementation is CUDA-specific, requiring NVIDIA hardware and a compatible CUDA toolkit. The project appears to be an unofficial implementation, focusing on performance optimization rather than a full-featured library. The license status is unknown, which could impact adoption.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 17 hours ago
Starred by François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), Chaoyu Yang (Founder of Bento), and 13 more.

neon by NervanaSystems

0% · 4k stars
Deep learning framework (discontinued)
Created 11 years ago · Updated 5 years ago