mHC.cu by AndreSlavescu

Accelerating deep learning with CUDA mHC kernels

Created 1 month ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This repository provides an unofficial CUDA implementation of DeepSeek-AI's Manifold-Constrained Hyper-Connections (mHC) layer. It targets researchers and engineers seeking to accelerate deep learning model training and inference on NVIDIA GPUs by offering highly optimized kernels. The primary benefit is substantial performance gains over standard PyTorch implementations.

How It Works

The project implements mHC kernels directly in CUDA, enabling native GPU acceleration. It supports two modes: the default "Dynamic H Path," where H values are computed per batch via learned projections, and a "Static H Path" optimized for faster inference by sharing a single H across the batch. Bypassing PyTorch's framework overhead with native CUDA kernels is what yields the significant speedups.
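To make the two modes concrete, here is a minimal NumPy sketch of the distinction described above. This is illustrative only: the function names, shapes, and the tanh projection are assumptions for exposition, not the repository's actual API or the paper's exact parameterization.

```python
import numpy as np

def dynamic_h_path(x, w_proj):
    """Dynamic H path (illustrative): a mixing matrix H is computed
    per batch element from the input via a learned projection."""
    b, n, d = x.shape                        # batch, residual streams, width
    # pool each stream, then project pooled features to an n x n matrix
    h = np.tanh(x.mean(axis=2) @ w_proj)     # (b, n*n) mixing logits
    h = h.reshape(b, n, n)
    return np.einsum("bij,bjd->bid", h, x)   # mix streams with per-sample H

def static_h_path(x, h):
    """Static H path (illustrative): one H shared across the whole
    batch, which is cheaper and inference-friendly."""
    return np.einsum("ij,bjd->bid", h, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))           # 2 samples, 4 streams, width 8
w = rng.standard_normal((4, 16)) * 0.1       # projection to 4x4 H entries
out = dynamic_h_path(x, w)
print(out.shape)                             # (2, 4, 8)
```

Note the trade-off this sketch exposes: the dynamic path must materialize a separate H per sample, while the static path applies one fixed matrix, which is why the static variant benchmarks faster for inference.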

Quick Start & Requirements

  • Installation: Install the PyTorch extension with make install; for development, use make install-dev.
  • Build: Compile the C++/CUDA sources with make for all architectures, or with make CUDA_ARCH=90 for a specific NVIDIA architecture (90 targets H100).
  • Testing: Run C++/CUDA tests with make test and Python tests with make test-python.
  • Benchmarking: Execute C++/CUDA benchmarks via make bench and Python benchmarks via make bench-python.
  • Prerequisites: Requires CUDA-enabled NVIDIA GPUs (e.g., H100, B200). PyTorch is necessary for the extension.
  • Modal Usage: The runmodal.py script facilitates testing and benchmarking on cloud GPUs (e.g., modal run runmodal.py --gpu h100 --mode bench).
  • Documentation: The original mHC paper is available at https://arxiv.org/abs/2512.24880.

Highlighted Details

  • Achieves significant speedups compared to naive PyTorch mHC implementations, with benchmarks showing up to 13.6x faster forward and 11.1x faster backward passes for the static H path.
  • The dynamic H path offers up to an 11.0x backward-pass speedup while matching the paper's architecture.
  • Kernels are optimized for NVIDIA GPUs, specifically tested on H100 SXM5.
  • Provides both inference-optimized static H path and a dynamic H path matching the original paper's approach.
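The per-pass factors above can be folded into an expected whole-step speedup. The helper below is a simple harmonic-mean (Amdahl-style) calculation; the 13.6x/11.1x figures come from the static H path benchmarks quoted above, while the assumption that the forward pass accounts for one third of a training step is purely illustrative.

```python
def combined_speedup(fwd_frac, s_fwd, s_bwd):
    """Overall step speedup when the forward pass takes fwd_frac of the
    baseline step time and each pass is accelerated by the given factor."""
    bwd_frac = 1.0 - fwd_frac
    return 1.0 / (fwd_frac / s_fwd + bwd_frac / s_bwd)

# Static H path figures from the benchmarks (13.6x fwd, 11.1x bwd);
# the 1/3 forward share is an assumed split, not a measured one.
print(round(combined_speedup(1 / 3, 13.6, 11.1), 1))  # → 11.8
```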

Maintenance & Community

Contribution guidelines are detailed in CONTRIBUTING.md. The project is an unofficial implementation of work by DeepSeek-AI, the authors of the original mHC paper, and is not affiliated with them. No specific community channels (like Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type is not explicitly stated in the provided README snippet. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The implementation is CUDA-specific, requiring NVIDIA hardware and a compatible CUDA toolkit. The project appears to be an unofficial implementation, focusing on performance optimization rather than a full-featured library. The license status is unknown, which could impact adoption.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 17 hours ago
Starred by François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize), Chaoyu Yang (Founder of Bento), and 13 more.

neon by NervanaSystems

0% · 4k stars
Deep learning framework (discontinued)
Created 11 years ago · Updated 5 years ago