TileGym by NVIDIA

CUDA Tile kernel library for efficient GPU programming

Created 4 months ago
667 stars

Top 50.5% on SourcePulse

Summary

TileGym is a CUDA Tile kernel library designed to simplify and accelerate tile-based GPU programming. It offers a comprehensive collection of tutorials and practical examples, targeting developers learning GPU kernel optimization or seeking to enhance large language model (LLM) performance. By providing efficient kernel implementations and end-to-end integration examples with models like Llama 3.1 and DeepSeek V2, TileGym enables users to build and benchmark high-performance GPU kernels.

How It Works

The project uses CUDA Tile to implement optimized kernels for common deep learning operations. It focuses on practical tile-based programming patterns, showing how tiling both memory accesses and computation improves efficiency. Its integration examples show how these kernels can be used to accelerate inference for popular LLMs.
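
To make the idea of tiling concrete, here is a minimal pure-Python sketch of a tiled matrix multiply. It is not TileGym code and does not use the CUDA Tile API; the function name tiled_matmul and the tile parameter are hypothetical, chosen only for illustration. It shows the loop structure that tile-based kernels typically follow: the output is split into small blocks, and each block is accumulated by stepping along the shared dimension one tile at a time.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (M x K) by b (K x N), visiting the work one tile at a time.

    Illustrative only: a real GPU kernel would map each output tile to a
    block of threads and keep the tiles in fast on-chip memory.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]

    for i0 in range(0, m, tile):              # tile row of the output
        for j0 in range(0, n, tile):          # tile column of the output
            for k0 in range(0, k, tile):      # step along the shared dimension
                # Work on one (tile x tile) block; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = tiled_matmul(a, b, tile=2)
print(result)
```

The result is identical to an untiled multiply; only the order in which elements are visited changes, which is what lets a GPU kernel reuse data it has already loaded.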

Quick Start & Requirements

  • Primary install: Clone the repository, cd into the directory, and run pip install . (or pip install -e . for an editable install). A Dockerfile is also provided.
  • Prerequisites: Requires CUDA 13.1 and NVIDIA Blackwell architecture GPUs (e.g., B200, RTX 5080, RTX 5090). PyTorch version 2.9.1 or compatible is necessary.
  • Links: CUDA Downloads: https://developer.nvidia.com/cuda-downloads. cutile-python: https://github.com/nvidia/cutile-python.
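
Before installing, it can help to confirm that the local environment matches the prerequisites listed above. The following is a rough sketch, not part of TileGym: the helper names (parse_version, check_environment, probe_torch) are hypothetical, and it assumes that Blackwell GPUs report a compute capability major version of 10 (e.g., B200) or 12 (e.g., RTX 5080, RTX 5090). It treats any PyTorch older than 2.9.1 as a problem, which is one possible reading of "2.9.1 or compatible". It does not check the installed CUDA toolkit version.

```python
import re

MIN_TORCH = (2, 9, 1)
# Assumption: Blackwell devices report compute capability major 10 or 12.
BLACKWELL_MAJORS = {10, 12}


def parse_version(text):
    """Return the leading dotted integers, e.g. '2.9.1+cu130' -> (2, 9, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_environment(torch_version, capability):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is older than 2.9.1")
    if capability is None:
        problems.append("no CUDA-capable GPU is visible")
    elif capability[0] not in BLACKWELL_MAJORS:
        problems.append(f"GPU capability {capability} does not look like Blackwell")
    return problems


def probe_torch():
    """Read the PyTorch version and GPU capability, if PyTorch is importable."""
    try:
        import torch
    except ImportError:
        return None, None
    capability = None
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
    return torch.__version__, capability


if __name__ == "__main__":
    for problem in check_environment(*probe_torch()) or ["no problems found"]:
        print(problem)
```

The check is kept separate from the probing so that the logic can be exercised on machines without a GPU or without PyTorch installed.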

Highlighted Details

  • Rich collection of CUDA Tile kernel examples.
  • Practical kernel implementations for common deep learning operators.
  • Performance benchmarking tools to evaluate kernel efficiency.
  • End-to-end integration examples with LLMs like Llama 3.1 and DeepSeek V2.

Maintenance & Community

The project welcomes contributions and outlines guidelines in CONTRIBUTING.md, including a Contributor License Agreement (CLA) process. The provided README does not mention specific community channels or a roadmap.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The MIT license is permissive and generally compatible with commercial use and closed-source projects.

Limitations & Caveats

Currently, TileGym is built and tested exclusively on CUDA 13.1 and requires NVIDIA Blackwell architecture GPUs. Support for other GPU architectures is planned for future releases.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 21
  • Issues (30d): 0
  • Star history: 31 stars in the last 30 days
