TileGym by NVIDIA

CUDA Tile kernel library for efficient GPU programming

Created 1 month ago
554 stars

Top 57.7% on SourcePulse

Summary

TileGym is a CUDA Tile kernel library for tile-based GPU programming. It collects tutorials and practical examples aimed at developers who are learning GPU kernel optimization or who want to speed up large language model (LLM) workloads. Alongside its kernel implementations, it includes end-to-end integration examples with models such as Llama 3.1 and DeepSeek V2, so users can build and benchmark high-performance GPU kernels in a realistic setting.

How It Works

The project uses CUDA Tile to provide optimized kernel implementations for common deep learning operations. Its approach centers on practical, tile-based programming patterns: work is split into tiles so that memory access and computation stay efficient. The integration examples show how these kernels can be applied to inference for popular LLMs.
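
The tiling idea described above can be illustrated without a GPU. The following is a minimal pure-Python sketch of a tiled matrix multiplication; it is not TileGym or cuTile code, and the function name `tiled_matmul` and the `tile` parameter are hypothetical names chosen for this illustration. It only shows the loop structure: the output is processed one tile at a time, and each output tile accumulates partial products from matching tiles of the inputs.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (M x K) by b (K x N), visiting the data tile by tile.

    Illustrative only: a real GPU kernel would map each output tile to a
    block of threads and stage the input tiles in fast on-chip memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile rows of the output
        for j0 in range(0, n, tile):          # tile columns of the output
            for k0 in range(0, k, tile):      # step along the shared dimension
                # Accumulate this (i0, j0) output tile from one pair of input tiles.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b, tile=2))
```

The `min(...)` bounds handle matrices whose sizes are not multiples of the tile size, which is the same edge case a GPU tile kernel has to deal with through masking or padding.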

Quick Start & Requirements

  • Primary install: Clone the repository, cd into the directory, and run "pip install ." (or "pip install -e ." for an editable install). A Dockerfile is also provided.
  • Prerequisites: Requires CUDA 13.1 and NVIDIA Blackwell architecture GPUs (e.g., B200, RTX 5080, RTX 5090). PyTorch version 2.9.1 or compatible is necessary.
  • Links: CUDA Downloads (https://developer.nvidia.com/cuda-downloads); cutile-python (https://github.com/nvidia/cutile-python)
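
Before installing, it can help to confirm that a machine meets the prerequisites listed above. The sketch below is a hypothetical helper, not part of TileGym. It rests on two assumptions: that "2.9.1 or compatible" can be treated as a minimum PyTorch version, and that Blackwell GPUs report a compute capability with major version 10 (for example B200) or 12 (for example RTX 5080 and RTX 5090).

```python
import re

# Assumption: Blackwell parts report compute capability major 10 or 12.
ASSUMED_BLACKWELL_MAJORS = {10, 12}
REQUIRED_CUDA = (13, 1)
MIN_TORCH = (2, 9, 1)  # assumption: "2.9.1 or compatible" read as a minimum


def parse_version(text):
    """Turn '13.1' or '2.9.1+cu130' into a tuple of ints such as (13, 1)."""
    core = text.split("+")[0]  # drop local build tags like '+cu130'
    return tuple(int(n) for n in re.findall(r"\d+", core)[:3])


def check_prerequisites(cuda_version, torch_version, compute_capability):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if parse_version(cuda_version)[:2] != REQUIRED_CUDA:
        problems.append(f"CUDA {cuda_version} found, 13.1 expected")
    if parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} found, 2.9.1 or newer expected")
    major, _minor = compute_capability
    if major not in ASSUMED_BLACKWELL_MAJORS:
        problems.append(f"compute capability {major}.{_minor} is not Blackwell")
    return problems


if __name__ == "__main__":
    # Example values; replace with the values reported by your system.
    for problem in check_prerequisites("12.8", "2.8.0", (9, 0)):
        print("problem:", problem)
```

On a machine with PyTorch installed, the three inputs could be taken from `torch.version.cuda`, `torch.__version__`, and `torch.cuda.get_device_capability()`.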

Highlighted Details

  • Rich collection of CUDA Tile kernel examples.
  • Practical kernel implementations for common deep learning operators.
  • Performance benchmarking tools to evaluate kernel efficiency.
  • End-to-end integration examples with LLMs like Llama 3.1 and DeepSeek V2.

Maintenance & Community

The project welcomes contributions and outlines guidelines in CONTRIBUTING.md, including a Contributor License Agreement (CLA) process. The provided README does not mention specific community channels or a roadmap.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The MIT license is permissive and generally compatible with commercial use and closed-source projects.

Limitations & Caveats

TileGym is currently built and tested only on CUDA 13.1 and requires NVIDIA Blackwell architecture GPUs. Support for other GPU architectures is planned for future releases.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 23
  • Issues (30d): 1
  • Star history: 167 stars in the last 30 days
