flashlib by FlashML-org

Accelerating classical machine learning with GPU operators

Created 2 weeks ago

506 stars

Top 61.1% on SourcePulse

Project Summary

FlashML-org/flashlib provides fast, memory-efficient classical machine learning operators for GPUs. It targets engineers and researchers who want to speed up ML pipelines by replacing CPU-bound or less efficient GPU implementations with specialized, high-throughput kernels.

How It Works

The library is built on Triton and CuteDSL, which it uses to implement custom GPU kernels for classical ML tasks. This low-level control over computation and memory access is what the project relies on to outperform standard implementations. The design prioritizes performance through a broad set of specialized primitives and efficient data flow on NVIDIA GPUs.

Quick Start & Requirements

  • Installation: pip install flashlib or from source (git clone https://github.com/FlashML-org/flashlib.git, then pip install -e .).
  • Prerequisites: Requires a CUDA-enabled GPU.
  • Resources: GPU memory and compute are essential for primitive execution.
  • Links: Blog post covering motivation, design, benchmarks, and the full API: https://flashml-org.github.io/.

Highlighted Details

  • Offers 15 high-level primitives across Clustering, Nearest Neighbors, Decomposition, Manifold, Regression, Classification, and Preprocessing families.
  • Includes low-level linear algebra operations and numerous multi-precision GEMM variants.
  • Features a flashlib.info submodule that estimates runtime, FLOPs, and HBM bytes in ~5 µs on CPU, aiding pipeline budgeting.
  • Primitives are accessible as both top-level flash_* functions and scikit-learn-style classes.
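
The kind of budgeting estimate described in the flashlib.info bullet can be illustrated with a simple roofline model. The sketch below is not flashlib's API: DeviceSpec, CostEstimate, and estimate_gemm are hypothetical names, and the peak-throughput and bandwidth figures are placeholder values rather than measurements of any real GPU. It assumes a single GEMM whose operands are each read or written exactly once, counts FLOPs and HBM bytes analytically, and takes the larger of the compute time and the memory time as the runtime estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSpec:
    """Hypothetical device description used only for this sketch."""
    peak_flops: float      # sustained FLOP/s (placeholder value, not measured)
    hbm_bandwidth: float   # sustained HBM bytes/s (placeholder value, not measured)


@dataclass(frozen=True)
class CostEstimate:
    flops: float       # floating-point operations
    hbm_bytes: float   # bytes moved to and from HBM
    seconds: float     # roofline runtime estimate
    bound: str         # "compute" or "memory"


def estimate_gemm(m: int, n: int, k: int, bytes_per_element: int,
                  device: DeviceSpec) -> CostEstimate:
    """Roofline estimate for C[m, n] = A[m, k] @ B[k, n]."""
    # Each output element needs k multiplies and k adds.
    flops = 2.0 * m * n * k
    # Assumption: A and B are read once and C is written once.
    hbm_bytes = float(bytes_per_element) * (m * k + k * n + m * n)
    compute_s = flops / device.peak_flops
    memory_s = hbm_bytes / device.hbm_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return CostEstimate(flops, hbm_bytes, max(compute_s, memory_s), bound)


if __name__ == "__main__":
    # Placeholder device: 1 TFLOP/s and 100 GB/s.
    device = DeviceSpec(peak_flops=1e12, hbm_bandwidth=1e11)
    print(estimate_gemm(1000, 1000, 1000, bytes_per_element=4, device=device))
```

Because the estimate is closed-form arithmetic, it runs on the CPU without touching the GPU, which is what makes this style of estimator cheap enough to call while planning a pipeline.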

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

Licensed under the Apache License 2.0, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

flashlib focuses exclusively on classical machine learning operators and requires a CUDA-enabled GPU for its core functionality. It does not cover deep learning models.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 507 stars in the last 17 days
