flashlib by FlashML-org

Accelerating classical machine learning with GPU operators

Created 2 weeks ago

506 stars

Top 61.1% on SourcePulse

Project Summary

FlashML-org/flashlib provides fast, memory-efficient classical machine learning operators for GPUs. It targets engineers and researchers who want to speed up ML pipelines by replacing CPU-bound or less efficient GPU implementations with specialized, high-throughput kernels.

How It Works

The library is built on Triton and CuteDSL, which it uses to write custom GPU kernels for classical ML tasks. This low-level control allows optimizations in computation and memory access that are intended to outperform standard implementations. The design prioritizes performance through a diverse set of specialized primitives and efficient data flow on NVIDIA GPUs.
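To make the memory-access point concrete, here is a minimal pure-Python sketch of one optimization a custom kernel can apply to a classical ML task: nearest-centroid assignment that keeps only a running minimum instead of storing the full N x K distance matrix. This is a generic illustration, not flashlib's kernel code; the function name assign_fused and the tile parameter are hypothetical, and the summary does not state which techniques flashlib actually uses.

```python
from typing import List, Sequence


def assign_fused(
    points: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    tile: int = 2,
) -> List[int]:
    """Return the index of the nearest centroid for each point.

    Hypothetical illustration: distances are computed tile by tile and
    reduced to a running minimum, so no N x K matrix is ever stored.
    """
    # Squared centroid norms are computed once and reused for every point.
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels: List[int] = []
    for x in points:
        x_norm = sum(v * v for v in x)
        best_dist, best_k = float("inf"), -1
        # Walk the centroids in tiles, as a GPU kernel would walk blocks.
        for start in range(0, len(centroids), tile):
            stop = min(start + tile, len(centroids))
            for k in range(start, stop):
                dot = sum(a * b for a, b in zip(x, centroids[k]))
                # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
                dist = x_norm - 2.0 * dot + c_norms[k]
                if dist < best_dist:
                    best_dist, best_k = dist, k
        labels.append(best_k)
    return labels
```

Writing the distance as ||x||^2 - 2 x.c + ||c||^2 turns the bulk of the work into dot products, which is the part a GPU kernel would express as a matrix multiply; the running minimum is what avoids the extra memory traffic.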

Quick Start & Requirements

Installation: run pip install flashlib, or install from source (git clone https://github.com/FlashML-org/flashlib.git, then pip install -e . from the cloned directory).
Prerequisites: a CUDA-enabled GPU.
Resources: GPU memory and compute are essential for primitive execution.
Links: Blog post for motivation, design, benchmarks, and full API: https://flashml-org.github.io/.
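Before running any primitive, it can help to confirm the two prerequisites listed above. The following is a rough sketch using only the Python standard library; the helper name check_environment is hypothetical, and finding nvidia-smi on PATH is only a heuristic for an NVIDIA driver being present, not proof that a usable CUDA device exists.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether flashlib is importable and nvidia-smi is on PATH."""
    return {
        # find_spec returns None when the top-level package is not installed.
        "flashlib_installed": importlib.util.find_spec("flashlib") is not None,
        # Heuristic only: the NVIDIA driver tools usually ship nvidia-smi.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```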

Highlighted Details

Offers 15 high-level primitives across Clustering, Nearest Neighbors, Decomposition, Manifold, Regression, Classification, and Preprocessing families.
Includes low-level linear algebra operations and numerous multi-precision GEMM variants.
Features a flashlib.info submodule that estimates runtime, FLOPs, and HBM bytes in ~5 µs on CPU, aiding pipeline budgeting.
Primitives are accessible as both top-level flash_* functions and scikit-learn-style classes.
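To illustrate the kind of budgeting that runtime, FLOP, and HBM-byte estimates support, here is a simple roofline-style calculation for a single matrix multiply. It is not the flashlib.info API, whose functions and signatures are not described in this summary; estimate_gemm, GemmEstimate, and the default peak throughput and bandwidth figures are hypothetical placeholders to replace with your own hardware numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmEstimate:
    flops: int        # floating-point operations
    hbm_bytes: int    # bytes moved to and from GPU memory
    seconds: float    # lower-bound runtime from the roofline model


def estimate_gemm(
    m: int,
    n: int,
    k: int,
    dtype_bytes: int = 2,          # e.g. 2 bytes for fp16 (assumption)
    peak_flops: float = 1.0e14,    # placeholder peak throughput, FLOP/s
    hbm_bandwidth: float = 1.0e12, # placeholder memory bandwidth, bytes/s
) -> GemmEstimate:
    """Estimate the cost of C = A @ B with A (m x k) and B (k x n)."""
    # One multiply and one add per (i, j, l) triple.
    flops = 2 * m * n * k
    # Read A and B once, write C once; ignores caching and re-reads.
    hbm_bytes = dtype_bytes * (m * k + k * n + m * n)
    # Runtime is bounded by the slower of compute and memory traffic.
    seconds = max(flops / peak_flops, hbm_bytes / hbm_bandwidth)
    return GemmEstimate(flops, hbm_bytes, seconds)
```

Because the estimate is pure arithmetic, it runs on the CPU in microseconds, which is why this style of model suits planning a pipeline before any kernel is launched.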

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

Licensed under the Apache License 2.0, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

flashlib focuses exclusively on classical machine learning operators and does not cover deep learning models. Its core functionality requires a CUDA-enabled GPU.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive

Star History

507 stars in the last 17 days