deep-significance by Kaleidophon

SDK for statistical significance testing of deep neural networks

Created 4 years ago
336 stars

Top 81.8% on SourcePulse

Project Summary

This library provides statistical significance testing for deep neural networks, addressing the common issue of drawing conclusions from single performance scores rather than rigorous statistical analysis. It is targeted at machine learning practitioners and researchers who need to reliably compare model performance, offering methods to mitigate the impact of stochastic factors and hyperparameter sensitivity inherent in deep learning.

How It Works

The core of the library is the "Almost Stochastic Order" (ASO) test, which compares two score distributions without assuming they follow any particular parametric form. Unlike p-value-based tests, ASO quantifies how far one distribution is from stochastically dominating the other, summarized in a score ($\epsilon_\text{min}$): lower values indicate higher confidence that one model outperforms the other. The library also includes traditional bootstrap and permutation tests, along with Bonferroni correction for multiple comparisons and bootstrap power analysis for sample-size determination.
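Alongside ASO, the bundled bootstrap and permutation tests follow the standard resampling recipe. A minimal, library-independent sketch of a one-sided permutation test (an illustration of the technique, not deepsig's internal implementation):

```python
import numpy as np

def permutation_test(scores_a, scores_b, num_permutations=5000, seed=0):
    """Two-sample permutation test on the difference of means.

    Returns a p-value for the one-sided hypothesis that model A's
    mean score is higher than model B's.
    """
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = scores_a.mean() - scores_b.mean()
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random relabeling under the null hypothesis
        if pooled[:n_a].mean() - pooled[n_a:].mean() >= observed:
            count += 1
    # Add-one smoothing keeps the p-value strictly positive
    return (count + 1) / (num_permutations + 1)
```

When model A's scores clearly exceed model B's, the returned p-value is small; when the two samples are identical, it sits near 0.5.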

Quick Start & Requirements

  • Installation: pip3 install deepsig
  • Dependencies: Python and NumPy; PyTorch, TensorFlow, and Jax are supported for tensor compatibility.
  • Usage: The aso() function is the primary interface for comparing two sets of scores. multi_aso() handles comparisons across multiple models.

Highlighted Details

  • Implements "Almost Stochastic Order" (ASO) for robust model comparison, addressing non-convex loss landscapes and stochastic factors in deep learning.
  • Supports PyTorch, TensorFlow, Jax, and NumPy arrays directly.
  • Includes functions for Bonferroni correction, bootstrap power analysis, and uncertainty reduction estimation.
  • Offers multi-threading support via joblib for faster computations.
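As a generic illustration of the Bonferroni idea (the principle behind the library's correction, not its exact API): each raw p-value is multiplied by the number of comparisons and capped at 1, which controls the family-wise Type I error rate.

```python
import numpy as np

def bonferroni_correct(p_values):
    """Adjust p-values for multiple comparisons: multiply each raw
    p-value by the number of tests, capping the result at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)
```

For three comparisons, a raw p-value of 0.01 becomes 0.03, so a result must be three times stronger to survive at the same significance level.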

Maintenance & Community

The project is actively maintained, with contributions from the NLPnorth group at the IT University of Copenhagen. The README links to several papers that have used the library, indicating ongoing adoption and research use.

Licensing & Compatibility

The library is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README emphasizes that conclusions drawn from significance tests are only as reliable as the number of scores collected. It also notes that while ASO is generally preferred over traditional tests in deep learning, the choice of rejection threshold $\tau$ for $\epsilon_\text{min}$ can affect Type I error rates, with $\tau < 0.2$ recommended for more confident conclusions.
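The point about the number of collected scores can be made concrete with a quick bootstrap sketch (illustrative only; the library's bootstrap power analysis serves a related purpose). Holding the spread of scores fixed, the uncertainty on the mean shrinks roughly as $1/\sqrt{n}$:

```python
import numpy as np

def bootstrap_se(scores, num_resamples=2000, seed=0):
    """Estimate the standard error of the mean score by resampling
    the observed scores with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(num_resamples)]
    return float(np.std(means))

few_scores = np.linspace(0.60, 0.90, 5)   # e.g. 5 training runs
many_scores = np.tile(few_scores, 10)     # same spread, 50 runs
# Ten times as many scores shrink the standard error of the mean
# by roughly sqrt(10), which is why conclusions drawn from a
# handful of runs are fragile.
```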

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Christian Laforte (Distinguished Engineer at NVIDIA; former CTO at Stability AI) and Daniel Han (cofounder of Unsloth).

cifar10-airbench by KellerJordan

1.0%
295
Fast CIFAR-10 training benchmarks
Created 1 year ago
Updated 2 months ago
Starred by Zhiqiang Xie (coauthor of SGLang), Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence

1.9%
569
Benchmark for LLMs generating GPU kernels from PyTorch ops
Created 10 months ago
Updated 3 weeks ago