SDK for statistical significance testing of deep neural networks
This library provides statistical significance testing for deep neural networks, addressing the common issue of drawing conclusions from single performance scores rather than rigorous statistical analysis. It is targeted at machine learning practitioners and researchers who need to reliably compare model performance, offering methods to mitigate the impact of stochastic factors and hyperparameter sensitivity inherent in deep learning.
How It Works
The core of the library is the "Almost Stochastic Order" (ASO) test, which compares score distributions without assuming a specific parametric form. Unlike p-value-based tests, ASO quantifies the extent to which stochastic dominance of one distribution over another is violated, summarized in a single score, $\epsilon_\text{min}$. Lower $\epsilon_\text{min}$ values indicate higher confidence that one model outperforms the other. The library also includes traditional bootstrap and permutation tests, along with Bonferroni correction for multiple comparisons and bootstrap power analysis for sample-size determination.
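A minimal sketch of such a comparison, assuming `deepsig` exposes `aso` and a `bootstrap_test` helper as named in its documentation (the exact signatures and the `seed` argument may differ across versions):

```python
import numpy as np
from deepsig import aso, bootstrap_test  # bootstrap_test name assumed from the package docs

# Simulated per-run test scores for two models (e.g., accuracies over 20 random seeds).
rng = np.random.default_rng(0)
scores_a = rng.normal(loc=0.80, scale=0.03, size=20)
scores_b = rng.normal(loc=0.77, scale=0.03, size=20)

# ASO: eps_min close to 0 favors model A, 0.5 indicates no stochastic order.
eps_min = aso(scores_a, scores_b, seed=0)

# Classical alternative included in the library: a bootstrap test returning a p-value.
p_value = bootstrap_test(scores_a, scores_b)

print(f"eps_min = {eps_min:.3f}, bootstrap p-value = {p_value:.3f}")
```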
Quick Start & Requirements

```
pip3 install deepsig
```

The `aso()` function is the primary interface for comparing two sets of scores; `multi_aso()` handles comparisons across multiple models (see the sketch below).
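A hedged usage sketch of the two entry points; the mapping-style input to `multi_aso()` and the `num_jobs` argument are assumptions based on the package's README and may vary by version:

```python
import numpy as np
from deepsig import aso, multi_aso

rng = np.random.default_rng(42)
scores = {
    "baseline": rng.normal(0.75, 0.02, size=15),
    "model_a": rng.normal(0.78, 0.02, size=15),
    "model_b": rng.normal(0.79, 0.02, size=15),
}

# Pairwise comparison: does model_a's score distribution dominate the baseline's?
eps_min = aso(scores["model_a"], scores["baseline"])

# All-vs-all comparison: one eps_min score per ordered model pair.
# Computations can be parallelized via joblib, e.g. through the num_jobs
# argument documented for these functions (argument name assumed).
eps_min_matrix = multi_aso(scores, num_jobs=2)
```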
Highlighted Details

Computations can be parallelized via joblib for faster results (see the `num_jobs` note in the sketch above).

Maintenance & Community
The project is actively maintained, with contributions from the NLPnorth group at the IT University of Copenhagen. The README links to several papers that have used the library, indicating ongoing adoption and research use.
Licensing & Compatibility
The library is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README emphasizes that conclusions drawn from significance tests are only as reliable as the number of scores collected. It also notes that while ASO is generally preferred over traditional tests in deep learning settings, the choice of decision threshold for $\epsilon_\text{min}$ affects Type I error rates; declaring dominance only when $\epsilon_\text{min} < \tau$ with $\tau = 0.2$ is recommended for more confident conclusions.
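To make that threshold concrete, a hedged reading of the decision rule (illustrative values, not the library's mandated procedure):

```python
eps_min = 0.12  # hypothetical result of aso(scores_a, scores_b)
tau = 0.2       # stricter threshold recommended for more confident conclusions

if eps_min < tau:
    print("Treat model A as almost stochastically dominant over model B.")
else:
    print("No confident conclusion about dominance at this threshold.")
```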