ML benchmark for speed, scalability, accuracy of classification libraries
This repository provides a minimal benchmark for evaluating the scalability, speed, and accuracy of popular open-source machine learning implementations for binary classification tasks. It targets data scientists and engineers working with tabular data, offering insights into which algorithms and libraries perform best on datasets with up to 10 million observations and roughly a thousand features.
How It Works
The benchmark focuses on binary classification with numeric and categorical inputs, simulating business applications like credit scoring or churn prediction. It varies dataset sizes from 10K to 10M observations with approximately 1K features (after one-hot encoding categoricals). Performance is measured by training time, peak RAM usage, CPU utilization, and AUC for algorithms including logistic regression, SVMs, random forests, gradient boosting machines (GBMs), and deep neural networks.
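The measurements described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the repository's actual harness: the helper names (`auc`, `measure`, `make_data`, `train_rate_model`, `run`) and the toy "model" (a per-category positive rate) are hypothetical stand-ins for a real library's training call. `tracemalloc` only tracks memory allocated by Python itself and slows execution while tracing, so a real benchmark would more likely read process-level peak RAM and time the run separately. CPU utilization is not measured here.

```python
import random
import time
import tracemalloc


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    Tied scores receive the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes present")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def measure(train_fn, *args):
    """Run train_fn(*args) and return (result, seconds, peak_bytes).

    peak_bytes covers Python-level allocations only (an assumption of this
    sketch), and the timing includes tracemalloc's tracing overhead.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = train_fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def make_data(n, seed=0):
    """Synthetic data: one categorical input whose level drives the label."""
    rng = random.Random(seed)
    rates = {"a": 0.1, "b": 0.5, "c": 0.9}
    cats = [rng.choice("abc") for _ in range(n)]
    labels = [1 if rng.random() < rates[c] else 0 for c in cats]
    return cats, labels


def train_rate_model(cats, labels):
    """Toy 'training': learn the observed positive rate of each category."""
    counts, positives = {}, {}
    for c, y in zip(cats, labels):
        counts[c] = counts.get(c, 0) + 1
        positives[c] = positives.get(c, 0) + y
    return {c: positives[c] / counts[c] for c in counts}


def run(sizes=(1_000, 10_000)):
    """Train at each size; return (n, seconds, peak_bytes, test AUC) tuples."""
    results = []
    for n in sizes:
        cats, labels = make_data(n, seed=0)
        test_cats, test_labels = make_data(n, seed=1)
        model, seconds, peak = measure(train_rate_model, cats, labels)
        scores = [model.get(c, 0.5) for c in test_cats]
        results.append((n, seconds, peak, auc(test_labels, scores)))
    return results


if __name__ == "__main__":
    for n, seconds, peak, score in run():
        print(f"n={n:>6}  time={seconds:.4f}s  peak={peak / 1e6:.2f}MB  AUC={score:.3f}")
```

Swapping `train_rate_model` for a real library's fit call, and the size tuple for the 10K to 10M range, gives the same loop shape: train at each size, record time and memory, then score held-out data with AUC.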
Quick Start & Requirements
Amazon EC2 instances are referenced for the hardware (c3.8xlarge with 60GB RAM, r3.8xlarge with 250GB RAM, and p2.xlarge for GPU testing). Python 3 and R environments are necessary.
Highlighted Details
Maintenance & Community
The project was primarily active around 2015-2018, with updates noted in January 2018. The author indicates a shift to a successor repository (GBM-perf) for more focused benchmarking, particularly on GBMs and GPU implementations, using Docker for reproducibility.
Licensing & Compatibility
The repository itself is not explicitly licensed in the README. The underlying tools benchmarked have various open-source licenses (e.g., MIT, Apache 2.0). Compatibility for commercial use depends on the licenses of the individual ML libraries used.
Limitations & Caveats
The benchmark is described as "minimal" and "incomplete," with a focus on specific data structures and problem types. The author notes that many results are from 2015 and that newer versions of tools might perform differently. The project's primary focus has shifted to a separate repository.