benchm-ml by szilard

ML benchmark for speed, scalability, accuracy of classification libraries

Created 10 years ago · 1,886 stars · Top 23.6% on sourcepulse

Project Summary

This repository provides a minimal benchmark for evaluating the scalability, speed, and accuracy of popular open-source machine learning implementations for binary classification tasks. It targets data scientists and engineers working with tabular data, offering insights into which algorithms and libraries perform best on datasets with millions of observations and thousands of features.

How It Works

The benchmark focuses on binary classification with numeric and categorical inputs, simulating business applications like credit scoring or churn prediction. It varies dataset sizes from 10K to 10M observations with approximately 1K features (after one-hot encoding categoricals). Performance is measured by training time, peak RAM usage, CPU utilization, and AUC for algorithms including logistic regression, SVMs, random forests, gradient boosting machines (GBMs), and deep neural networks.
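
As a rough illustration of the measurement loop (a minimal sketch on synthetic data, not the repository's own R/Python scripts), each run boils down to timing model training and scoring AUC on held-out data; peak RAM and CPU utilization are tracked outside the training process and are omitted here:

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the benchmark's tabular data (the real runs
    # use 10K-10M rows and roughly 1K columns after one-hot encoding).
    X, y = make_classification(n_samples=100_000, n_features=100, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

    # Training time is one of the benchmark's core metrics.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    # AUC on held-out data is the benchmark's accuracy metric.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"training time: {train_time:.1f}s  AUC: {auc:.4f}")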

Quick Start & Requirements

  • Installation: The README provides no installation commands; users are expected to set up the environments for the various ML tools (R, scikit-learn, H2O, XGBoost, LightGBM, Spark MLlib) themselves (a quick sanity check is sketched after this list).
  • Prerequisites: A machine with substantial resources; tests were conducted on Amazon EC2 instances (e.g., c3.8xlarge with 60GB RAM, r3.8xlarge with 250GB RAM, and p2.xlarge for GPU testing). Python 3 and R environments are required.
  • Resources: Setup involves installing multiple ML libraries and downloading potentially large datasets.
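
Since the README prescribes no setup, a quick sanity check of the Python-side libraries can be as simple as the sketch below; the package names are the standard PyPI ones and are an assumption, not something the README specifies:

    import importlib

    # Python-side libraries used in the benchmark; R, Vowpal Wabbit, and
    # Spark MLlib have to be checked separately.
    for pkg in ("sklearn", "xgboost", "lightgbm", "h2o"):
        try:
            mod = importlib.import_module(pkg)
            print(f"{pkg:10s} {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            print(f"{pkg:10s} not installed")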

Highlighted Details

  • Scalability: Evaluates performance on datasets up to 10M rows, highlighting memory constraints and single-node vs. distributed capabilities.
  • Algorithm Comparison: Benchmarks linear models, Random Forests, GBMs, and Deep Neural Networks, detailing their trade-offs in accuracy and speed.
  • Tool Performance: Compares implementations across R, Python (scikit-learn), Vowpal Wabbit, H2O, XGBoost, LightGBM, and Spark MLlib.
  • Data Handling: Assesses performance with categorical features both one-hot encoded and handled natively by the library (contrasted in the sketch after this list).
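
To make that contrast concrete, here is a minimal sketch using LightGBM, one of the benchmarked tools; the data, parameters, and column names are illustrative rather than taken from the benchmark itself:

    import lightgbm as lgb
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 10_000
    df = pd.DataFrame({
        "num": rng.normal(size=n),
        "cat": pd.Categorical(rng.choice(list("abcdefgh"), size=n)),
    })
    y = ((df["num"] + df["cat"].cat.codes % 2) > 0.5).astype(int)

    # Route 1: one-hot encode, as required for tools without native
    # categorical support.
    X_ohe = pd.get_dummies(df, columns=["cat"], dtype=float)
    lgb.LGBMClassifier(n_estimators=50).fit(X_ohe, y)

    # Route 2: let LightGBM split on the categorical column directly
    # (the pandas "category" dtype is detected automatically).
    lgb.LGBMClassifier(n_estimators=50).fit(df, y)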

Maintenance & Community

The project was primarily active around 2015-2018, with updates noted in January 2018. The author indicates a shift to a successor repository (GBM-perf) for more focused benchmarking, particularly on GBMs and GPU implementations, using Docker for reproducibility.

Licensing & Compatibility

The repository itself is not explicitly licensed in the README. The underlying tools benchmarked have various open-source licenses (e.g., MIT, Apache 2.0). Compatibility for commercial use depends on the licenses of the individual ML libraries used.

Limitations & Caveats

The benchmark is described as "minimal" and "incomplete," with a deliberately narrow scope (dense tabular data, binary classification). The author notes that many results date from 2015 and that newer versions of the tools might perform differently. Active benchmarking has since moved to the successor repository, GBM-perf.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
