benchm-ml by szilard

ML benchmark for speed, scalability, accuracy of classification libraries

Created 10 years ago · 1,886 stars · Top 23.6% on sourcepulse

Project Summary

This repository provides a minimal benchmark for evaluating the scalability, speed, and accuracy of popular open-source machine learning implementations for binary classification tasks. It targets data scientists and engineers working with tabular data, offering insights into which algorithms and libraries perform best on datasets with millions of observations and thousands of features.

How It Works

The benchmark focuses on binary classification with numeric and categorical inputs, simulating business applications like credit scoring or churn prediction. It varies dataset sizes from 10K to 10M observations with approximately 1K features (after one-hot encoding categoricals). Performance is measured by training time, peak RAM usage, CPU utilization, and AUC for algorithms including logistic regression, SVMs, random forests, gradient boosting machines (GBMs), and deep neural networks.
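
As a rough illustration of the measurement loop (a minimal sketch on synthetic data, not the repository's own R/Python scripts), each run boils down to timing model training and scoring AUC on held-out data; peak RAM and CPU utilization are tracked outside the training process and are omitted here:

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the benchmark's tabular data (the real runs
    # use 10K-10M rows and roughly 1K columns after one-hot encoding).
    X, y = make_classification(n_samples=100_000, n_features=100, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

    # Training time is one of the benchmark's core metrics.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    # AUC on held-out data is the benchmark's accuracy metric.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"training time: {train_time:.1f}s  AUC: {auc:.4f}")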

Quick Start & Requirements

  • Installation: The README provides no installation commands; users are expected to set up the environments for the various ML tools (R, scikit-learn, H2O, XGBoost, LightGBM, Spark MLlib) themselves (a quick sanity check is sketched after this list).
  • Prerequisites: A machine with substantial resources; tests were conducted on Amazon EC2 instances (e.g., c3.8xlarge with 60GB RAM, r3.8xlarge with 250GB RAM, and p2.xlarge for GPU testing). Python 3 and R environments are required.
  • Resources: Setup involves installing multiple ML libraries and downloading potentially large datasets.
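
Since the README prescribes no setup, a quick sanity check of the Python-side libraries can be as simple as the sketch below; the package names are the standard PyPI ones and are an assumption, not something the README specifies:

    import importlib

    # Python-side libraries used in the benchmark; R, Vowpal Wabbit, and
    # Spark MLlib have to be checked separately.
    for pkg in ("sklearn", "xgboost", "lightgbm", "h2o"):
        try:
            mod = importlib.import_module(pkg)
            print(f"{pkg:10s} {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            print(f"{pkg:10s} not installed")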

Highlighted Details

  • Scalability: Evaluates performance on datasets up to 10M rows, highlighting memory constraints and single-node vs. distributed capabilities.
  • Algorithm Comparison: Benchmarks linear models, Random Forests, GBMs, and Deep Neural Networks, detailing their trade-offs in accuracy and speed.
  • Tool Performance: Compares implementations across R, Python (scikit-learn), Vowpal Wabbit, H2O, XGBoost, LightGBM, and Spark MLlib.
  • Data Handling: Assesses performance with categorical features both one-hot encoded and handled natively by the library (contrasted in the sketch after this list).
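
To make that contrast concrete, here is a minimal sketch using LightGBM, one of the benchmarked tools; the data, parameters, and column names are illustrative rather than taken from the benchmark itself:

    import lightgbm as lgb
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 10_000
    df = pd.DataFrame({
        "num": rng.normal(size=n),
        "cat": pd.Categorical(rng.choice(list("abcdefgh"), size=n)),
    })
    y = ((df["num"] + df["cat"].cat.codes % 2) > 0.5).astype(int)

    # Route 1: one-hot encode, as required for tools without native
    # categorical support.
    X_ohe = pd.get_dummies(df, columns=["cat"], dtype=float)
    lgb.LGBMClassifier(n_estimators=50).fit(X_ohe, y)

    # Route 2: let LightGBM split on the categorical column directly
    # (the pandas "category" dtype is detected automatically).
    lgb.LGBMClassifier(n_estimators=50).fit(df, y)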

Maintenance & Community

The project was primarily active around 2015-2018, with updates noted in January 2018. The author indicates a shift to a successor repository (GBM-perf) for more focused benchmarking, particularly on GBMs and GPU implementations, using Docker for reproducibility.

Licensing & Compatibility

The repository itself is not explicitly licensed in the README. The underlying tools benchmarked have various open-source licenses (e.g., MIT, Apache 2.0). Compatibility for commercial use depends on the licenses of the individual ML libraries used.

Limitations & Caveats

The benchmark is described as "minimal" and "incomplete," with a deliberately narrow scope (dense tabular data, binary classification). The author notes that many results date from 2015 and that newer versions of the tools might perform differently. Active benchmarking has since moved to the successor repository, GBM-perf.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
