automl-gs by minimaxir

AutoML tool for generating ML/DL models and Python code from CSV data

Created 7 years ago

1,866 stars

Top 23.0% on SourcePulse

4 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Andrey Vasnetsov

Cofounder of Qdrant

Eric Zhu

Coauthor of AutoGen; Research Scientist at Microsoft Research

Cristóbal Valenzuela

Cofounder of Runway

Project Summary

This project provides an AutoML tool for tabular data. Given a CSV file and a target field, it generates a trained machine learning or deep learning model together with native Python code. It targets citizen data scientists and engineers: no code has to be written to obtain an optimized data transformation and prediction pipeline, and the preprocessing and modeling details are handled by the tool.

How It Works

automl-gs generates raw Python code from Jinja templates and trains each candidate model in a subprocess, iterating through different hyperparameters. It infers data types, applies framework-specific ETL strategies (e.g., datetime encoding, text embeddings/vectorization), and constructs models using the specified framework, such as TensorFlow/Keras or XGBoost. The best-performing model and its associated pipeline code are saved, so the result can be integrated and used for prediction without any ongoing dependency on the tool.
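The generate-then-train loop described above can be illustrated with a minimal sketch. This is not automl-gs's actual code: it swaps Jinja for the standard library's string.Template so the example has no third-party dependencies, and the template text, the learning_rate parameter, the toy scoring formula, and the function names run_trial and search are all assumptions made for illustration.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template

# Stand-in for a Jinja template: a tiny "training script" with one
# hyperparameter slot. The scoring formula is a toy, not real training.
SCRIPT_TEMPLATE = Template(
    "import json\n"
    "learning_rate = $learning_rate\n"
    "# Toy objective standing in for model training; it peaks at 0.1.\n"
    "score = 1.0 - abs(learning_rate - 0.1)\n"
    "print(json.dumps({'learning_rate': learning_rate, 'score': score}))\n"
)


def run_trial(learning_rate, workdir):
    """Render the script for one hyperparameter value and run it in a subprocess."""
    script_path = Path(workdir) / "model.py"
    script_path.write_text(SCRIPT_TEMPLATE.substitute(learning_rate=learning_rate))
    completed = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
        check=True,
    )
    # The generated script reports its result as one JSON line on stdout.
    return json.loads(completed.stdout)


def search(candidates):
    """Try every candidate value and return the best-scoring trial."""
    with tempfile.TemporaryDirectory() as workdir:
        results = [run_trial(value, workdir) for value in candidates]
    return max(results, key=lambda result: result["score"])


best = search([0.01, 0.1, 0.5])
print(best)
```

Running each trial in its own process keeps one trial's state from leaking into the next, and the rendered script is ordinary Python that can be kept and run without the generator.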

Quick Start & Requirements

Install via pip: pip3 install automl_gs
Install an ML/DL framework separately (e.g., tensorflow or xgboost).
Run from CLI: automl_gs <csv_path> <target_field>
Example: automl_gs titanic.csv Survived --framework xgboost --num_trials 1000
Python API: from automl_gs import automl_grid_search; automl_grid_search('titanic.csv', 'Survived')
Official examples: Jupyter Notebook
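For scripted runs, the CLI call can be assembled and launched from Python. The sketch below uses only the arguments and flags that appear in the example above (--framework and --num_trials); the helper names build_command and run_search are hypothetical and are not part of automl-gs.

```python
import shutil
import subprocess


def build_command(csv_path, target_field, framework=None, num_trials=None):
    """Assemble the automl_gs CLI call from the arguments shown in Quick Start."""
    command = ["automl_gs", csv_path, target_field]
    if framework is not None:
        command += ["--framework", framework]
    if num_trials is not None:
        command += ["--num_trials", str(num_trials)]
    return command


def run_search(csv_path, target_field, **options):
    """Run the CLI, failing early with a clear message if it is not installed."""
    command = build_command(csv_path, target_field, **options)
    if shutil.which(command[0]) is None:
        raise RuntimeError("automl_gs not found on PATH; try: pip3 install automl_gs")
    return subprocess.run(command, check=True)


print(build_command("titanic.csv", "Survived", framework="xgboost", num_trials=1000))
```

Passing the command as a list rather than a single string avoids shell quoting problems with CSV paths that contain spaces.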

Highlighted Details

Generates native Python code for TensorFlow (tf.keras) and XGBoost, with plans for CatBoost and LightGBM.
Handles messy datasets, including datetime/categorical encoding and column name normalization.
Outputs include model.py, pipeline.py, requirements.txt, serialized encoders (JSON), and detailed metrics.
Supports retraining generated models on new data without code changes.
Allows pausing hyperparameter search with results saved after each trial.
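Saving results after each trial is what makes a search pausable. A minimal sketch of that idea follows, assuming a JSON Lines results file; the file name, row layout, and the functions load_completed and resumable_search are illustrative assumptions and do not reflect automl-gs's own output format.

```python
import json
import tempfile
from pathlib import Path


def load_completed(results_path):
    """Read trials already saved by an earlier, possibly interrupted, run."""
    path = Path(results_path)
    if not path.exists():
        return []
    with path.open() as handle:
        return [json.loads(line) for line in handle if line.strip()]


def resumable_search(param_grid, evaluate, results_path):
    """Evaluate only the parameter sets not yet on disk, saving after each trial."""
    completed = load_completed(results_path)
    done = {json.dumps(row["params"], sort_keys=True) for row in completed}
    with Path(results_path).open("a") as handle:
        for params in param_grid:
            key = json.dumps(params, sort_keys=True)
            if key in done:
                continue  # finished before the pause
            row = {"params": params, "score": evaluate(params)}
            handle.write(json.dumps(row) + "\n")
            handle.flush()  # push the row out before starting the next trial
            completed.append(row)
            done.add(key)
    return max(completed, key=lambda row: row["score"])


calls = []


def toy_evaluate(params):
    """Toy objective standing in for model training; it peaks at 0.1."""
    calls.append(params)
    return 1.0 - abs(params["learning_rate"] - 0.1)


grid = [{"learning_rate": value} for value in (0.01, 0.1, 0.5)]
with tempfile.TemporaryDirectory() as workdir:
    results_file = Path(workdir) / "results.jsonl"
    resumable_search(grid[:2], toy_evaluate, results_file)  # "paused" after two trials
    best = resumable_search(grid, toy_evaluate, results_file)  # only the third trial runs
print(best, len(calls))
```

An append-only file means a pause or crash loses at most the trial that was in progress.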

Maintenance & Community

Maintained by Max Woolf (@minimaxir).
Project supported via Patreon.

Licensing & Compatibility

License: MIT.
Generated code is unlicensed, allowing the owner to choose its license. Compatible with commercial and closed-source use.

Limitations & Caveats

Known issues with Anaconda and Windows environments, and with dataset field names starting with numbers.
Primarily focused on tabular data; not suitable for sequence prediction tasks.
Because hyperparameters are selected using validation set results, the reported validation set performance of the best model may be optimistic.
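The last caveat can be seen in a small simulation. In the sketch below every simulated configuration has the same true accuracy, so any difference between trials is sampling noise; picking the best validation score still reports a number above the true value. All constants and names are assumptions chosen for illustration, and nothing here uses automl-gs itself.

```python
import random

TRUE_ACCURACY = 0.80    # every simulated configuration is equally good
VALIDATION_ROWS = 100   # size of the validation set seen by each trial
NUM_TRIALS = 200        # number of hyperparameter configurations tried


def measured_accuracy(rng, rows):
    """Accuracy observed on a finite sample when the true accuracy is TRUE_ACCURACY."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(rows))
    return correct / rows


rng = random.Random(0)
validation_scores = [measured_accuracy(rng, VALIDATION_ROWS) for _ in range(NUM_TRIALS)]

# Selecting the maximum keeps the luckiest draw, not a better model.
best_validation = max(validation_scores)

# Fresh rows that played no part in the selection give an unbiased estimate.
holdout_score = measured_accuracy(rng, 10_000)

print(f"best validation score: {best_validation:.3f}")
print(f"held-out score:        {holdout_score:.3f}")
```

Keeping a separate test set that the search never sees is one common way to get a realistic estimate for the selected model.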

Health Check

Last Commit

6 years ago

Responsiveness

1 day

Star History

1 star in the last 30 days