automl-gs by minimaxir

AutoML tool for generating ML/DL models and Python code from CSV data

created 6 years ago
1,859 stars

Top 23.8% on sourcepulse

Project Summary

automl-gs is an AutoML tool for tabular data. Given a CSV file and a target field, it produces a trained machine learning or deep learning model together with the native Python code that builds it. It targets citizen data scientists and engineers: the interface requires no code, and the generated data transformation and prediction pipelines hide the details of preprocessing and modeling.

How It Works

automl-gs renders raw Python code from Jinja templates and trains each candidate model in a subprocess, repeating this for different hyperparameter combinations. It infers the data type of each column, applies framework-specific ETL strategies (e.g., datetime encoding, text embeddings/vectorization), and builds the model with the chosen framework, such as TensorFlow/Keras or XGBoost. The best-performing model and its pipeline code are saved, so predictions can be made later without depending on the tool itself.
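The generate-then-train loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the standard library's string.Template stands in for Jinja so the sketch runs without extra dependencies, and the template, the toy scoring formula, and the names SCRIPT_TEMPLATE, run_trial, search, and GRID are all hypothetical.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template for a generated training script. The real tool
# renders Jinja templates; string.Template is a stdlib stand-in.
SCRIPT_TEMPLATE = Template('''\
import json
# Toy "model": the score depends only on the hyperparameters.
learning_rate = $learning_rate
depth = $depth
score = 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(depth - 4)
print(json.dumps({"score": score}))
''')


def run_trial(params, workdir):
    """Render the script for one hyperparameter set and run it in a subprocess."""
    script = Path(workdir) / "model.py"
    script.write_text(SCRIPT_TEMPLATE.substitute(params))
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["score"]


def search(grid):
    """Try every hyperparameter set; keep the best one along with its source code."""
    best = None
    with tempfile.TemporaryDirectory() as workdir:
        for params in grid:
            score = run_trial(params, workdir)
            if best is None or score > best["score"]:
                best = {
                    "params": params,
                    "score": score,
                    "code": SCRIPT_TEMPLATE.substitute(params),
                }
    return best


# Hypothetical hyperparameter combinations to try.
GRID = [
    {"learning_rate": 0.01, "depth": 2},
    {"learning_rate": 0.1, "depth": 4},
    {"learning_rate": 0.3, "depth": 8},
]
```

Calling search(GRID) returns the winning parameters together with the rendered source, which mirrors the idea that the saved artifact is plain Python that no longer needs the generator.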

Quick Start & Requirements

  • Install via pip: pip3 install automl_gs
  • Install an ML/DL framework separately (e.g., tensorflow, xgboost).
  • Run from CLI: automl_gs <csv_path> <target_field>
  • Example: automl_gs titanic.csv Survived --framework xgboost --num_trials 1000
  • Python API: from automl_gs import automl_grid_search; automl_grid_search('titanic.csv', 'Survived')
  • Official examples: Jupyter Notebook
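For scripting several runs, the CLI call shown above can be assembled programmatically. The sketch below only builds the argument list and uses just the positional arguments and the two flags that appear in the example (--framework and --num_trials). The helper name build_command is hypothetical and is not part of automl-gs.

```python
import shlex


def build_command(csv_path, target_field, framework=None, num_trials=None):
    """Build an automl_gs argument list from the options shown in the example."""
    cmd = ["automl_gs", csv_path, target_field]
    if framework is not None:
        cmd += ["--framework", framework]
    if num_trials is not None:
        cmd += ["--num_trials", str(num_trials)]
    return cmd


# Reproduces the example command from the list above.
print(shlex.join(build_command("titanic.csv", "Survived", "xgboost", 1000)))
# prints: automl_gs titanic.csv Survived --framework xgboost --num_trials 1000
```

With automl_gs installed, the returned list could be passed to subprocess.run to start the search.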

Highlighted Details

  • Generates native Python code for TensorFlow (tf.keras) and XGBoost, with plans for CatBoost and LightGBM.
  • Handles messy datasets, including datetime/categorical encoding and column name normalization.
  • Outputs include model.py, pipeline.py, requirements.txt, serialized encoders (JSON), and detailed metrics.
  • Supports retraining generated models on new data without code changes.
  • Allows pausing hyperparameter search with results saved after each trial.
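The last point, saving results after every trial so that a search can be paused, can be illustrated with a minimal sketch. It assumes a CSV results file with one row per finished trial and a placeholder scoring formula instead of real training. The file layout and the names FIELDS, completed_trials, and run_search are hypothetical and do not describe the tool's actual output format.

```python
import csv
import random
from pathlib import Path

FIELDS = ["trial", "learning_rate", "score"]


def completed_trials(results_path):
    """Return the number of trials already recorded in the results file."""
    results_path = Path(results_path)
    if not results_path.exists():
        return 0
    with results_path.open(newline="") as f:
        return sum(1 for _ in csv.DictReader(f))


def run_search(results_path, num_trials, seed=0):
    """Run trials up to num_trials, appending one row per finished trial."""
    results_path = Path(results_path)
    start = completed_trials(results_path)
    needs_header = not results_path.exists() or results_path.stat().st_size == 0
    with results_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        for trial in range(start, num_trials):
            # A per-trial seed keeps a resumed run identical to an uninterrupted one.
            rng = random.Random(seed + trial)
            learning_rate = rng.uniform(0.001, 0.3)
            score = 1.0 - abs(learning_rate - 0.1)  # placeholder for real training
            writer.writerow(
                {"trial": trial, "learning_rate": learning_rate, "score": score}
            )
            f.flush()  # write immediately so an interrupt loses at most the running trial
    return completed_trials(results_path)
```

Running run_search twice on the same file continues from the first unrecorded trial instead of starting over.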

Maintenance & Community

  • Maintained by Max Woolf (@minimaxir).
  • Project supported via Patreon.

Licensing & Compatibility

  • License: MIT.
  • Generated code is unlicensed, allowing the owner to choose its license. Compatible with commercial and closed-source use.

Limitations & Caveats

  • Known issues with Anaconda and Windows environments, and with dataset field names starting with numbers.
  • Primarily focused on tabular data; not suitable for sequence prediction tasks.
  • Because the hyperparameter search selects the model with the best validation-set score, the reported validation performance may be optimistic.
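The validation-set caveat is a general selection effect and can be reproduced with a small simulation. The sketch below is an assumption-laden illustration, not a measurement of automl-gs: every simulated trial has the same true accuracy, so any gap between the best validation score and the test score of that same trial comes purely from picking the luckiest trial. The function name simulate and all parameter values are hypothetical.

```python
import random


def simulate(num_trials, n_val=200, n_test=200, true_accuracy=0.7, seed=0):
    """Pick the trial with the best validation accuracy; report its test accuracy too."""
    rng = random.Random(seed)

    def measured_accuracy(n):
        # Each of n examples is classified correctly with probability true_accuracy.
        return sum(rng.random() < true_accuracy for _ in range(n)) / n

    # Every trial is equally good in truth; the scores differ only by sampling noise.
    trials = [
        (measured_accuracy(n_val), measured_accuracy(n_test))
        for _ in range(num_trials)
    ]
    best_val, test_of_best = max(trials, key=lambda t: t[0])
    return best_val, test_of_best


best_val, test_of_best = simulate(num_trials=500)
print(f"best validation accuracy: {best_val:.3f}")
print(f"test accuracy of that trial: {test_of_best:.3f}")
```

The more trials are run, the further the best validation score tends to drift above the true accuracy, which is why a held-out test set gives a more trustworthy estimate.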

Health Check

  • Last commit: 5 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
