vtreat by WinVector

Dataframe processor for supervised ML

created 11 years ago
285 stars


Project Summary

vtreat is a data frame processor and conditioner designed to prepare real-world data for supervised machine learning. It addresses common data quality issues like missing values, high-cardinality categorical variables, and extreme values, making data suitable for various predictive modeling tasks. The library is available for both R and Python, targeting data scientists and engineers working with messy, real-world datasets.

How It Works

vtreat takes a "y-aware" approach to pre-processing: its transformations are informed by the relationship between each explanatory variable and the outcome variable. It converts an input data frame into an all-numeric, NA-free data frame. Key techniques include indicator variables for categorical levels, impact coding for high-cardinality categorical variables, and replacement of missing values together with an indicator column that records where a value was missing. The derived features keep the predictive information in the original columns while guarding against failure points such as missing values, rare or unseen categorical levels, and extreme values.
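
The two simplest transformations described here, missing-value replacement with an indicator column and indicator variables for categorical levels, can be sketched in plain Python. This is an illustration of the idea only and does not use or reproduce vtreat's API: the function names, the derived column names, and the choice of the observed mean as the fill value are assumptions made for the example.

```python
from statistics import mean


def treat_numeric(values):
    """Replace missing values (None) and report where they were.

    Returns (clean, is_bad): `clean` has no missing values, and `is_bad`
    is a 0/1 column marking the rows that were originally missing.
    Assumption: missing values are filled with the mean of observed values.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else 0.0
    clean = [fill if v is None else float(v) for v in values]
    is_bad = [1 if v is None else 0 for v in values]
    return clean, is_bad


def indicator_columns(values, prefix):
    """Build one 0/1 column per observed level of a categorical variable.

    A missing value gets 0 in every indicator column.
    The "<prefix>_lev_<level>" naming is hypothetical.
    """
    levels = sorted({v for v in values if v is not None})
    return {
        f"{prefix}_lev_{lev}": [1 if v == lev else 0 for v in values]
        for lev in levels
    }


x = [1.0, None, 3.0, 5.0]
c = ["a", "b", None, "a"]

x_clean, x_is_bad = treat_numeric(x)
c_ind = indicator_columns(c, "c")

print(x_clean)   # numeric column with the gap filled
print(x_is_bad)  # 1 where the original value was missing
print(c_ind)     # one indicator column per level
```

Keeping the indicator column next to the filled column lets a downstream model learn from the fact that a value was missing, which a plain fill would discard.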

Quick Start & Requirements

Highlighted Details

  • Handles categorical variables with many levels via impact coding and indicator variables.
  • Addresses missing values by replacing them and adding a binary indicator column.
  • Mitigates issues with rare categorical levels and with novel levels (levels not seen when the treatment was designed) that appear when the treatment is applied to new data.
  • Provides out-of-sample scoring and significance estimates for variable pruning.
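
The points above about impact coding, novel levels, and out-of-sample scoring can be sketched together for a numeric outcome. This is a simplified illustration, not vtreat's implementation: the impact code here is the unsmoothed difference between a level's mean outcome and the overall mean, folds are assigned by row index, and all function names are hypothetical.

```python
from collections import defaultdict


def fit_impact_code(levels, y):
    """Map each level to (mean outcome for that level) - (overall mean outcome)."""
    grand_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, yi in zip(levels, y):
        sums[lev] += yi
        counts[lev] += 1
    return {lev: sums[lev] / counts[lev] - grand_mean for lev in sums}


def apply_impact_code(code, levels):
    """Encode levels; a level not seen during fitting gets 0.0 (no known effect)."""
    return [code.get(lev, 0.0) for lev in levels]


def cross_frame_codes(levels, y, k=2):
    """Encode each row with a code fitted on the other folds only.

    Because no row's own outcome contributes to its encoding, the encoded
    column does not leak y into the training data.
    Assumption: row i belongs to fold i % k.
    """
    out = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        code = fit_impact_code([levels[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(y), k):
            out[i] = code.get(levels[i], 0.0)
    return out


levels = ["a", "a", "b", "b", "c", "a"]
y = [1.0, 3.0, 10.0, 12.0, 5.0, 2.0]

code = fit_impact_code(levels, y)
print(code)
print(apply_impact_code(code, ["z", "a"]))  # "z" is a novel level
print(cross_frame_codes(levels, y, k=2))
```

In this toy data the level "c" occurs only once, so in the cross-frame its row is encoded by a code that never saw "c" and falls back to 0.0. That is the same path a novel level takes when the treatment is applied to new data.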

Maintenance & Community

  • Developed by John Mount and Nina Zumel.
  • Active development with recent improvements including parallel processing and generalized effect size calculations.
  • GitHub repository.

Licensing & Compatibility

  • Distributed under a choice of GPL-2 or GPL-3 license.
  • Requires variable and column names to be "tame names" (valid R variable names).

Limitations & Caveats

vtreat supports only "tame names": variable and column names must be valid R variable names. It automates many preprocessing steps, but it does not replace manual data exploration.
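
A quick pre-flight check for the naming restriction might look like the sketch below. It approximates R's rule for syntactic names (letters, digits, dots, and underscores; starting with a letter, or with a dot not followed by a digit; not a reserved word) using ASCII letters only. The helper names are hypothetical, and this check is not part of vtreat.

```python
import re

# Approximation of R's syntactic-name rule, restricted to ASCII letters.
_R_NAME = re.compile(r"(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*")

# R reserved words, which cannot be used as variable names.
_R_RESERVED = {
    "if", "else", "repeat", "while", "function", "for", "next", "break",
    "in", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA",
    "NA_integer_", "NA_real_", "NA_complex_", "NA_character_", "...",
}


def is_tame(name):
    """Return True if `name` looks like a valid R variable name."""
    return bool(_R_NAME.fullmatch(name)) and name not in _R_RESERVED


def untamed_columns(names):
    """List the column names that would need renaming before treatment."""
    return [n for n in names if not is_tame(n)]


columns = ["age", "zip code", "2nd_visit", "income.usd"]
bad_columns = untamed_columns(columns)
print(bad_columns)
```

Running a check like this before designing a treatment surfaces columns with spaces or leading digits early, when renaming them is still cheap.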

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 90 days
