Upgini is a Python library that automates finding external data features and integrating them into machine learning pipelines. It targets data scientists and ML engineers who want to improve model accuracy using public, community, and premium data sources, including LLM-generated features.
How It Works
Upgini works as a search engine for data. It uses LLMs, GraphNNs, and RNNs to generate and optimize candidate features from hundreds of external data sources, and it keeps the features that improve model accuracy rather than those that are merely correlated with the target variable. It also offers automated search key augmentation, stability checks for accuracy gains, and a scikit-learn compatible interface for integration into existing pipelines.
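To make the difference between accuracy-improving and merely correlated features concrete, here is a minimal sketch in plain Python. It is not Upgini's implementation or API: the toy group-mean model, the function names (fit_group_mean, holdout_mae, select_by_uplift), and the sample data are all assumptions made for illustration. A candidate feature is kept only if adding it lowers the error on held-out rows.

```python
from collections import defaultdict
from statistics import mean


def fit_group_mean(rows, features, target):
    """Toy model: predict the mean target of training rows sharing the same feature values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in features)].append(row[target])
    table = {key: mean(values) for key, values in groups.items()}
    fallback = mean(row[target] for row in rows)  # used for unseen feature combinations
    return lambda row: table.get(tuple(row[f] for f in features), fallback)


def holdout_mae(train, valid, features, target):
    """Mean absolute error on held-out rows for a model fitted on the given features."""
    predict = fit_group_mean(train, features, target)
    return mean(abs(predict(row) - row[target]) for row in valid)


def select_by_uplift(train, valid, base, candidates, target):
    """Keep candidates whose addition lowers held-out error relative to the baseline."""
    baseline = holdout_mae(train, valid, base, target)
    kept = {}
    for candidate in candidates:
        uplift = baseline - holdout_mae(train, valid, base + [candidate], target)
        if uplift > 0:
            kept[candidate] = uplift
    return baseline, kept


def make_rows():
    # Hypothetical data: "region" mirrors "store" (correlated with sales but redundant),
    # while "holiday" carries information the baseline feature lacks.
    rows = []
    for store, region, base_sales in (("A", "north", 10.0), ("B", "south", 30.0)):
        for holiday in (0, 1):
            rows.append({"store": store, "region": region, "holiday": holiday,
                         "sales": base_sales + 10.0 * holiday})
    return rows


train, valid = make_rows() * 2, make_rows()
baseline, kept = select_by_uplift(train, valid, ["store"], ["region", "holiday"], "sales")
print(baseline, kept)
```

In this toy data, "region" tracks sales as closely as "store" does, yet it is dropped because it adds nothing the baseline feature does not already provide. "holiday" is kept because it reduces the held-out error.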
Quick Start & Requirements
- Install via pip (shown as the notebook magic used in Jupyter and Colab; from a shell, drop the leading %):
%pip install upgini
- The README also covers Docker installation and usage.
- Requires Python.
- For full capabilities, including premium data sources and LLM feature generation, registration for a free API key is recommended.
- Official quick-start guides and tutorials are available via Colab notebooks.
Highlighted Details
- Automatically identifies and generates features that improve ML model accuracy.
- Supports automated feature generation using LLMs, GraphNNs, and RNNs.
- Offers automated search key augmentation to broaden data matching.
- Calculates accuracy metrics and uplifts, with stability checks on out-of-time data.
- Provides a scikit-learn compatible interface for easy integration.
- Supports various supervised ML tasks on tabular data, including time series.
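The idea of reporting an accuracy uplift and checking it on out-of-time data can be sketched as follows. This is an illustrative assumption, not Upgini's metric code: the function names (mae, uplift_report), the column names, the sample predictions, and the rule that an uplift counts as stable when it stays positive on both sides of the date cutoff are all hypothetical.

```python
from datetime import date
from statistics import mean


def mae(rows, prediction_key):
    """Mean absolute error of one prediction column against the actual values."""
    return mean(abs(row[prediction_key] - row["actual"]) for row in rows)


def uplift_report(rows, cutoff):
    """Compare baseline and enriched errors before and after a date cutoff."""
    periods = {
        "in_time": [row for row in rows if row["date"] < cutoff],
        "out_of_time": [row for row in rows if row["date"] >= cutoff],
    }
    if not all(periods.values()):
        raise ValueError("cutoff must leave rows on both sides")
    report = {}
    for name, part in periods.items():
        baseline_mae = mae(part, "baseline_pred")
        enriched_mae = mae(part, "enriched_pred")
        report[name] = {
            "baseline_mae": baseline_mae,
            "enriched_mae": enriched_mae,
            "uplift": baseline_mae - enriched_mae,  # positive means the enriched model is better
        }
    # Assumed stability rule: the gain must hold in both periods.
    report["stable"] = all(report[name]["uplift"] > 0 for name in periods)
    return report


# Hypothetical predictions from a baseline model and a model with external features.
rows = [
    {"date": date(2023, 1, 10), "actual": 100, "baseline_pred": 90, "enriched_pred": 98},
    {"date": date(2023, 2, 10), "actual": 120, "baseline_pred": 130, "enriched_pred": 121},
    {"date": date(2023, 7, 10), "actual": 110, "baseline_pred": 100, "enriched_pred": 112},
    {"date": date(2023, 8, 10), "actual": 130, "baseline_pred": 125, "enriched_pred": 124},
]
report = uplift_report(rows, cutoff=date(2023, 6, 1))
print(report)
```

Splitting by date rather than at random is the point of the check: a gain that appears only on rows from the training period may not carry over to later data.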
Maintenance & Community
- The project is in beta and actively seeking community contributions.
- A Slack community is available for support and discussion.
- Users can propose new data sources.
Licensing & Compatibility
- The README does not explicitly state a license.
- Suitability for commercial or closed-source use is not specified.
Limitations & Caveats
- The library is currently in beta, indicating potential instability or incomplete features.
- Some advanced features, like enrichment with phone numbers, emails, and IP addresses, require a registered API key.
- LLM feature generation was time-limited for non-registered users as of the README's last update.