python-bigquery-dataframes  by googleapis

BigQuery-powered Python DataFrames and ML

Created 2 years ago
261 stars

Project Summary

BigQuery DataFrames (BigFrames) offers a Pythonic DataFrame and ML API backed by the BigQuery engine. It targets data scientists and analysts who want to use familiar pandas and scikit-learn interfaces for large-scale data processing and machine learning directly within BigQuery. Existing pandas workloads can be migrated to it, and ML tasks run without moving data out of BigQuery.

How It Works

BigFrames translates DataFrame operations into BigQuery SQL queries, executing them directly on the BigQuery backend. This approach avoids costly data transfers and leverages BigQuery's distributed processing capabilities for performance. For ML, it integrates with BigQuery ML and supports remote functions for custom model execution, abstracting away the complexities of distributed training and inference.
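
The translation idea can be illustrated with a toy sketch. This is not BigFrames code or its internal design; the LazyFrame class, its methods, and the table name are hypothetical, invented only to show how chained DataFrame-style calls can be recorded and compiled into one SQL string instead of being executed locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy deferred frame: it records operations and holds no data."""

    table: str
    columns: tuple = ("*",)
    filters: tuple = ()

    def select(self, *cols):
        # Return a new frame that remembers which columns were requested.
        return LazyFrame(self.table, cols, self.filters)

    def where(self, condition):
        # Return a new frame with one more filter condition appended.
        return LazyFrame(self.table, self.columns, self.filters + (condition,))

    def to_sql(self):
        # Compile the recorded operations into a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM `{self.table}`"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.filters)
        return sql


# Hypothetical table name; nothing is sent to BigQuery here.
frame = (
    LazyFrame("my-project.my_dataset.penguins")
    .where("body_mass_g > 4000")
    .select("species", "body_mass_g")
)
print(frame.to_sql())
```

No data is read until the compiled query is handed to an engine, which is the property that lets the real library keep processing on the BigQuery side.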

Quick Start & Requirements

  • Install: pip install --upgrade bigframes
  • Prerequisites: Google Cloud project with BigQuery API enabled, Application Default Credentials configured locally.
  • Documentation: Introduction, Sample notebooks, API reference
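
After installation, a first query might look like the sketch below. It assumes Application Default Credentials are already configured and that the public penguins table is readable from your project; the function name and the project_id parameter are illustrative, not part of the library.

```python
def mean_body_mass_by_species(project_id: str):
    """Return the mean penguin body mass per species as a local pandas Series."""
    # Imported inside the function so this sketch can be loaded without
    # bigframes installed; a real script would import at the top of the file.
    import bigframes.pandas as bpd

    # Project that is billed for the queries (assumption: your own project ID).
    bpd.options.bigquery.project = project_id

    # Creates a DataFrame backed by the BigQuery table, not a local copy.
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # The aggregation is computed by BigQuery.
    means = df.groupby("species")["body_mass_g"].mean()

    # Download only the small aggregated result.
    return means.to_pandas()
```

Calling mean_body_mass_by_species("your-project-id") runs the query and returns a regular pandas object.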

Highlighted Details

  • Provides a pandas-compatible API (bigframes.pandas) for data manipulation.
  • Offers a scikit-learn-like API (bigframes.ml) for machine learning tasks.
  • Integrates with BigQuery ML and supports custom remote functions.
  • Version 2.0 introduced significant changes for security and performance, including stricter controls on large results and remote function execution.
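
As a rough illustration of the scikit-learn-like API, the sketch below fits a linear regression on the public penguins table. The function name, the project_id parameter, and the choice of feature columns are assumptions made for the example; credentials and a billing project are required to run it.

```python
def train_penguin_regressor(project_id: str):
    """Fit a linear model predicting body mass; return the model and its scores."""
    # Imported lazily so the sketch loads without bigframes installed.
    import bigframes.pandas as bpd
    from bigframes.ml.linear_model import LinearRegression

    bpd.options.bigquery.project = project_id

    # Drop rows with missing values so every feature is populated.
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

    features = df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
    label = df[["body_mass_g"]]

    # fit() and score() follow the familiar scikit-learn calling pattern.
    model = LinearRegression()
    model.fit(features, label)
    return model, model.score(features, label)
```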

Maintenance & Community

  • Developed by Google, with General Availability (GA) status.
  • Feedback channel: bigframes-feedback@google.com.

Licensing & Compatibility

  • Distributed under the Apache-2.0 license.
  • Contains code derived from Ibis, pandas, Python, scikit-learn, and XGBoost.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

Version 2.0 enforces stricter defaults: allow_large_results now defaults to False, and remote function security is tighter. Query results larger than 10 GB, and remote functions that rely on a specific service account, now require explicit configuration. Users migrating from pre-2.0 versions must adapt to these changes or pin to an older version.
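
One way to opt in to large results for a single download is sketched below. The function name and both parameters are hypothetical; the sketch assumes the allow_large_results keyword accepted by to_pandas in 2.x releases, and that the result fits in local memory.

```python
def download_large_result(table_id: str, project_id: str):
    """Download a whole table, explicitly permitting a large query result."""
    # Imported lazily so the sketch loads without bigframes installed.
    import bigframes.pandas as bpd

    bpd.options.bigquery.project = project_id
    df = bpd.read_gbq(table_id)

    # Without this flag, 2.0 defaults reject results above the size limit.
    return df.to_pandas(allow_large_results=True)
```

Keeping the override on the individual call, rather than changing a global default, leaves the stricter behavior in place for every other query.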

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 97
  • Issues (30d): 1
  • Star history: 6 stars in the last 30 days
