DataFrame API for running PySpark code on various database engines
SQLFrame provides a PySpark-compatible DataFrame API for executing data transformations directly on various SQL database engines, eliminating the need for Spark clusters. It targets users who want to leverage their existing database's processing power, run PySpark code locally without Spark overhead, or generate SQL representations of their DataFrame logic for debugging and sharing.
How It Works
SQLFrame translates PySpark DataFrame operations into SQL queries tailored for specific database backends. It supports multiple engines like BigQuery, Databricks, DuckDB, PostgreSQL, Snowflake, and Spark, with Redshift in development. A "Standalone" session can generate SQL without connecting to a database. The library allows customization of SQL dialects for input and output, and can optionally integrate with OpenAI for enhanced SQL generation.
Quick Start & Requirements
Install with `pip install "sqlframe[<engine>]"` (e.g., `sqlframe[bigquery]`, `sqlframe[duckdb]`) or `conda install -c conda-forge sqlframe`. Specific engine documentation may have additional setup instructions.
Maintenance & Community
No specific community links or contributor information are provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The Redshift engine is still in development, with limited test coverage and documentation. The absence of a stated license in the README may impact commercial adoption.