English SDK for Apache Spark simplifies PySpark DataFrame manipulation
This project provides an English SDK for Apache Spark, enabling users to interact with Spark DataFrames using natural language instructions. It aims to simplify Spark development for data analysts and engineers by translating English commands into PySpark code, thereby increasing accessibility and productivity.
How It Works
The SDK relies on large language models (LLMs) and recommends GPT-4, accessed through the OpenAI API, for best results. Users initialize a SparkAI instance, optionally configuring it with a custom LLM or with a vector store that improves query accuracy through similarity search. The core functionality is exposed through the .ai accessor on Spark DataFrames, which accepts English prompts for transformations and plotting.
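The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: the sample rows, column names, the has_openai_key helper, and the run_demo function are all hypothetical, and the sketch assumes that SparkAI() with no arguments picks up the default OpenAI-backed model and that activate() is what attaches the .ai accessor.

```python
import os

# Hypothetical sample data, invented for illustration only.
COLUMNS = ["product", "category", "units_sold"]
SAMPLE_ROWS = [
    ("Hammer", "Tools", 120),
    ("Wrench", "Tools", 95),
    ("Kettle", "Kitchen", 210),
    ("Toaster", "Kitchen", 180),
]


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY exists in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))


def run_demo():
    """Build a small DataFrame and request a transformation in English."""
    if not has_openai_key():
        raise RuntimeError("Set OPENAI_API_KEY before running the demo.")

    # Imported lazily so the helper above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.getOrCreate()
    spark_ai = SparkAI()   # assumption: default LLM configuration
    spark_ai.activate()    # enables the .ai accessor on DataFrames

    df = spark.createDataFrame(SAMPLE_ROWS, COLUMNS)
    # The LLM generates the PySpark code, so results can vary between runs.
    best = df.ai.transform("the best-selling product in each category")
    best.show()
    return best
```

Calling run_demo() requires pyspark, pyspark-ai, and a valid API key. Because the generated code depends on the model and the prompt wording, it is worth inspecting the returned DataFrame before relying on it.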
Quick Start & Requirements
Install the base package: pip install pyspark-ai
Install with optional extras: pip install "pyspark-ai[plot]" or pip install "pyspark-ai[all]"
Set the OPENAI_API_KEY environment variable.
Highlighted Details
Maintenance & Community
The project is actively seeking contributions and is in the early stages of development, focusing on enhancing test cases and CI/CD pipelines. For questions and contributions, refer to the GitHub repository and open an issue.
Licensing & Compatibility
Licensed under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source applications.
Limitations & Caveats
The effectiveness and accuracy of transformations are dependent on the underlying LLM's capabilities and the clarity of the English prompts. While GPT-4 is recommended, performance may vary with other models. The project is in early development, suggesting potential for breaking changes and evolving features.