pyspark-ai by pyspark-ai

English SDK for Apache Spark simplifies PySpark DataFrame manipulation

Created 2 years ago

878 stars

Top 41.0% on SourcePulse

2 Experts Love This Project

Amanpreet Singh

Cofounder of Contextual AI

Reynold Xin

Cofounder of Databricks

Project Summary

This project provides an English SDK for Apache Spark, enabling users to interact with Spark DataFrames using natural language instructions. It aims to simplify Spark development for data analysts and engineers by translating English commands into PySpark code, thereby increasing accessibility and productivity.

How It Works

The SDK uses large language models (LLMs) to generate PySpark code; the project recommends GPT-4, accessed through the OpenAI API, for best results. Users initialize a SparkAI instance and can optionally configure it with a custom LLM or with a vector store, which uses similarity search to improve query accuracy. The core functionality is exposed through the .ai accessor on Spark DataFrames, which accepts English prompts for transformations and plotting.
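The flow described above can be sketched roughly as follows. This is a minimal illustration, assuming that the SparkAI class, its activate() method, and the df.ai.transform / df.ai.plot calls behave as shown in the project's README; the demo function name and the sample data are hypothetical and not taken from the project. Running it requires pyspark, pyspark-ai with the plot extra, and a valid OPENAI_API_KEY.

```python
def demo():
    """Hypothetical walkthrough; needs pyspark, pyspark-ai[plot], and OPENAI_API_KEY."""
    # Imports live inside the function so this file can be loaded
    # even where the packages are not installed.
    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.appName("pyspark-ai-demo").getOrCreate()

    # Default LLM backend (the project recommends GPT-4).
    spark_ai = SparkAI()
    # activate() attaches the .ai accessor to Spark DataFrames.
    spark_ai.activate()

    # Illustrative data only, invented for this sketch.
    df = spark.createDataFrame(
        [("Toyota", 1849751), ("Ford", 1767439), ("Honda", 881201)],
        ["brand", "us_sales"],
    )

    # English prompt -> generated PySpark transformation -> new DataFrame.
    top = df.ai.transform("the brand with the highest US sales")
    top.show()

    # English prompt -> generated plotting code (needs the plot extra).
    df.ai.plot("bar chart of US sales by brand")
```

Calling demo() sends the prompts to the configured LLM, so the generated code, and therefore the result, can differ between runs and models.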

Quick Start & Requirements

Install via pip: pip install pyspark-ai
Optional dependencies: pip install "pyspark-ai[plot]" for plotting, or pip install "pyspark-ai[all]" for all extras.
Requires an OpenAI API key set as the OPENAI_API_KEY environment variable.
Documentation: Blog Post, Demo Video, Breakout Session
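Before a first run, the requirements listed above can be verified with a small preflight check. The sketch below uses only the Python standard library; check_prerequisites is a hypothetical helper written for this summary, not part of pyspark-ai. It assumes the package is importable as pyspark_ai.

```python
import importlib.util
import os


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # The pip package pyspark-ai is imported as the module pyspark_ai.
    if importlib.util.find_spec("pyspark_ai") is None:
        problems.append("pyspark-ai is not installed (pip install pyspark-ai)")

    # The default OpenAI backend reads the key from this variable.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

Passing a dictionary as env makes the check easy to exercise without touching the real environment.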

Highlighted Details

Translates English prompts into PySpark DataFrame transformations.
Supports generating plots directly from DataFrames using natural language.
Vector similarity search can be enabled to improve prompt accuracy.
Integrates with LangChain for flexible LLM backend configuration (e.g., Azure OpenAI).
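As a rough illustration of the last two points, a custom LangChain chat model and a vector store directory can be passed when constructing SparkAI. The sketch assumes the llm and vector_store_dir constructor parameters shown in the project's README, and the langchain.chat_models import path used by older LangChain releases; newer releases may expose the same class elsewhere. make_spark_ai is a hypothetical helper name.

```python
def make_spark_ai(vector_store_dir=None):
    """Hypothetical helper: build and activate a SparkAI with a custom LLM."""
    # Lazy imports so the file loads without the packages installed.
    # Import path is an assumption tied to older LangChain versions.
    from langchain.chat_models import ChatOpenAI
    from pyspark_ai import SparkAI

    # Any LangChain chat model could be swapped in here,
    # for example AzureChatOpenAI for an Azure OpenAI deployment.
    llm = ChatOpenAI(model_name="gpt-4", temperature=0)

    kwargs = {"llm": llm}
    if vector_store_dir is not None:
        # Enables similarity search over stored examples (assumed parameter).
        kwargs["vector_store_dir"] = vector_store_dir

    spark_ai = SparkAI(**kwargs)
    spark_ai.activate()
    return spark_ai
```

A call such as make_spark_ai("vector_store/") would then return an activated instance, after which DataFrames expose the .ai accessor.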

Maintenance & Community

The project is in the early stages of development and is actively seeking contributions, with a current focus on improving test cases and CI/CD pipelines. To ask a question or contribute, open an issue on the GitHub repository.

Licensing & Compatibility

Licensed under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source applications.

Limitations & Caveats

The accuracy of generated transformations depends on the capabilities of the underlying LLM and on how clearly the English prompt is phrased. GPT-4 is recommended; results may vary with other models. Because the project is in early development, breaking changes and evolving features should be expected.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

1 star in the last 30 days