sparklyr by sparklyr

R interface for Apache Spark

Created 9 years ago

970 stars

Top 38.0% on SourcePulse

2 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Wes McKinney

Author of Pandas

Project Summary

sparklyr provides an R interface to Apache Spark, letting R users apply Spark's distributed computing to data manipulation, analysis, and machine learning. Users connect to a local or remote Spark cluster and work with Spark DataFrames through familiar R syntax, particularly dplyr verbs.

How It Works

sparklyr translates R code, primarily dplyr operations and SQL queries, into Spark operations. It establishes a connection to a Spark cluster and then sends commands to Spark to execute these operations in a distributed manner. Results can be brought back into R for further analysis or visualization. The package also facilitates distributed R code execution via spark_apply and integrates with Spark's MLlib for machine learning workflows.
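The translate, execute, and collect flow described above can be sketched as follows. This is a minimal illustration, assuming a local Spark installation is already available; the table name `mtcars_spark` and the variable names are arbitrary choices for the example, not anything the project prescribes.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data set to Spark; "mtcars_spark" is an arbitrary name.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Build a lazy query with dplyr verbs; nothing is computed on Spark yet.
mpg_by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Print the Spark SQL that the dplyr pipeline is translated into.
show_query(mpg_by_cyl)

# Run the query on Spark and bring the small result back as a local data frame.
local_result <- collect(mpg_by_cyl)
print(local_result)

spark_disconnect(sc)
```

Only the aggregated result crosses from Spark into the R session, which is the usual reason to call `collect()` as late as possible in a pipeline.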

Quick Start & Requirements

Install from CRAN: install.packages("sparklyr")
Install a local Spark version: library(sparklyr); spark_install()
Connect to Spark: library(sparklyr); sc <- spark_connect(master = "local")
For development, install from GitHub: devtools::install_github("sparklyr/sparklyr")
Requires R and a compatible Spark installation.
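Put together, the steps above might look like the short script below. It is a sketch under the assumption that sparklyr is already installed from CRAN; the check against `spark_installed_versions()` is one possible way to avoid downloading Spark again on every run.

```r
library(sparklyr)

# Install a local Spark build only if none has been installed yet.
if (nrow(spark_installed_versions()) == 0) {
  spark_install()
}

# Connect to a local Spark instance.
sc <- spark_connect(master = "local")

# Confirm the connection and report which Spark version is in use.
was_open <- connection_is_open(sc)
spark_ver <- spark_version(sc)
print(was_open)
print(spark_ver)

# Release the connection when finished.
spark_disconnect(sc)
```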

Highlighted Details

Seamless integration with dplyr for expressive data manipulation on Spark.
Supports direct SQL query execution via a DBI interface.
Orchestrates Spark MLlib algorithms for machine learning pipelines.
Enables distributed R code execution using spark_apply.
Offers utilities for caching tables, accessing Spark logs, and the Spark web console.
Integrates with RStudio for a streamlined user experience.
Supports connections via Livy and Databricks Connect v2.
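Several of the features listed above can be tried in one short session. The following is an illustrative sketch, assuming a local Spark installation and the DBI package; the table name `mtcars_spark`, the model formula, and the `kpl` column are made up for the example.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Run a SQL query directly through the DBI interface.
cyl_counts <- dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS n FROM mtcars_spark GROUP BY cyl ORDER BY cyl"
)
print(cyl_counts)

# Cache the table in Spark memory for repeated use.
tbl_cache(sc, "mtcars_spark")

# Fit a Spark MLlib linear regression using an R formula.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
print(fit)

# Run an arbitrary R function on each partition with spark_apply.
# The function receives and returns an ordinary R data frame.
scaled <- spark_apply(mtcars_tbl, function(df) {
  df$kpl <- df$mpg * 0.425  # approximate miles per gallon to km per litre
  df
})
scaled_local <- scaled %>% head(5) %>% collect()
print(scaled_local)

# Show the last few lines of the Spark log.
print(spark_log(sc, n = 10))

# Open the Spark web console only in an interactive session.
if (interactive()) spark_web(sc)

spark_disconnect(sc)
```

`spark_apply` needs R to be available on the worker nodes, so on a real cluster it typically requires more setup than the dplyr and SQL paths.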

Maintenance & Community

The project is actively maintained by a core team and community contributors. Documentation and further community resources are available through the project's website.

Licensing & Compatibility

sparklyr is licensed under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source applications.

Limitations & Caveats

While sparklyr aims to provide a comprehensive R interface, performance can vary depending on the complexity of operations and the underlying Spark cluster configuration. Some advanced Spark features or very specific MLlib algorithms might require direct Scala/Java interaction or alternative libraries.

Health Check

Last Commit

5 days ago

Responsiveness

1 day

Star History

1 star in the last 30 days