sparklyr  by sparklyr

R interface for Apache Spark

Created 9 years ago
966 stars

Top 38.1% on SourcePulse

Project Summary

sparklyr provides an R interface for Apache Spark, enabling R users to leverage Spark's distributed computing capabilities for data manipulation, analysis, and machine learning. It allows users to connect to local or remote Spark clusters and utilize familiar R syntax, particularly dplyr verbs, to interact with Spark DataFrames.

How It Works

sparklyr translates R code, primarily dplyr operations and SQL queries, into Spark operations. It establishes a connection to a Spark cluster and then sends commands to Spark to execute these operations in a distributed manner. Results can be brought back into R for further analysis or visualization. The package also facilitates distributed R code execution via spark_apply and integrates with Spark's MLlib for machine learning workflows.
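
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's own example: it assumes a local Spark installation is available, and the mtcars data, the table name mtcars_spark, and the particular pipeline are chosen only for demonstration.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is already installed,
# for example via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark; "mtcars_spark" is an illustrative name.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Build a lazy dplyr pipeline; nothing is executed on Spark at this point.
summary_query <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n())

# Inspect the SQL that the pipeline is translated into.
show_query(summary_query)

# Run the query on Spark and bring the result back as a local data frame.
summary_df <- collect(summary_query)
print(summary_df)

spark_disconnect(sc)
```

The pipeline stays lazy until collect() is called, so only the small aggregated result travels back to the R session.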

Quick Start & Requirements

  • Install from CRAN: install.packages("sparklyr")
  • Install a local Spark version: library(sparklyr); spark_install()
  • Connect to Spark: library(sparklyr); sc <- spark_connect(master = "local")
  • For development, install from GitHub: devtools::install_github("sparklyr/sparklyr")
  • Requires R and a compatible Spark installation.
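
Taken together, the steps above might look like the following in a fresh R session. This is a sketch under assumptions: the guards that skip installation when sparklyr or Spark is already present are an illustrative addition, and a Java runtime (which Spark itself needs) is assumed to be installed.

```r
# One-time setup: install the package from CRAN only if it is missing.
if (!requireNamespace("sparklyr", quietly = TRUE)) {
  install.packages("sparklyr")
}

library(sparklyr)

# Install a local Spark distribution only if none is present yet.
if (nrow(spark_installed_versions()) == 0) {
  spark_install()
}

# Connect to the local Spark instance.
sc <- spark_connect(master = "local")

# Confirm the connection by asking which Spark version it is running.
connected_version <- spark_version(sc)
print(connected_version)

# Release the connection when finished.
spark_disconnect(sc)
```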

Highlighted Details

  • Seamless integration with dplyr for expressive data manipulation on Spark.
  • Supports direct SQL query execution via a DBI interface.
  • Orchestrates Spark MLlib algorithms for machine learning pipelines.
  • Enables distributed R code execution using spark_apply.
  • Offers utilities for caching tables, reading Spark logs, and opening the Spark web console.
  • Integrates with RStudio for a streamlined user experience.
  • Supports connections via Livy and Databricks Connect v2.
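
Several of the features listed above can be tried in one short session. The following is a hedged sketch rather than a reference workflow: it assumes a local Spark installation, uses the built-in mtcars data under the illustrative table name mtcars_spark, and the model formula is arbitrary.

```r
library(sparklyr)
library(dplyr)
library(DBI)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# SQL through the DBI interface: the Spark connection is used as a DBI connection.
sql_result <- dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS n FROM mtcars_spark GROUP BY cyl"
)

# MLlib: fit a linear regression on the Spark table using an R formula.
fit <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
print(fit)

# Distributed R: run an R function over each partition of the table.
# spark_apply needs R to be available on the worker nodes.
row_counts <- cars_tbl %>%
  spark_apply(function(df) data.frame(rows = nrow(df))) %>%
  collect()

# Utilities: cache the table in memory and read the last lines of the Spark log.
tbl_cache(sc, "mtcars_spark")
spark_log(sc, n = 10)
# spark_web(sc) would open the Spark web console in a browser.

spark_disconnect(sc)
```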

Maintenance & Community

The project is actively maintained by a core team and community contributors. Community resources are available through the project website.

Licensing & Compatibility

sparklyr is licensed under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source applications.

Limitations & Caveats

sparklyr aims to provide a comprehensive R interface, but performance depends on the complexity of the operations and on the configuration of the underlying Spark cluster. Some advanced Spark features and less common MLlib algorithms may require calling Scala/Java directly or using alternative libraries.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 1 star in the last 30 days
