sparklyr by sparklyr

R interface for Apache Spark

created 9 years ago
966 stars

Top 38.9% on sourcepulse

Project Summary

sparklyr provides an R interface for Apache Spark, enabling R users to leverage Spark's distributed computing capabilities for data manipulation, analysis, and machine learning. It allows users to connect to local or remote Spark clusters and utilize familiar R syntax, particularly dplyr verbs, to interact with Spark DataFrames.

How It Works

sparklyr translates R code, primarily dplyr operations and SQL queries, into Spark operations. It establishes a connection to a Spark cluster and then sends commands to Spark to execute these operations in a distributed manner. Results can be brought back into R for further analysis or visualization. The package also facilitates distributed R code execution via spark_apply and integrates with Spark's MLlib for machine learning workflows.
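The flow described above (connect, run translated dplyr verbs on Spark, bring results back into R) can be sketched as follows. This is a minimal illustration rather than the project's canonical example: it assumes a local Spark installation is already available, and the table name mtcars_spark and the chosen columns are arbitrary.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark.
# "mtcars_spark" is an illustrative table name, not one required by sparklyr.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# These dplyr verbs are not run in R; they build a lazy query
# that sparklyr translates to Spark SQL.
mpg_by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Print the SQL that will be sent to Spark.
show_query(mpg_by_cyl)

# collect() executes the query on Spark and returns a local R data frame.
local_result <- collect(mpg_by_cyl)
print(local_result)

spark_disconnect(sc)
```

Because the pipeline is lazy, nothing is computed until collect() (or printing) forces execution, so only the aggregated rows travel back to the R session.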

Quick Start & Requirements

  • Install from CRAN: install.packages("sparklyr")
  • Install a local Spark version: library(sparklyr); spark_install()
  • Connect to Spark: library(sparklyr); sc <- spark_connect(master = "local")
  • For development, install from GitHub: devtools::install_github("sparklyr/sparklyr")
  • Requires R and a compatible Spark installation.

Highlighted Details

  • Seamless integration with dplyr for expressive data manipulation on Spark.
  • Supports direct SQL query execution via a DBI interface.
  • Orchestrates Spark MLlib algorithms for machine learning pipelines.
  • Enables distributed R code execution using spark_apply.
  • Offers utilities for caching tables, accessing Spark logs, and the Spark web console.
  • Integrates with RStudio for a streamlined user experience.
  • Supports connections via Livy and Databricks Connect v2.
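
Several of the features listed above can be sketched in one short session. The example below is hedged: it assumes a local Spark installation, reuses the illustrative table name mtcars_spark, and the model formula and the wt_kg column are made up for demonstration only.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Direct SQL through the DBI interface; the result is a local data frame.
heavy_cars <- dbGetQuery(
  sc,
  "SELECT COUNT(*) AS n FROM mtcars_spark WHERE wt > 3.5"
)

# Fit a Spark MLlib linear regression and score the same table.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
predictions <- ml_predict(fit, mtcars_tbl) %>%
  select(mpg, prediction) %>%
  collect()

# Run an arbitrary R function on each partition with spark_apply.
# mtcars stores wt in units of 1000 lb, so this adds a weight in kilograms.
with_kg <- spark_apply(mtcars_tbl, function(df) {
  df$wt_kg <- df$wt * 453.592
  df
}) %>%
  collect()

# Utilities: cache a table in memory and read the tail of the Spark log.
tbl_cache(sc, "mtcars_spark")
recent_log <- spark_log(sc, n = 5)

# Open the Spark web console only in an interactive session.
if (interactive()) spark_web(sc)

spark_disconnect(sc)
```

Note that spark_apply needs an R runtime on every worker node, which is trivially true for a local connection but has to be arranged on a real cluster.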

Maintenance & Community

The project is actively maintained by a core team and community contributors. Community resources are available through the project website.

Licensing & Compatibility

sparklyr is licensed under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source applications.

Limitations & Caveats

Performance depends on the complexity of the operations and on the configuration of the underlying Spark cluster. Some advanced Spark features and less common MLlib algorithms may not be exposed through sparklyr and might require direct Scala/Java interaction or alternative libraries.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 4
  • Star History: 5 stars in the last 90 days
