SynapseML by microsoft

Distributed ML library for large-scale data processing

Created 8 years ago

5,200 stars

Top 9.5% on SourcePulse

5 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

and 1 more!

Project Summary

SynapseML (formerly MMLSpark) is an open-source library for building massively scalable machine learning pipelines. It provides simple, composable, distributed APIs for ML tasks such as text analytics, computer vision, and anomaly detection. Built on Apache Spark, it follows the SparkML/MLlib API, so it integrates into existing Spark workflows, and it supports Python, R, Scala, Java, and .NET.

How It Works

SynapseML builds on the Apache Spark distributed computing framework and exposes APIs consistent with SparkML/MLlib. Because those APIs abstract over data sources and compute environments, the same pipeline code can train and evaluate models on single-node, multi-node, or elastically resizable clusters, so cluster size can track the workload rather than being provisioned for the peak. Complex pipelines are composed from simpler stages.
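The composition idea can be illustrated without a cluster. The following is a toy, Spark-free sketch of the estimator/transformer contract that SparkML pipelines follow (in PySpark the corresponding classes live in `pyspark.ml`). Every class name below is hypothetical and written for illustration only; none of it is SynapseML's or Spark's actual implementation, and plain Python lists of dicts stand in for distributed DataFrames.

```python
class Tokenize:
    """Transformer: adds a 'tokens' column derived from 'text'."""
    def transform(self, rows):
        return [{**r, "tokens": r["text"].lower().split()} for r in rows]


class CountTokens:
    """Transformer: adds an 'n_tokens' column derived from 'tokens'."""
    def transform(self, rows):
        return [{**r, "n_tokens": len(r["tokens"])} for r in rows]


class FlagLong:
    """Fitted model: flags rows whose token count exceeds a learned threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "is_long": r["n_tokens"] > self.threshold} for r in rows]


class MeanLengthThreshold:
    """Estimator: fit() learns the mean token count and returns a transformer."""
    def fit(self, rows):
        mean = sum(r["n_tokens"] for r in rows) / len(rows)
        return FlagLong(mean)


class PipelineModel:
    """A fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it to get a transformer
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the result to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


rows = [
    {"text": "Spark scales out"},
    {"text": "SynapseML builds on Apache Spark pipelines"},
]
model = Pipeline([Tokenize(), CountTokens(), MeanLengthThreshold()]).fit(rows)
scored = model.transform(rows)
print([r["is_long"] for r in scored])
```

The point of the pattern is that each stage only needs `transform` (or `fit` returning something with `transform`), so stages can be swapped or reordered without changing the pipeline code around them.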

Quick Start & Requirements

SynapseML requires Scala 2.12, Spark 3.4+, and Python 3.8+. Installation varies by platform:

Microsoft Fabric: Pre-installed.
Synapse Analytics/Databricks: Use %%configure magic or Maven coordinates (com.microsoft.azure:synapseml_2.12:1.0.14) with resolver https://mmlspark.azureedge.net/maven.
Python Standalone: Install pyspark and configure SparkSession with spark.jars.packages.
Spark Submit: Use the --packages option.
Docker: Run docker run -it -p 8888:8888 -e ACCEPT_EULA=yes mcr.microsoft.com/mmlspark/release jupyter notebook.

Detailed documentation and examples are available on the SynapseML website.
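For the Python standalone and spark-submit routes, the setup might look like the sketch below. The Maven coordinate and resolver URL are the ones quoted in the list above; `spark.jars.packages` and `spark.jars.repositories` are standard Spark configuration keys, and `--packages` and `--repositories` are standard spark-submit options. The helper names (`maven_coordinate`, `spark_conf`, `spark_submit_argv`, `build_session`) and the app name are hypothetical, and `build_session` assumes pyspark is already installed.

```python
from typing import Dict, List

SYNAPSEML_VERSION = "1.0.14"  # version quoted in this summary; newer ones may exist
SCALA_BINARY = "2.12"
MAVEN_RESOLVER = "https://mmlspark.azureedge.net/maven"


def maven_coordinate(version: str = SYNAPSEML_VERSION,
                     scala: str = SCALA_BINARY) -> str:
    """Build the group:artifact:version string that Spark resolves."""
    return f"com.microsoft.azure:synapseml_{scala}:{version}"


def spark_conf() -> Dict[str, str]:
    """Spark settings that pull SynapseML from the resolver at startup."""
    return {
        "spark.jars.packages": maven_coordinate(),
        "spark.jars.repositories": MAVEN_RESOLVER,
    }


def spark_submit_argv(app_path: str) -> List[str]:
    """Argument list for launching an application through spark-submit."""
    return [
        "spark-submit",
        "--packages", maven_coordinate(),
        "--repositories", MAVEN_RESOLVER,
        app_path,
    ]


def build_session(app_name: str = "synapseml-demo"):
    """Create a SparkSession with SynapseML on the classpath."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(" ".join(spark_submit_argv("my_app.py")))
```

After `build_session()` returns, the Python bindings are importable as `synapse.ml`. The first start can be slow because Spark downloads the package and its dependencies.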

Highlighted Details

Integrations include Vowpal Wabbit, Microsoft Cognitive Services, LightGBM, and ONNX for distributed ML tasks.
Features Responsible AI tools for model interpretability and bias detection.
Offers Spark Serving for low-latency web service deployment of Spark computations.
Includes Spark Binding Autogeneration for PySpark and SparklyR.
Provides distributed implementations for anomaly detection (Isolation Forest) and scalable KNN models.

Maintenance & Community

The project adheres to the Microsoft Open Source Code of Conduct, with contribution guidelines detailed in CONTRIBUTING.md. Feedback and issue reporting are managed via GitHub Issues. Related projects include Vowpal Wabbit, LightGBM, and Microsoft Cognitive Toolkit.

Licensing & Compatibility

The license is not stated in this summary; check the repository for the license terms. SynapseML is designed for broad compatibility: it abstracts over databases, file systems, and cloud data stores, and supports multiple programming languages.

Limitations & Caveats

R support is still under development, and some custom wrappers may be missing. Users must ensure compatibility with the specified Spark and Scala versions. Using the Docker image requires accepting an End-User License Agreement (EULA).

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days