SynapseML  by microsoft

Distributed ML library for large-scale data processing

Created 8 years ago
5,174 stars

Top 9.6% on SourcePulse

Project Summary

SynapseML (formerly MMLSpark) is an open-source library for building massively scalable machine learning pipelines. It provides simple, composable, distributed APIs for a wide range of ML tasks, including text analytics, computer vision, and anomaly detection. Built on Apache Spark, it follows the same API as SparkML/MLlib, so it integrates into existing Spark workflows and can be used from Python, R, Scala, Java, and .NET.

How It Works

SynapseML runs on the Apache Spark distributed computing framework and exposes APIs consistent with SparkML/MLlib. This design lets users build scalable ML systems that abstract over data sources and compute environments. Because its components are composable, complex pipelines can be assembled from simpler stages, and the same pipeline can be trained and evaluated on single-node, multi-node, or elastically resizable clusters.
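As a rough illustration of that composability, the sketch below chains a standard SparkML feature stage with SynapseML's LightGBM estimator (one of the integrations listed under Highlighted Details) inside an ordinary SparkML Pipeline. The function name build_pipeline, the column names, and the default label column are assumptions made for this example, not part of the project's documentation, and the function needs an active Spark session with SynapseML on the classpath when it is called.

```python
def build_pipeline(feature_cols, label_col="label"):
    """Return an unfitted SparkML Pipeline: feature assembly, then LightGBM.

    Hypothetical helper for illustration only. Imports are deferred so the
    module can be loaded on a machine without Spark installed; calling the
    function requires pyspark, synapseml, and an active SparkSession.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from synapse.ml.lightgbm import LightGBMClassifier

    # Standard SparkML stage: pack the input columns into one vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")

    # SynapseML stage: behaves like any other SparkML estimator.
    classifier = LightGBMClassifier(featuresCol="features", labelCol=label_col)

    # Both stages compose through the usual SparkML Pipeline API.
    return Pipeline(stages=[assembler, classifier])
```

With a DataFrame train_df that has the named columns, build_pipeline(["f1", "f2"]).fit(train_df) would return a fitted PipelineModel, in the same way as a pipeline made only of SparkML stages.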

Quick Start & Requirements

SynapseML requires Scala 2.12, Spark 3.4+, and Python 3.8+. Installation varies by platform:

  • Microsoft Fabric: Pre-installed.
  • Synapse Analytics/Databricks: Use %%configure magic or Maven coordinates (com.microsoft.azure:synapseml_2.12:1.0.14) with resolver https://mmlspark.azureedge.net/maven.
  • Python Standalone: Install pyspark and configure SparkSession with spark.jars.packages.
  • Spark Submit: Use the --packages option.
  • Docker: Run docker run -it -p 8888:8888 -e ACCEPT_EULA=yes mcr.microsoft.com/mmlspark/release jupyter notebook.

Detailed documentation and examples are available on the SynapseML website.
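For the Python standalone option, a minimal sketch of the SparkSession configuration might look like the following. It reuses the Maven coordinates and resolver URL given above; the helper names synapseml_conf and create_session and the application name are hypothetical, chosen only for this example.

```python
SYNAPSEML_VERSION = "1.0.14"   # version from the Maven coordinates above
SCALA_VERSION = "2.12"         # Scala build that SynapseML requires
RESOLVER = "https://mmlspark.azureedge.net/maven"


def synapseml_conf(version=SYNAPSEML_VERSION, scala=SCALA_VERSION):
    """Return the Spark settings that pull in the SynapseML package."""
    return {
        "spark.jars.packages": f"com.microsoft.azure:synapseml_{scala}:{version}",
        "spark.jars.repositories": RESOLVER,
    }


def create_session(app_name="SynapseMLExample"):
    """Create (or reuse) a SparkSession with SynapseML on the classpath.

    Hypothetical helper; pyspark is imported lazily so that synapseml_conf
    can be inspected without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in synapseml_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same two settings correspond to the --packages and --repositories options when launching through spark-submit.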

Highlighted Details

  • Integrations include Vowpal Wabbit, Microsoft Cognitive Services, LightGBM, and ONNX for distributed ML tasks.
  • Features Responsible AI tools for model interpretability and bias detection.
  • Offers Spark Serving for low-latency web service deployment of Spark computations.
  • Includes Spark Binding Autogeneration for PySpark and SparklyR.
  • Provides distributed implementations for anomaly detection (Isolation Forest) and scalable KNN models.

Maintenance & Community

The project adheres to the Microsoft Open Source Code of Conduct, with contribution guidelines detailed in CONTRIBUTING.md. Feedback and issue reporting are managed via GitHub Issues. Related projects include Vowpal Wabbit, LightGBM, and Microsoft Cognitive Toolkit.

Licensing & Compatibility

The license is not stated in this summary. SynapseML is designed for broad compatibility: it abstracts over diverse databases, file systems, and cloud data stores, and supports multiple programming languages.

Limitations & Caveats

R support is still under development, and some custom wrappers may be missing. Users must ensure compatibility with the required Spark and Scala versions. Using the Docker image requires accepting an End-User License Agreement (EULA).

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 15
  • Issues (30d): 2
  • Star history: 9 stars in the last 30 days
