microsoft/SynapseML
Distributed ML library for large-scale data processing
Top 9.6% on SourcePulse
SynapseML (formerly MMLSpark) is an open-source library for building massively scalable machine learning pipelines. It provides simple, composable, distributed APIs for a wide range of ML tasks, including text analytics, computer vision, and anomaly detection. It is built on Apache Spark and follows the same API conventions as SparkML/MLlib, so it fits into existing Spark workflows and can be used from Python, R, Scala, Java, and .NET.
How It Works
SynapseML runs on the Apache Spark distributed computing framework and exposes APIs consistent with SparkML/MLlib. This lets users build scalable ML systems that abstract over different data sources and compute environments. Because the components are composable, complex pipelines can be assembled from simpler stages, and the same pipeline can be trained and evaluated on single-node, multi-node, or elastically resizable clusters.
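The composition described above can be sketched in Python as follows. This is a minimal illustration, not the project's reference usage: the function name `build_pipeline`, the column names, and the choice of `LightGBMClassifier` as the final stage are assumptions made for the example. It also assumes that `pyspark` and SynapseML are installed and that a SparkSession is already active when the function is called.

```python
def build_pipeline(feature_cols, label_col="label"):
    """Compose a SparkML Pipeline whose last stage is a SynapseML estimator.

    Hypothetical helper for illustration only. Call it after a SparkSession
    has been created, because the stages are backed by JVM objects.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from synapse.ml.lightgbm import LightGBMClassifier

    # Stage 1 (plain SparkML): pack the raw columns into one vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")

    # Stage 2 (SynapseML): a distributed gradient-boosted classifier that
    # takes the same featuresCol/labelCol parameters as SparkML estimators.
    classifier = LightGBMClassifier(
        objective="binary", featuresCol="features", labelCol=label_col
    )

    return Pipeline(stages=[assembler, classifier])
```

Given a training DataFrame `train_df` and a test DataFrame `test_df` (both hypothetical), `build_pipeline(["f1", "f2"]).fit(train_df)` would return a fitted model whose `transform(test_df)` appends prediction columns, just as a pure SparkML pipeline would.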
Quick Start & Requirements
SynapseML requires Scala 2.12, Spark 3.4+, and Python 3.8+. Installation varies by platform:
In a notebook environment, use the %%configure magic, or add the Maven coordinates (com.microsoft.azure:synapseml_2.12:1.0.14) with the resolver https://mmlspark.azureedge.net/maven.
With standalone Python, install pyspark and configure the SparkSession with spark.jars.packages.
With spark-submit or spark-shell, use the --packages option.
With Docker, run: docker run -it -p 8888:8888 -e ACCEPT_EULA=yes mcr.microsoft.com/mmlspark/release jupyter notebook
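For the standalone Python route, the session setup might look like the sketch below. The coordinates, version, and resolver URL are the ones quoted above; the helper names `synapseml_spark_conf` and `create_session` and the application name are hypothetical. It assumes `pyspark` is installed and that the machine can reach the resolver.

```python
SYNAPSEML_VERSION = "1.0.14"  # version quoted in the text above
SCALA_VERSION = "2.12"        # Scala build quoted in the text above
RESOLVER = "https://mmlspark.azureedge.net/maven"


def synapseml_spark_conf(version=SYNAPSEML_VERSION, scala=SCALA_VERSION):
    """Return the Spark properties that pull SynapseML in from Maven."""
    coordinate = f"com.microsoft.azure:synapseml_{scala}:{version}"
    return {
        "spark.jars.packages": coordinate,
        "spark.jars.repositories": RESOLVER,
    }


def create_session(app_name="synapseml-demo"):
    """Build a SparkSession with the properties above (requires pyspark)."""
    # Imported lazily so the config helper works without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in synapseml_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same coordinate string is what would be passed to `--packages` on the Spark command line, so keeping it in one helper avoids the two routes drifting apart.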
Detailed documentation and examples are available on the SynapseML website.
Highlighted Details
Maintenance & Community
The project adheres to the Microsoft Open Source Code of Conduct, with contribution guidelines detailed in CONTRIBUTING.md. Feedback and issue reporting are managed via GitHub Issues. Related projects include Vowpal Wabbit, LightGBM, and Microsoft Cognitive Toolkit.
Licensing & Compatibility
The specific license is not detailed in the provided text. SynapseML is designed for broad compatibility, abstracting over diverse databases, file systems, and cloud data stores, and supporting multiple programming languages.
Limitations & Caveats
R support is noted as still under development, and some custom wrappers may be missing. Users must ensure compatibility with the specified Spark and Scala versions. Using the Docker image requires accepting an End-User License Agreement (EULA).