FedML by FedML-AI

ML library for distributed training, model serving, and federated learning

Created 5 years ago

3,994 stars

Top 12.1% on SourcePulse

3 Experts Love This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Omar Sanseviero

DevRel at Google DeepMind

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Project Summary

FedML is a unified, scalable machine learning library for distributed training, model serving, and federated learning across diverse hardware environments. It targets developers and researchers who need to run AI jobs efficiently on any GPU cloud or on-premise cluster. TensorOpera AI offers a complementary platform for generative AI and LLMs.

How It Works

FedML provides a unified MLOps layer consisting of Studio, for accessing and fine-tuning foundational models, and a Job Store of pre-built AI tasks. Its scheduler, TensorOpera Launch, optimizes GPU resource allocation and automates job execution across various compute topologies. The compute layer includes platforms for scalable model serving (Deploy), large-scale distributed training (Train), and federated learning (Federate). These platforms build on FedML's core library for cross-device and cross-cloud operations.
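
To make the federated learning part concrete, here is a minimal sketch of federated averaging (FedAvg), a common aggregation scheme in which a server combines client model weights in proportion to each client's local sample count. It is plain Python for illustration only, not FedML's API and not necessarily the algorithm FedML uses. The function name federated_average and the dictionary layout for weights are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count.

    `updates` is a list of (weights, num_samples) pairs, one per client.
    All clients are assumed to share the same parameter names and shapes.
    """
    total_samples = sum(num_samples for _, num_samples in updates)
    if total_samples <= 0:
        raise ValueError("need at least one client with a positive sample count")

    # Start from zeros with the same structure as the first client's weights.
    first_weights, _ = updates[0]
    averaged = {name: [0.0] * len(values) for name, values in first_weights.items()}

    for weights, num_samples in updates:
        share = num_samples / total_samples  # this client's contribution
        for name, values in weights.items():
            for i, value in enumerate(values):
                averaged[name][i] += share * value
    return averaged


if __name__ == "__main__":
    # Two illustrative clients holding 10 and 30 local samples.
    clients = [
        ({"w": [1.0, 2.0]}, 10),
        ({"w": [3.0, 4.0]}, 30),
    ]
    print(federated_average(clients))  # {'w': [2.5, 3.5]}
```

Only the weights and sample counts reach the aggregator in this sketch, which is the property that lets training data stay on the device or cloud where it was collected.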

Quick Start & Requirements

Installation: pip install fedml
Prerequisites: Python 3.7+ and either PyTorch or TensorFlow. A CUDA-capable GPU is recommended for performance.
Documentation: https://docs.TensorOpera.ai
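
As a quick sanity check after installation, the sketch below tests the prerequisites listed above using only the Python standard library. The helper names is_installed and check_prerequisites are hypothetical, and the module names assume the usual import names: fedml for FedML, torch for PyTorch, and tensorflow for TensorFlow. It does not verify GPU or CUDA availability.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum version taken from the prerequisites above


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_prerequisites() -> dict:
    """Report the Python version requirement and which packages are present."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "fedml": is_installed("fedml"),
        "torch": is_installed("torch"),
        "tensorflow": is_installed("tensorflow"),
    }


if __name__ == "__main__":
    report = check_prerequisites()
    for name, ok in report.items():
        print(f"{name}: {'yes' if ok else 'no'}")
    # Per the prerequisites, only one of the two frameworks is needed.
    if not (report["torch"] or report["tensorflow"]):
        print("Neither PyTorch nor TensorFlow was found.")
```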

Highlighted Details

Unified library for distributed training, model serving, and federated learning.
TensorOpera Launch acts as a cross-cloud scheduler for efficient GPU resource utilization.
Supports on-device training on smartphones and cross-cloud GPU servers via federated learning.
Offers pre-built jobs and foundational models for generative AI and LLMs.

Maintenance & Community

Community channels: Slack (https://join.slack.com/t/fedml/shared_invite/zt-havwx1ee-a1xfOUrATNfc9DFqU~r34w), Discord (https://discord.gg/9xkW8ae6RV).
Code of conduct: follows the Contributor Covenant.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: The permissive Apache 2.0 license allows commercial use and integration into closed-source projects.

Limitations & Caveats

The project is tightly integrated with the TensorOpera AI platform, so advanced features may depend on that ecosystem and carry a risk of vendor lock-in. The README describes its federated learning tooling as the "world’s first FLOps", a marketing claim that may point to early-stage or experimental features in that component.

Health Check

Last commit: 2 months ago
Responsiveness: 1 week

Star History

13 stars in the last 30 days