flexflow-train by flexflow

Accelerating distributed deep learning training

Created 6 years ago
1,837 stars

Top 23.5% on SourcePulse

View on GitHub
Project Summary

FlexFlow Train is a deep learning framework engineered to accelerate distributed deep neural network (DNN) training. It tackles the complex problem of identifying optimal parallelization strategies by automating the search process, offering significant benefits to researchers and engineers engaged in large-scale DNN model development and deployment.

How It Works

The core innovation is the automated discovery of efficient parallelization strategies for distributed DNN training. Rather than relying on conventional data or model parallelism alone, FlexFlow Train jointly optimizes algebraic transformations and parallelization choices, searching a much larger space of execution plans with the goal of maximizing training throughput and efficiency.
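To make the idea concrete, below is a minimal, purely illustrative sketch of strategy search, not FlexFlow Train's actual algorithm or API: it enumerates per-layer combinations of data- and model-parallel degrees and keeps whichever configuration a toy cost model rates cheapest. NUM_DEVICES, LAYER_FLOPS, and estimated_cost are hypothetical placeholders; per the Unity paper, the real system also searches algebraic transformations and uses a far richer, simulator-backed cost model.

```python
# Illustrative sketch only -- NOT FlexFlow's real search or API.
from itertools import product

NUM_DEVICES = 4  # hypothetical device count
LAYER_FLOPS = {"conv1": 200.0, "conv2": 400.0, "dense": 100.0}  # toy workload

def estimated_cost(layer, data_par, model_par):
    """Toy cost model: compute time shrinks with total parallelism,
    communication time grows with the number of participating devices."""
    compute = LAYER_FLOPS[layer] / (data_par * model_par)
    comm = 5.0 * (data_par + model_par - 2)  # gradient sync + shard traffic
    return compute + comm

def search_plan():
    """Enumerate (data, model) parallel degrees whose product fits the
    device count and keep the cheapest configuration per layer."""
    plan = {}
    for layer in LAYER_FLOPS:
        candidates = [
            (d, m)
            for d, m in product(range(1, NUM_DEVICES + 1), repeat=2)
            if d * m <= NUM_DEVICES
        ]
        plan[layer] = min(candidates, key=lambda c: estimated_cost(layer, *c))
    return plan

if __name__ == "__main__":
    for layer, (d, m) in search_plan().items():
        print(f"{layer}: data_parallel={d}, model_parallel={m}")
```

Note that this sketch searches only two parallel degrees per layer; FlexFlow's published search space is considerably broader, which is what lets it uncover execution plans beyond standard data and model parallelism.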

Quick Start & Requirements

The provided README does not detail specific installation commands, prerequisites, or estimated setup times. Users interested in contributing code are directed to consult the CONTRIBUTING.md file.

Highlighted Details

  • The project's technical foundation is supported by multiple academic publications, including "Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization" (OSDI, 2022), "Beyond Data and Model Parallelism for Deep Neural Networks" (MLSys, 2019), and "Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks" (ICML, 2018).
  • It specifically targets accelerating DNN training by automatically discovering fast parallelization strategies, a key differentiator for performance-critical applications.

Maintenance & Community

FlexFlow Train is a collaborative effort, developed and maintained by prominent institutions including CMU, Facebook, Los Alamos National Lab, MIT, Stanford, and UCSD. The project encourages user engagement through issue submissions for bug reports and suggestions.

Licensing & Compatibility

The framework is licensed under the permissive Apache License 2.0, which generally allows for broad usage, modification, and distribution, including within commercial and closed-source applications.

Limitations & Caveats

A significant caveat is the repository's recent split: inference and serving functionalities have been migrated to a separate flexflow-serve repository. Users requiring these capabilities must refer to the latter. The current README does not specify any other known limitations or caveats regarding the training framework's functionality or stability.

Health Check

Last Commit: 14 hours ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 1
Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse
3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago
Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse
6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 5 months ago