flexflow-train by flexflow

Accelerating distributed deep learning training

Created 6 years ago
1,837 stars

Top 23.5% on SourcePulse

View on GitHub
Project Summary

FlexFlow Train is a deep learning framework engineered to accelerate distributed deep neural network (DNN) training. It tackles the complex problem of identifying optimal parallelization strategies by automating the search process, offering significant benefits to researchers and engineers engaged in large-scale DNN model development and deployment.

How It Works

The core innovation is the automated discovery of efficient parallelization strategies for distributed DNN training. Rather than relying on conventional data or model parallelism alone, FlexFlow Train jointly optimizes algebraic transformations and parallelization choices, searching a much larger space of execution plans with the goal of maximizing training throughput and efficiency.
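To make the idea concrete, below is a minimal, purely illustrative sketch of strategy search, not FlexFlow Train's actual algorithm or API: it enumerates per-layer combinations of data- and model-parallel degrees and keeps whichever configuration a toy cost model rates cheapest. NUM_DEVICES, LAYER_FLOPS, and estimated_cost are hypothetical placeholders; per the Unity paper, the real system also searches algebraic transformations and uses a far richer, simulator-backed cost model.

```python
# Illustrative sketch only -- NOT FlexFlow's real search or API.
from itertools import product

NUM_DEVICES = 4  # hypothetical device count
LAYER_FLOPS = {"conv1": 200.0, "conv2": 400.0, "dense": 100.0}  # toy workload

def estimated_cost(layer, data_par, model_par):
    """Toy cost model: compute time shrinks with total parallelism,
    communication time grows with the number of participating devices."""
    compute = LAYER_FLOPS[layer] / (data_par * model_par)
    comm = 5.0 * (data_par + model_par - 2)  # gradient sync + shard traffic
    return compute + comm

def search_plan():
    """Enumerate (data, model) parallel degrees whose product fits the
    device count and keep the cheapest configuration per layer."""
    plan = {}
    for layer in LAYER_FLOPS:
        candidates = [
            (d, m)
            for d, m in product(range(1, NUM_DEVICES + 1), repeat=2)
            if d * m <= NUM_DEVICES
        ]
        plan[layer] = min(candidates, key=lambda c: estimated_cost(layer, *c))
    return plan

if __name__ == "__main__":
    for layer, (d, m) in search_plan().items():
        print(f"{layer}: data_parallel={d}, model_parallel={m}")
```

Note that this sketch searches only two parallel degrees per layer; FlexFlow's published search space is considerably broader, which is what lets it uncover execution plans beyond standard data and model parallelism.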

Quick Start & Requirements

The provided README does not detail specific installation commands, prerequisites, or estimated setup times. Users interested in contributing code are directed to consult the CONTRIBUTING.md file.

Highlighted Details

  • The project's technical foundation is supported by multiple academic publications, including "Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization" (OSDI, 2022), "Beyond Data and Model Parallelism for Deep Neural Networks" (MLSys, 2019), and "Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks" (ICML, 2018).
  • It specifically targets accelerating DNN training by automatically discovering fast parallelization strategies, a key differentiator for performance-critical applications.

Maintenance & Community

FlexFlow Train is a collaborative effort, developed and maintained by prominent institutions including CMU, Facebook, Los Alamos National Lab, MIT, Stanford, and UCSD. The project encourages user engagement through issue submissions for bug reports and suggestions.

Licensing & Compatibility

The framework is licensed under the permissive Apache License 2.0, which generally allows for broad usage, modification, and distribution, including within commercial and closed-source applications.

Limitations & Caveats

A significant caveat is the repository's recent split: inference and serving functionalities have been migrated to a separate flexflow-serve repository. Users requiring these capabilities must refer to the latter. The current README does not specify any other known limitations or caveats regarding the training framework's functionality or stability.

Health Check

Last Commit: 14 hours ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 1
Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse
3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago
Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse
6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 5 months ago