horovod by horovod

Distributed training framework for TF, Keras, PyTorch, and MXNet

Created 8 years ago
14,592 stars

Top 3.4% on SourcePulse

Project Summary

Horovod is a distributed deep learning training framework designed to simplify and accelerate the scaling of training workloads across multiple GPUs and nodes. It targets researchers and engineers working with TensorFlow, Keras, PyTorch, and Apache MXNet, enabling them to leverage distributed computing with minimal code changes and achieve significant performance gains.

How It Works

Horovod utilizes a ring-based AllReduce algorithm, inspired by Message Passing Interface (MPI) concepts, to efficiently synchronize gradients across workers. This approach minimizes communication overhead by interleaving gradient computation with communication and supports tensor fusion to batch small AllReduce operations, further boosting performance. It requires minimal code modifications to existing single-GPU training scripts.
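
To make this concrete, below is a minimal sketch of the typical PyTorch integration (build_model() and the training loop are placeholders; the hvd.* calls are Horovod's public PyTorch API):

    import torch
    import horovod.torch as hvd

    hvd.init()                               # start Horovod; each process learns its rank and world size
    torch.cuda.set_device(hvd.local_rank())  # pin each worker process to one GPU

    model = build_model().cuda()             # build_model() is a placeholder for your model definition
    # A common convention is to scale the learning rate by the number of workers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers via AllReduce
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Ensure all workers start from identical model and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

The rest of the training loop stays essentially unchanged from the single-GPU version, which is the "minimal code modifications" point above.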

Quick Start & Requirements

  • Install via pip: pip install horovod
  • For GPU support with NCCL: HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
  • Prerequisites: CMake, a C++17-compliant compiler (g++-8 or newer for TensorFlow 2.10+), and, depending on the build options chosen, MPI and/or NCCL.
  • Official Documentation: https://horovod.readthedocs.io/en/latest/
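
Once installed, training scripts are typically launched with the horovodrun wrapper; for example, on a single machine with 4 GPUs, or across two hosts with 4 GPUs each (train.py, server1, and server2 are placeholders):

    horovodrun -np 4 python train.py
    horovodrun -np 8 -H server1:4,server2:4 python train.py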

Highlighted Details

  • Benchmarks reported in the project README show 90% scaling efficiency for Inception V3 and ResNet-101 on large multi-GPU clusters.
  • Supports TensorFlow, Keras, PyTorch, and MXNet.
  • Features like Tensor Fusion, Horovod Timeline, and automated performance tuning optimize distributed training.
  • Can run with or without MPI, using Gloo as an alternative backend (selected at launch time, as shown below).
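
Backend selection and the Horovod Timeline are both controlled through documented horovodrun flags; a brief sketch (the script name and output path are placeholders):

    horovodrun --gloo -np 4 python train.py                                  # run over Gloo instead of MPI
    horovodrun -np 4 --timeline-filename /tmp/timeline.json python train.py  # record a Horovod Timeline trace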

Maintenance & Community

  • Hosted by the LF AI & Data Foundation.
  • Active community with Slack channels for discussion and announcements.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

  • Initial setup may require installing and configuring MPI and NCCL, which can be complex, particularly for multi-node clusters.
  • While Horovod is designed for ease of use, a working knowledge of distributed training concepts and of Horovod's API helps in reaching optimal performance.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai
Top 0% on SourcePulse · 309 stars
Framework for large-scale transformer optimization
Created 3 years ago · Updated 3 years ago

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed
Top 3.4% on SourcePulse · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface
Top 0.3% on SourcePulse · 9k stars
PyTorch training helper for distributed execution
Created 4 years ago · Updated 1 day ago

Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy
Top 0.2% on SourcePulse · 28k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago · Updated 2 months ago

Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech
Top 0.1% on SourcePulse · 41k stars
AI system for large-scale parallel training
Created 3 years ago · Updated 13 hours ago