horovod by horovod

Distributed training framework for TF, Keras, PyTorch, and MXNet

Created 8 years ago
14,592 stars

Top 3.4% on SourcePulse

Project Summary

Horovod is a distributed deep learning training framework designed to simplify and accelerate the scaling of training workloads across multiple GPUs and nodes. It targets researchers and engineers working with TensorFlow, Keras, PyTorch, and Apache MXNet, enabling them to leverage distributed computing with minimal code changes and achieve significant performance gains.

How It Works

Horovod utilizes a ring-based AllReduce algorithm, inspired by Message Passing Interface (MPI) concepts, to efficiently synchronize gradients across workers. This approach minimizes communication overhead by interleaving gradient computation with communication and supports tensor fusion to batch small AllReduce operations, further boosting performance. It requires minimal code modifications to existing single-GPU training scripts.
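
To make this concrete, below is a minimal sketch of the typical PyTorch integration (build_model() and the training loop are placeholders; the hvd.* calls are Horovod's public PyTorch API):

    import torch
    import horovod.torch as hvd

    hvd.init()                               # start Horovod; each process learns its rank and world size
    torch.cuda.set_device(hvd.local_rank())  # pin each worker process to one GPU

    model = build_model().cuda()             # build_model() is a placeholder for your model definition
    # A common convention is to scale the learning rate by the number of workers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers via AllReduce
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Ensure all workers start from identical model and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

The rest of the training loop stays essentially unchanged from the single-GPU version, which is the "minimal code modifications" point above.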

Quick Start & Requirements

  • Install via pip: pip install horovod
  • For GPU support with NCCL: HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
  • Prerequisites: CMake, a C++17-compliant compiler (g++-8 or newer for TensorFlow 2.10+), and, depending on the build options chosen, MPI and/or NCCL.
  • Official Documentation: https://horovod.readthedocs.io/en/latest/
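
Once installed, training scripts are typically launched with the horovodrun wrapper; for example, on a single machine with 4 GPUs, or across two hosts with 4 GPUs each (train.py, server1, and server2 are placeholders):

    horovodrun -np 4 python train.py
    horovodrun -np 8 -H server1:4,server2:4 python train.py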

Highlighted Details

  • Benchmarks reported in the project README show 90% scaling efficiency for Inception V3 and ResNet-101 on large multi-GPU clusters.
  • Supports TensorFlow, Keras, PyTorch, and MXNet.
  • Features like Tensor Fusion, Horovod Timeline, and automated performance tuning optimize distributed training.
  • Can run with or without MPI, using Gloo as an alternative backend (selected at launch time, as shown below).
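
Backend selection and the Horovod Timeline are both controlled through documented horovodrun flags; a brief sketch (the script name and output path are placeholders):

    horovodrun --gloo -np 4 python train.py                                  # run over Gloo instead of MPI
    horovodrun -np 4 --timeline-filename /tmp/timeline.json python train.py  # record a Horovod Timeline trace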

Maintenance & Community

  • Hosted by the LF AI & Data Foundation.
  • Active community with Slack channels for discussion and announcements.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

  • Initial setup may require installing and configuring MPI and NCCL, which can be complex, particularly for multi-node clusters.
  • While Horovod is designed for ease of use, a working knowledge of distributed training concepts and of Horovod's API helps in reaching optimal performance.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai
Top 0% on SourcePulse · 309 stars
Framework for large-scale transformer optimization
Created 3 years ago · Updated 3 years ago

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed
Top 3.4% on SourcePulse · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface
Top 0.3% on SourcePulse · 9k stars
PyTorch training helper for distributed execution
Created 4 years ago · Updated 1 day ago

Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy
Top 0.2% on SourcePulse · 28k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago · Updated 2 months ago

Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech
Top 0.1% on SourcePulse · 41k stars
AI system for large-scale parallel training
Created 3 years ago · Updated 13 hours ago