higgsfield by higgsfield-ai

ML framework for large model training and GPU orchestration

Created 7 years ago
3,478 stars

Top 13.9% on SourcePulse

Project Summary

Higgsfield is an open-source framework designed for orchestrating GPU workloads and training massive machine learning models, particularly LLMs with trillions of parameters. It targets researchers and engineers dealing with distributed training complexities, offering fault tolerance, scalability, and simplified environment management.

How It Works

Higgsfield acts as a GPU workload manager: it allocates compute resources and supports sharding techniques such as DeepSpeed ZeRO-3 and PyTorch's Fully Sharded Data Parallel (FSDP). Both techniques partition model parameters, gradients, and optimizer states across GPUs and nodes rather than replicating them on each device, which is what makes trillion-parameter models trainable. Higgsfield also integrates with CI/CD pipelines (GitHub Actions) to automate deployment and execution of training experiments.
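To make the effect of sharding concrete, here is a rough back-of-the-envelope sketch. It is not part of Higgsfield. It assumes the byte accounting commonly used for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function and class names are hypothetical.

```python
from dataclasses import dataclass

# Assumed per-parameter costs for mixed-precision Adam training.
BYTES_FP16_PARAM = 2    # fp16 copy of the weights
BYTES_FP16_GRAD = 2     # fp16 gradients
BYTES_OPTIM_STATE = 12  # fp32 master weights + Adam momentum + Adam variance


@dataclass
class MemoryEstimate:
    """Per-GPU memory for model state only, in gigabytes (1 GB = 1e9 bytes)."""
    params_gb: float
    grads_gb: float
    optim_gb: float

    @property
    def total_gb(self) -> float:
        return self.params_gb + self.grads_gb + self.optim_gb


def per_gpu_memory(num_params: float, num_gpus: int, shard: bool = True) -> MemoryEstimate:
    """Estimate model-state memory per GPU.

    With shard=True, parameters, gradients, and optimizer states are all
    split evenly across GPUs (the ZeRO-3 / FSDP full-shard idea).
    With shard=False, every GPU holds a full replica (plain data parallelism).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    divisor = num_gpus if shard else 1
    to_gb = 1e9
    return MemoryEstimate(
        params_gb=num_params * BYTES_FP16_PARAM / divisor / to_gb,
        grads_gb=num_params * BYTES_FP16_GRAD / divisor / to_gb,
        optim_gb=num_params * BYTES_OPTIM_STATE / divisor / to_gb,
    )
```

Under these assumptions, `per_gpu_memory(1e12, 1024).total_gb` comes to about 15.6 GB per GPU for a one-trillion-parameter model on 1,024 GPUs, while `per_gpu_memory(1e12, 1024, shard=False).total_gb` is 16,000 GB, far beyond any single device. Real runs need additional memory for activations and communication buffers.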

Quick Start & Requirements

  • Install: pip install higgsfield==0.0.3
  • Requirements: Ubuntu nodes with SSH access, non-root user with passwordless sudo. Tested on Azure, LambdaLabs, FluidStack.
  • Setup: Requires node setup, environment configuration, and Git integration.
  • Links: Quick Start Guide, Tutorial
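The node requirements above (SSH access, a non-root user, passwordless sudo) can be checked before setup with a small preflight script. The sketch below is an illustration only and is not part of Higgsfield. It relies on standard OpenSSH and sudo options (`BatchMode=yes`, `ConnectTimeout`, `sudo -n`). The function names are hypothetical, and it does not verify that the node runs Ubuntu.

```python
import subprocess
from typing import List


def build_check_command(user: str, host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that succeeds only if key-based login works
    and the remote user can run sudo without a password prompt."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",  # fail fast on unreachable hosts
        f"{user}@{host}",
        "sudo", "-n", "true",                 # -n: fail instead of prompting
    ]


def node_is_ready(user: str, host: str) -> bool:
    """Return True if the node accepts SSH and grants passwordless sudo."""
    if user == "root":
        # The requirements call for a non-root user.
        return False
    try:
        result = subprocess.run(
            build_check_command(user, host),
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is missing locally, or the connection hung.
        return False
    return result.returncode == 0
```

A call such as `node_is_ready("ubuntu", "203.0.113.10")` (a placeholder user and a documentation-range address) would be run once per node before starting the project's own setup steps.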

Highlighted Details

  • Supports ZeRO-3 and PyTorch Fully Sharded Data Parallel for trillion-parameter models.
  • Automates deployment and execution via GitHub Actions integration.
  • Simplifies environment management, eliminating dependency version conflicts.
  • Provides a streamlined interface for defining and managing experiments, reducing configuration complexity.

Maintenance & Community

  • Community support via GitHub Issues and Twitter.
  • Website for discussions and news.

Licensing & Compatibility

  • License: Not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore unclear.

Limitations & Caveats

The project is at version 0.0.3, which suggests it is still in early development. No license is specified, which may block commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 30 days

Explore Similar Projects

  • lingua by facebookresearch: LLM research codebase for training and inference. Top 0.1%, 5k stars. Created 11 months ago, updated 2 months ago. Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.
  • accelerate by huggingface: PyTorch training helper for distributed execution. Top 0.3%, 9k stars. Created 4 years ago, updated 1 day ago. Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.