libai by Oneflow-Inc

Large-scale distributed parallel training toolbox

Created 3 years ago
408 stars

Top 71.5% on SourcePulse

Project Summary

LiBai is a distributed parallel training toolbox for large-scale AI models, built on OneFlow. It targets researchers and engineers who need to train complex models efficiently across multiple devices and nodes, offering a flexible, modular framework for both Computer Vision and Natural Language Processing tasks.

How It Works

LiBai integrates multiple parallelism strategies (Data, Tensor, Pipeline) and training techniques (Mixed Precision, Activation Checkpointing, ZeRO) within a modular design. Its LazyConfig system uses plain Python for configuration, giving flexible syntax and structure, so users can build custom research projects or leverage the built-in trainer and engine for streamlined development.
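
As a rough illustration, here is a minimal sketch of the LazyCall pattern that LazyConfig builds on; the import path (`libai.config.LazyCall` / `instantiate`) and the field handling are assumptions based on the detectron2-style config system LiBai follows, not verified against the current API.

```python
# Hedged sketch of the LazyCall idea: a config node records a callable and its
# arguments; nothing is constructed until instantiate() resolves the tree.
import oneflow as flow
from libai.config import LazyCall, instantiate  # assumed import path

# Describe a layer declaratively; fields remain editable plain attributes.
layer_cfg = LazyCall(flow.nn.Linear)(in_features=768, out_features=768)
layer_cfg.out_features = 1024  # override before anything is built

layer = instantiate(layer_cfg)  # the concrete oneflow.nn.Linear is created here
```

Because config nodes are ordinary Python objects, they can be composed and overridden programmatically before training starts.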

Quick Start & Requirements

  • Installation: Refer to Installation instructions.
  • Getting Started: See Quick Run.
  • Documentation: Full API documentation and tutorials are available at LiBai's documentation.
  • Prerequisites: Requires OneFlow 0.7.0 (a minimal version check is sketched below). Specific hardware requirements (e.g., GPUs) depend on the model being trained.
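
A quick way to confirm the prerequisite before launching a training job (a hedged sketch; it only inspects the installed OneFlow version string):

```python
# Minimal environment check: LiBai's main branch targets OneFlow 0.7.0.
import oneflow as flow

print("OneFlow version:", flow.__version__)
assert flow.__version__.startswith("0.7"), (
    "LiBai's main branch is tied to OneFlow 0.7.0; found " + flow.__version__
)
```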

Highlighted Details

  • Supports a comprehensive suite of parallel training components and techniques (see the configuration sketch after this list).
  • Offers native support for models such as BLOOM, ChatGLM, Llama2, and MAE, with additional models (e.g., GPT2, LLAMA) supported through its mocking extension.
  • Includes predefined data processing for CV and NLP datasets.
  • Provides tools for model evaluation, including lm-evaluation-harness.
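
To make the parallelism highlights concrete, the fragment below sketches how the parallel degrees and training techniques are typically combined in a LiBai-style config. The import path and every field name shown are assumptions modeled on LiBai's documented config layout, not a verbatim config file.

```python
# Hedged sketch: combining data, tensor, and pipeline parallelism in a
# LiBai-style config (all paths and field names assumed).
from configs.common.train import train  # assumed bundled default training config

# 2 (data) x 2 (tensor) x 2 (pipeline) = 8 devices in total.
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2

# Training techniques highlighted above (flag names assumed).
train.amp.enabled = True                    # mixed precision
train.activation_checkpoint.enabled = True  # activation checkpointing
train.zero_optimization.enabled = True      # ZeRO
```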

Maintenance & Community

The most recent release, Beta 0.3.0, was published on March 11, 2024. Community contributions are welcome (see CONTRIBUTING), and a WeChat group is available for discussion.

Licensing & Compatibility

Released under the Apache 2.0 license. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The main branch is tied to OneFlow 0.7.0. Some models, like Stable Diffusion, are not yet fully supported for 3D parallel training.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm_training_handbook by huggingface

Top 0% on SourcePulse
511 stars
Handbook for large language model training methodologies
Created 2 years ago
Updated 1 year ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

Top 0.1% on SourcePulse
5k stars
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

Top 0.2% on SourcePulse
7k stars
Framework for training large-scale autoregressive language models
Created 4 years ago
Updated 2 days ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

Top 0.1% on SourcePulse
41k stars
AI system for large-scale parallel training
Created 3 years ago
Updated 13 hours ago