libai by Oneflow-Inc

Large-scale distributed parallel training toolbox

created 3 years ago
409 stars

Top 72.3% on sourcepulse

Project Summary

LiBai is a distributed parallel training toolbox for large-scale AI models, built on OneFlow. It targets researchers and engineers who need to train complex models efficiently across multiple devices and nodes, and it offers a flexible, modular framework for both Computer Vision and Natural Language Processing tasks.

How It Works

LiBai combines several parallelism strategies (Data, Tensor, Pipeline) with training techniques (Mixed Precision, Activation Checkpointing, ZeRO) in a modular design. Its LazyConfig system allows flexible configuration syntax and structure, so users can either build custom research projects or reuse LiBai's trainer and engine for streamlined development.
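
To make the two ideas above more concrete, here is a minimal, self-contained sketch in plain Python. It does not use LiBai or OneFlow and is not LiBai's API: `ParallelLayout`, `infer_layout`, and this toy `LazyCall` are hypothetical names introduced only for illustration. The sketch assumes the common 3D-parallel convention that the data, tensor, and pipeline degrees multiply to the total device count, and it shows the general "describe now, build later" pattern that lazy configuration systems follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: number of devices used along each parallelism axis."""
    data: int
    tensor: int
    pipeline: int

    @property
    def world_size(self) -> int:
        # Assumption: every device belongs to exactly one (data, tensor, pipeline) cell.
        return self.data * self.tensor * self.pipeline


def infer_layout(world_size: int, tensor: int = 1, pipeline: int = 1) -> ParallelLayout:
    """Derive the data-parallel degree from the device count and the other two degrees."""
    if world_size < 1 or tensor < 1 or pipeline < 1:
        raise ValueError("world_size and parallel degrees must be positive")
    model_parallel = tensor * pipeline
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} devices cannot be split into groups of {model_parallel}"
        )
    return ParallelLayout(data=world_size // model_parallel, tensor=tensor, pipeline=pipeline)


class LazyCall:
    """Toy lazy-config node (illustrative only): records a target and its
    keyword arguments, and calls the target only when instantiate() is invoked."""

    def __init__(self, target):
        self.target = target
        self.kwargs = {}

    def __call__(self, **kwargs):
        self.kwargs = dict(kwargs)
        return self

    def instantiate(self, **overrides):
        # Overrides win over the recorded arguments, so a config can be tweaked late.
        return self.target(**{**self.kwargs, **overrides})


# Describe the layout first; nothing is computed until instantiate() is called.
layout_cfg = LazyCall(infer_layout)(world_size=8, tensor=2, pipeline=2)
layout = layout_cfg.instantiate()
print(layout)  # ParallelLayout(data=2, tensor=2, pipeline=2)
```

With 8 devices, a tensor degree of 2 and a pipeline degree of 2 leave a data-parallel degree of 2. Calling `layout_cfg.instantiate(pipeline=4)` on the same recorded config would instead yield a data-parallel degree of 1, which is the kind of late override a lazy config is meant to allow.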

Quick Start & Requirements

  • Installation: Refer to Installation instructions.
  • Getting Started: See Quick Run.
  • Documentation: Full API documentation and tutorials are available at LiBai's documentation.
  • Prerequisites: Requires OneFlow 0.7.0. Specific hardware requirements (e.g., GPUs) depend on the model being trained.
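
Because the prerequisite names a specific OneFlow version, a quick environment check can catch a mismatch before training starts. The following is a rough sketch using only the Python standard library. The distribution name `"oneflow"` and the idea that builds may carry a suffix such as `+cu112` are assumptions, and `release_tuple`, `matches_required`, and `installed_oneflow_version` are hypothetical helper names, not part of LiBai or OneFlow.

```python
import re
from importlib import metadata

REQUIRED_ONEFLOW = "0.7.0"  # version named in the prerequisites above


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '0.7.0+cu112' -> (0, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def matches_required(installed: str, required: str = REQUIRED_ONEFLOW) -> bool:
    """True when the numeric release parts are identical (build suffixes ignored)."""
    return release_tuple(installed) == release_tuple(required)


def installed_oneflow_version(dist_name: str = "oneflow"):
    """Return the installed version string, or None if the package is absent.

    Assumption: the package is installed under the distribution name given here.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_oneflow_version()
if version is None:
    print("OneFlow is not installed in this environment")
elif not matches_required(version):
    print(f"Found OneFlow {version}, but {REQUIRED_ONEFLOW} is expected")
else:
    print(f"OneFlow {version} matches the expected release")
```

The comparison is deliberately strict (exact release match) because the summary describes the main branch as tied to OneFlow 0.7.0; whether later releases also work is not stated here.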

Highlighted Details

  • Supports data, tensor, and pipeline parallelism together with mixed precision, activation checkpointing, and ZeRO.
  • Offers native support for models like BLOOM, ChatGLM, Llama2, and MAE, with extensions via mocking for GPT2, LLAMA, and others.
  • Includes predefined data processing for CV and NLP datasets.
  • Provides tools for model evaluation, including lm-evaluation-harness.

Maintenance & Community

The most recent release noted is Beta 0.3.0, dated March 11, 2024. Contributions are welcome (see CONTRIBUTING), and a WeChat group is available for community discussion.

Licensing & Compatibility

Released under the Apache 2.0 license. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The main branch is tied to OneFlow 0.7.0. Some models, like Stable Diffusion, are not yet fully supported for 3D parallel training.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
