libai by Oneflow-Inc

Large-scale distributed parallel training toolbox

Created 3 years ago
408 stars

Top 71.5% on SourcePulse

Project Summary

LiBai is a distributed parallel training toolbox for large-scale AI models, built on OneFlow. It targets researchers and engineers who need to train complex models efficiently across multiple devices and nodes, offering a flexible, modular framework for both Computer Vision and Natural Language Processing tasks.

How It Works

LiBai integrates multiple parallelism strategies (Data, Tensor, Pipeline) and training techniques (Mixed Precision, Activation Checkpointing, ZeRO) within a modular design. Its LazyConfig system uses plain Python for configuration, giving flexible syntax and structure, so users can build custom research projects or leverage the built-in trainer and engine for streamlined development.
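
As a rough illustration, here is a minimal sketch of the LazyCall pattern that LazyConfig builds on; the import path (`libai.config.LazyCall` / `instantiate`) and the field handling are assumptions based on the detectron2-style config system LiBai follows, not verified against the current API.

```python
# Hedged sketch of the LazyCall idea: a config node records a callable and its
# arguments; nothing is constructed until instantiate() resolves the tree.
import oneflow as flow
from libai.config import LazyCall, instantiate  # assumed import path

# Describe a layer declaratively; fields remain editable plain attributes.
layer_cfg = LazyCall(flow.nn.Linear)(in_features=768, out_features=768)
layer_cfg.out_features = 1024  # override before anything is built

layer = instantiate(layer_cfg)  # the concrete oneflow.nn.Linear is created here
```

Because config nodes are ordinary Python objects, they can be composed and overridden programmatically before training starts.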

Quick Start & Requirements

  • Installation: Refer to Installation instructions.
  • Getting Started: See Quick Run.
  • Documentation: Full API documentation and tutorials are available at LiBai's documentation.
  • Prerequisites: Requires OneFlow 0.7.0 (a minimal version check is sketched below). Specific hardware requirements (e.g., GPUs) depend on the model being trained.
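
A quick way to confirm the prerequisite before launching a training job (a hedged sketch; it only inspects the installed OneFlow version string):

```python
# Minimal environment check: LiBai's main branch targets OneFlow 0.7.0.
import oneflow as flow

print("OneFlow version:", flow.__version__)
assert flow.__version__.startswith("0.7"), (
    "LiBai's main branch is tied to OneFlow 0.7.0; found " + flow.__version__
)
```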

Highlighted Details

  • Supports a comprehensive suite of parallel training components and techniques (see the configuration sketch after this list).
  • Offers native support for models such as BLOOM, ChatGLM, Llama2, and MAE, with additional models (e.g., GPT2, LLAMA) supported through its mocking extension.
  • Includes predefined data processing for CV and NLP datasets.
  • Provides tools for model evaluation, including lm-evaluation-harness.
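
To make the parallelism highlights concrete, the fragment below sketches how the parallel degrees and training techniques are typically combined in a LiBai-style config. The import path and every field name shown are assumptions modeled on LiBai's documented config layout, not a verbatim config file.

```python
# Hedged sketch: combining data, tensor, and pipeline parallelism in a
# LiBai-style config (all paths and field names assumed).
from configs.common.train import train  # assumed bundled default training config

# 2 (data) x 2 (tensor) x 2 (pipeline) = 8 devices in total.
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2

# Training techniques highlighted above (flag names assumed).
train.amp.enabled = True                    # mixed precision
train.activation_checkpoint.enabled = True  # activation checkpointing
train.zero_optimization.enabled = True      # ZeRO
```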

Maintenance & Community

The most recent release, Beta 0.3.0, was published on March 11, 2024. Community contributions are welcome (see CONTRIBUTING), and a WeChat group is available for discussion.

Licensing & Compatibility

Released under the Apache 2.0 license. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The main branch is tied to OneFlow 0.7.0. Some models, like Stable Diffusion, are not yet fully supported for 3D parallel training.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm_training_handbook by huggingface

Top 0% on SourcePulse
511 stars
Handbook for large language model training methodologies
Created 2 years ago
Updated 1 year ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

Top 0.1% on SourcePulse
5k stars
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

Top 0.2% on SourcePulse
7k stars
Framework for training large-scale autoregressive language models
Created 4 years ago
Updated 2 days ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

Top 0.1% on SourcePulse
41k stars
AI system for large-scale parallel training
Created 3 years ago
Updated 13 hours ago