op_pangu by HongriJiujiu

Models for diagnosing silent inconsistency in distributed fine-tuning

Created 4 months ago

464 stars

Top 64.6% on SourcePulse

Project Summary

This repository offers three experimental fine-tuned models for studying "Silent Inconsistency" in synchronous distributed data-parallel (DDP) full-parameter fine-tuning. It targets researchers and engineers working with distributed training and describes a way to detect subtle worker-level optimization divergences that do not show up in global metrics, with the aim of making training more reliable.

How It Works

The project addresses hidden divergences in worker-level optimization dynamics during synchronous DDP training, where parameter synchronization does not guarantee consistent internal states across workers. It introduces three lightweight online monitoring metrics: Loss Dispersion, Gradient-Norm Dispersion, and Gradient-Direction Consistency. According to the project, these can be computed with negligible overhead and expose per-worker loss and gradient behavior that global loss curves hide, which gives a different angle for debugging distributed training.
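The summary names the three metrics but does not give their formulas. The following is a minimal sketch of one plausible reading, not the authors' definitions. It assumes Loss Dispersion is the population standard deviation of per-worker losses, Gradient-Norm Dispersion is the coefficient of variation of per-worker gradient L2 norms, and Gradient-Direction Consistency is the mean pairwise cosine similarity between per-worker gradients. All function names are hypothetical, and gradients are plain Python lists standing in for flattened tensors.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def l2_norm(vec):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = l2_norm(a), l2_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def loss_dispersion(losses):
    """ASSUMED definition: population std of per-worker losses."""
    return pstdev(losses)


def grad_norm_dispersion(grads):
    """ASSUMED definition: std of per-worker gradient norms / their mean."""
    norms = [l2_norm(g) for g in grads]
    m = mean(norms)
    return pstdev(norms) / m if m > 0 else 0.0


def grad_direction_consistency(grads):
    """ASSUMED definition: mean pairwise cosine similarity across workers."""
    pairs = list(combinations(grads, 2))
    if not pairs:  # a single worker is trivially consistent with itself
        return 1.0
    return mean(cosine(a, b) for a, b in pairs)


# Toy values for two workers (illustrative only, not measured data).
worker_losses = [1.0, 3.0]
worker_grads = [[1.0, 0.0], [0.0, 1.0]]

print(loss_dispersion(worker_losses))
print(grad_norm_dispersion(worker_grads))
print(grad_direction_consistency(worker_grads))
```

In a real DDP job the per-worker losses and gradient statistics would first have to be collected on one rank, for example with `torch.distributed.all_gather`, and the local gradients would need to be captured before DDP averages them across workers. How the project does this is not described in the summary.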

Quick Start & Requirements

Fine-tuned models are available on Hugging Face (https://huggingface.co/jiujiudahaozi/op_pangu). They are fully fine-tuned from openPangu-Embedded-1B-V1.1 (~1B parameters) using bf16 mixed precision on the tatsu-lab/alpaca dataset (https://huggingface.co/datasets/tatsu-lab/alpaca). Training used an Instruction-Input-Response template, max sequence length 1024, with loss computed only on response tokens. Inference requires a suitable GPU environment for a 1B parameter model.
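The training setup above (Instruction-Input-Response template, max sequence length 1024, loss on response tokens only) can be sketched as follows. This is an illustration under assumptions, not the project's preprocessing code: the exact template wording is not given, so `PROMPT_TEMPLATE` is a hypothetical Alpaca-style layout, `format_prompt` and `build_example` are hypothetical helper names, and token ids are passed in as plain lists instead of coming from the real tokenizer. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is a common way to exclude prompt tokens from the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
MAX_LEN = 1024       # max sequence length stated in the summary

# ASSUMED template wording; the summary only names the three sections.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def format_prompt(instruction, input_text):
    """Render the prompt part of one example (everything before the response)."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input=input_text)


def build_example(prompt_ids, response_ids, max_len=MAX_LEN):
    """Concatenate prompt and response ids and mask the prompt in the labels.

    Positions labelled IGNORE_INDEX contribute nothing to the loss, so only
    response tokens are trained on. Both sequences are truncated to max_len.
    """
    input_ids = (list(prompt_ids) + list(response_ids))[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    return input_ids, labels


# Toy ids standing in for tokenizer output (illustrative only).
input_ids, labels = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(input_ids)
print(labels)
```

Masking in the labels keeps the model input unchanged, so the response is still conditioned on the full prompt while only response positions are scored.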

Highlighted Details

Three models (S1-1, S1-2, S1-3) demonstrate varying degrees of worker-level optimization inconsistency.
Base model: openPangu-Embedded-1B-V1.1 causal LM (~1B parameters), trained with bf16 mixed precision.
Fine-tuned on the tatsu-lab/alpaca instruction dataset.
Introduces lightweight metrics for diagnosing silent inconsistencies in DDP.

Maintenance & Community

Contributors include Hong Li, Zhen Zhou, Honggang Zhang, Yuping Luo, Xinyue Wang, Han Gong, and Zhiyuan Liu. No community channels or roadmap links are provided.

Licensing & Compatibility

The README does not state a license, so suitability for commercial use or closed-source linking cannot be assessed without clarification from the authors.

Limitations & Caveats

This repository provides experimental models for diagnosing DDP silent inconsistencies, not training scripts or a general diagnostic toolkit. Its focus is on the phenomenon and resulting models, not on enabling users to reproduce experiments or apply diagnostics broadly.
