op_pangu  by HongriJiujiu

Models for diagnosing silent inconsistency in distributed fine-tuning

Created 2 months ago
534 stars

Top 58.8% on SourcePulse

Summary

This repository provides three experimental fine-tuned models for diagnosing "Silent Inconsistency" in synchronous data-parallel (DDP) full-parameter fine-tuning. It targets researchers and engineers working on distributed training, and describes a way to detect subtle worker-level optimization divergences that global metrics do not reveal, with the aim of making training more reliable.

How It Works

The project addresses hidden divergences in worker-level optimization dynamics during synchronous DDP training, where parameter synchronization does not guarantee that workers' internal states stay consistent. It introduces three lightweight, online monitoring metrics (Loss Dispersion, Gradient-Norm Dispersion, and Gradient-Direction Consistency) that can be computed with negligible overhead. These metrics expose per-worker loss and gradient behavior that global loss curves hide, giving a new way to debug distributed training.
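The summary names the three metrics but does not give their formulas, so the following is only a minimal sketch under assumed definitions: dispersion is taken as the coefficient of variation across workers, and direction consistency as the mean cosine similarity between each worker's gradient and the worker-averaged gradient. The function names and the plain-list inputs are hypothetical. In an actual DDP run the per-worker values would have to be collected before gradients are all-reduced.

```python
import math
from statistics import mean, pstdev

EPS = 1e-12  # guards against division by zero


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def loss_dispersion(worker_losses):
    """Assumed definition: std / |mean| of the per-worker losses."""
    return pstdev(worker_losses) / (abs(mean(worker_losses)) + EPS)


def grad_norm_dispersion(worker_grads):
    """Assumed definition: std / mean of the per-worker gradient L2 norms."""
    norms = [l2_norm(g) for g in worker_grads]
    return pstdev(norms) / (mean(norms) + EPS)


def grad_direction_consistency(worker_grads):
    """Assumed definition: mean cosine similarity to the averaged gradient."""
    dim = len(worker_grads[0])
    avg = [mean(g[i] for g in worker_grads) for i in range(dim)]
    avg_norm = l2_norm(avg)
    sims = []
    for g in worker_grads:
        dot = sum(a * b for a, b in zip(g, avg))
        sims.append(dot / (l2_norm(g) * avg_norm + EPS))
    return mean(sims)


# Toy data: two workers whose gradients are orthogonal but equal in norm.
losses = [1.0, 1.0]
grads = [[1.0, 0.0], [0.0, 1.0]]
print(loss_dispersion(losses))
print(grad_norm_dispersion(grads))
print(grad_direction_consistency(grads))
```

In this toy case the losses and gradient norms agree across workers, so both dispersion values are zero, while the direction metric falls below 1. That is the kind of disagreement a global loss curve would not show.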

Quick Start & Requirements

Fine-tuned models are available on Hugging Face (https://huggingface.co/jiujiudahaozi/op_pangu). They are fully fine-tuned from openPangu-Embedded-1B-V1.1 (~1B parameters) using bf16 mixed precision on the tatsu-lab/alpaca dataset (https://huggingface.co/datasets/tatsu-lab/alpaca). Training used an Instruction-Input-Response template with a maximum sequence length of 1024, and loss was computed only on response tokens. Inference requires a GPU environment able to host a 1B-parameter model; in bf16 the weights alone occupy roughly 2 GB.
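To illustrate what "loss computed only on response tokens" means in practice, here is a hedged sketch of building one training example. The exact template wording, the character-level `toy_tokenize` stand-in, and the helper names are assumptions for illustration, not the authors' code. The label value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss
MAX_LEN = 1024       # maximum sequence length stated in the summary


def build_prompt(instruction, input_text=""):
    """Assumed Instruction-Input-Response layout (wording is hypothetical)."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer id per character."""
    return [ord(ch) for ch in text]


def build_example(instruction, input_text, response, max_len=MAX_LEN):
    prompt_ids = toy_tokenize(build_prompt(instruction, input_text))
    response_ids = toy_tokenize(response)
    input_ids = (prompt_ids + response_ids)[:max_len]
    # Mask every prompt position so that only response tokens are scored.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return input_ids, labels


input_ids, labels = build_example("Add the numbers.", "1 2", "3")
scored = [tok for tok in labels if tok != IGNORE_INDEX]
print(len(input_ids), len(scored))
```

With a real tokenizer the same masking applies unchanged: the prompt span is labeled -100 and the response span keeps its token ids.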

Highlighted Details

  • Three models (S1-1, S1-2, S1-3) demonstrate varying degrees of worker-level optimization inconsistency.
  • Base model: openPangu-Embedded-1B-V1.1 causal LM (~1B parameters), trained with bf16 mixed precision.
  • Fine-tuned on the tatsu-lab/alpaca instruction dataset.
  • Introduces lightweight metrics for diagnosing silent inconsistencies in DDP.

Maintenance & Community

Contributors include Hong Li, Zhen Zhou, Honggang Zhang, Yuping Luo, Xinyue Wang, Han Gong, and Zhiyuan Liu. No community channels or roadmap links are provided.

Licensing & Compatibility

The README does not state a license, so suitability for commercial use or closed-source linking cannot be assessed without clarification from the authors.

Limitations & Caveats

This repository provides experimental models for diagnosing DDP silent inconsistencies, not training scripts or a general diagnostic toolkit. Its focus is on the phenomenon and resulting models, not on enabling users to reproduce experiments or apply diagnostics broadly.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
