AzureML-BERT by microsoft

End-to-end recipes for BERT pre-training and fine-tuning

created 6 years ago
398 stars

Top 73.7% on sourcepulse

View on GitHub
Project Summary

This repository provides end-to-end recipes for pre-training and fine-tuning BERT models using Azure Machine Learning Service. It targets NLP researchers and engineers who need to build custom language representation models on domain-specific data or adapt existing BERT models for specialized tasks, offering a stable and predictable workflow for large-scale distributed training.

How It Works

The pre-training recipe is built on PyTorch and Hugging Face's pytorch-pretrained-bert v0.6.2, incorporating optimization techniques such as gradient accumulation and mixed-precision training to handle large models and datasets efficiently. The fine-tuning recipe demonstrates adapting pre-trained checkpoints to downstream tasks, evaluating against the GLUE benchmark on Azure ML. Together, the recipes aim to simplify the otherwise complex process of distributed training and model configuration for large language models.
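
The recipes predate PyTorch's built-in AMP (mixed precision was typically handled via NVIDIA Apex at the time); the sketch below illustrates the same two ideas, gradient accumulation and mixed-precision training, using the stock torch.cuda.amp API. The model, optimizer, and loader are toy stand-ins, not code from this repository, and a CUDA device is assumed.

    import torch
    from torch import nn

    # Toy stand-ins for the real BERT model and corpus loader; assumes a CUDA device.
    model = nn.Linear(128, 2).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
    loader = [(torch.randn(8, 128).cuda(), torch.randint(0, 2, (8,)).cuda())
              for _ in range(16)]

    scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16
    accumulation_steps = 4                # effective batch = 4 x per-step batch

    model.train()
    optimizer.zero_grad()
    for step, (features, labels) in enumerate(loader):
        with torch.cuda.amp.autocast():   # forward pass in mixed precision
            loss = nn.functional.cross_entropy(model(features), labels)
        # Divide so the accumulated gradients match one large-batch update.
        scaler.scale(loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)        # unscales gradients, then optimizer.step()
            scaler.update()
            optimizer.zero_grad()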

Quick Start & Requirements
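
The recipes run on Azure Machine Learning and assume a workspace plus a GPU compute target. Below is a minimal job-submission sketch, assuming the Azure ML SDK v1 PyTorch estimator (the API generation these notebooks targeted); the workspace config, cluster name, and script paths are placeholders, not paths from this repository.

    from azureml.core import Workspace, Experiment
    from azureml.train.dnn import PyTorch

    # Assumes a config.json for the workspace, downloaded from the Azure portal.
    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name="bert-finetune-glue")

    # Hypothetical source directory, entry script, and cluster name.
    estimator = PyTorch(
        source_directory="./finetune",
        entry_script="train.py",
        compute_target=ws.compute_targets["gpu-cluster"],
        node_count=1,
        use_gpu=True,
    )

    run = experiment.submit(estimator)
    run.wait_for_completion(show_output=True)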

Highlighted Details

  • End-to-end recipes for both pre-training from scratch and fine-tuning BERT.
  • Includes data preprocessing scripts for repeatability and custom corpus usage.
  • Supports distributed training with gradient accumulation and mixed-precision.
  • Provides notebooks for evaluating against the GLUE benchmark (a minimal loading sketch follows this list).
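
At their core, the fine-tuning notebooks load a pre-trained encoder into a task-specific head and train on a GLUE task. A minimal loading sketch against the pytorch-pretrained-bert 0.6.2 API (the package version the recipes use); the checkpoint path is a hypothetical placeholder:

    import torch
    from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # Binary classification head (e.g. MRPC) on top of the BERT encoder.
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)

    # Optionally swap in encoder weights produced by the pre-training recipe.
    # "pretrained_checkpoint.pt" is a hypothetical placeholder path.
    state = torch.load("pretrained_checkpoint.pt", map_location="cpu")
    model.bert.load_state_dict(state, strict=False)

    tokens = ["[CLS]"] + tokenizer.tokenize("distributed pre-training") + ["[SEP]"]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    logits = model(input_ids)  # v0.6.2 returns logits when no labels are passed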

Maintenance & Community

This repository is from Microsoft. A note dated 7/7/2020 points to a more recent, significantly faster BERT pre-training implementation built on ONNX Runtime.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README text. Compatibility for commercial use or closed-source linking would depend on the specific license chosen for the repository.

Limitations & Caveats

The README explicitly points to a more recent, significantly faster implementation using ONNX Runtime for BERT pretraining, suggesting this repository's pretraining recipe may be outdated or less performant. The setup requires an Azure ML service account and potentially substantial GPU resources.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

Explore Similar Projects

xlnet by zihangdai

Language model research paper using generalized autoregressive pretraining. 6k stars; created 6 years ago, updated 2 years ago.