UDA_pytorch by SanghunYun

PyTorch re-implementation of Google's UDA paper for semi-supervised learning

created 6 years ago
278 stars

Top 94.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Unsupervised Data Augmentation (UDA) for BERT, a semi-supervised learning technique that reported state-of-the-art results on several text classification benchmarks. It targets researchers and practitioners who want to improve model performance with limited labeled data, offering a substantial reduction in error rate compared to training on the labeled data alone.

How It Works

UDA combines a supervised and an unsupervised loss. The supervised loss is standard cross-entropy on the labeled data. The unsupervised loss is the KL divergence between the model's predictions on original unlabeled text and on a back-translated paraphrase of it. Training Signal Annealing (TSA) masks out labeled examples whose correct-class probability already exceeds a threshold that rises over the course of training, which prevents overfitting to the small labeled set. Confidence-based masking drops low-confidence unlabeled examples, and a softmax temperature sharpens the target distribution so that the unsupervised loss contributes a meaningful gradient.
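
To make the interplay of these losses concrete, here is a minimal PyTorch sketch of the combined objective. It is not the repository's actual code: the function name uda_loss, the argument names, and the default temperature and confidence values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, sup_inputs, sup_labels, unsup_orig, unsup_aug,
             tsa_threshold, confidence_thresh=0.45, softmax_temp=0.85,
             uda_coeff=1.0):
    """Illustrative UDA objective; names, shapes, and defaults are assumptions."""
    # Supervised branch: cross-entropy with TSA masking.
    sup_logits = model(sup_inputs)                          # (B, num_classes)
    ce = F.cross_entropy(sup_logits, sup_labels, reduction="none")
    correct_prob = torch.softmax(sup_logits, dim=-1).gather(
        -1, sup_labels.unsqueeze(-1)).squeeze(-1)
    keep = (correct_prob < tsa_threshold).float()           # drop "too easy" labeled examples
    sup_loss = (ce * keep).sum() / keep.sum().clamp(min=1.0)

    # Unsupervised branch: KL between predictions on original and augmented text.
    with torch.no_grad():                                   # targets carry no gradient
        orig_logits = model(unsup_orig)
        orig_probs = torch.softmax(orig_logits / softmax_temp, dim=-1)  # sharpened targets
        unsup_mask = (torch.softmax(orig_logits, dim=-1).max(dim=-1).values
                      > confidence_thresh).float()          # confidence-based masking
    aug_log_probs = torch.log_softmax(model(unsup_aug), dim=-1)
    kl = F.kl_div(aug_log_probs, orig_probs, reduction="none").sum(dim=-1)
    unsup_loss = (kl * unsup_mask).sum() / unsup_mask.sum().clamp(min=1.0)

    return sup_loss + uda_coeff * unsup_loss
```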

Quick Start & Requirements

  • Install via pip install -r requirements.txt (dependencies include fire, tqdm, tensorboardX, tensorflow, pytorch, pandas, numpy).
  • Requires pre-trained BERT models (download via download.sh) and the IMDb dataset.
  • Setup involves downloading models and data, then configuring json files for training or evaluation (see the illustrative config sketch after this list).
  • Official UDA paper: https://arxiv.org/abs/1904.12848
  • Pytorchic BERT (Kakao Brain): https://github.com/dhlee347/pytorchic-bert
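
Training and evaluation settings are supplied through json files. The snippet below only illustrates how such a config might be written and loaded from Python; the key names are guesses, so check the json files shipped with the repository for the actual schema.

```python
import json

# Hypothetical config: field names are illustrative, not the repo's actual schema.
example_cfg = {
    "mode": "train",                # "train" or "eval"
    "uda_mode": True,               # enable the unsupervised consistency loss
    "total_steps": 10000,
    "tsa": "linear_schedule",       # Training Signal Annealing schedule
    "uda_coeff": 1.0,               # weight on the unsupervised loss
    "uda_softmax_temp": 0.85,       # sharpening temperature
    "uda_confidence_thresh": 0.45,  # confidence-based masking threshold
}

with open("uda_example.json", "w") as f:
    json.dump(example_cfg, f, indent=2)

# A training script would then read the config back before building the model.
with open("uda_example.json") as f:
    cfg = json.load(f)
print(cfg["tsa"], cfg["uda_coeff"])
```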

Highlighted Details

  • Achieves 88.45% accuracy on IMDb with UDA, compared with roughly 90% reported by the official implementation and about 68% without UDA.
  • Implements UDA with BERT, leveraging back-translation for data augmentation.
  • Incorporates Training Signal Annealing (TSA) and prediction-sharpening techniques (a sketch of the TSA threshold schedules follows this list).
  • Includes utilities for loading TensorFlow BERT checkpoints and custom optimizers.
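
The TSA schedules themselves are simple. The sketch below follows the threshold formulas from the UDA paper (linear, log, and exponential schedules); the function name and schedule strings are illustrative, not the repository's exact identifiers.

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear_schedule"):
    """Labeled examples whose correct-class probability exceeds this value are masked out."""
    progress = step / total_steps
    if schedule == "linear_schedule":
        alpha = progress
    elif schedule == "log_schedule":
        alpha = 1.0 - math.exp(-progress * 5)
    elif schedule == "exp_schedule":
        alpha = math.exp((progress - 1.0) * 5)
    else:                                    # no annealing: keep every labeled example
        return 1.0
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

# Halfway through training on binary IMDb, the linear schedule keeps only
# labeled examples whose correct-class probability is still below ~0.75.
print(tsa_threshold(5000, 10000, num_classes=2))
```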

Maintenance & Community

  • Implemented by SanghunYun.
  • Builds on Google's UDA paper and Kakao Brain's Pytorchic BERT, both credited in the README.
  • TODO section indicates plans to add pre-training code.

Licensing & Compatibility

  • The README does not explicitly state a license. However, it references Google's UDA and Kakao Brain's Pytorchic BERT, which have their own licenses. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The repository does not include code for further pre-training BERT on a task-specific corpus, which the README notes as a potential source of additional accuracy. Users who want that step must integrate pre-training from another BERT project.

Health Check

  • Last commit: 5 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 2 stars in the last 90 days
