UDA_pytorch by SanghunYun

PyTorch re-implementation of Google's UDA paper for semi-supervised learning

created 6 years ago
278 stars

Top 94.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Unsupervised Data Augmentation (UDA) for BERT, a semi-supervised learning technique that reported state-of-the-art results on several text classification benchmarks. It targets researchers and practitioners who want to improve model performance with limited labeled data, offering a substantial reduction in error rate compared to training on the labeled data alone.

How It Works

UDA combines a supervised and an unsupervised loss. The supervised loss is standard cross-entropy on the labeled data. The unsupervised loss is the KL divergence between the model's predictions on original unlabeled text and on a back-translated paraphrase of it. Training Signal Annealing (TSA) masks out labeled examples whose correct-class probability already exceeds a threshold that rises over the course of training, which prevents overfitting to the small labeled set. Confidence-based masking drops low-confidence unlabeled examples, and a softmax temperature sharpens the target distribution so that the unsupervised loss contributes a meaningful gradient.
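
To make the interplay of these losses concrete, here is a minimal PyTorch sketch of the combined objective. It is not the repository's actual code: the function name uda_loss, the argument names, and the default temperature and confidence values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, sup_inputs, sup_labels, unsup_orig, unsup_aug,
             tsa_threshold, confidence_thresh=0.45, softmax_temp=0.85,
             uda_coeff=1.0):
    """Illustrative UDA objective; names, shapes, and defaults are assumptions."""
    # Supervised branch: cross-entropy with TSA masking.
    sup_logits = model(sup_inputs)                          # (B, num_classes)
    ce = F.cross_entropy(sup_logits, sup_labels, reduction="none")
    correct_prob = torch.softmax(sup_logits, dim=-1).gather(
        -1, sup_labels.unsqueeze(-1)).squeeze(-1)
    keep = (correct_prob < tsa_threshold).float()           # drop "too easy" labeled examples
    sup_loss = (ce * keep).sum() / keep.sum().clamp(min=1.0)

    # Unsupervised branch: KL between predictions on original and augmented text.
    with torch.no_grad():                                   # targets carry no gradient
        orig_logits = model(unsup_orig)
        orig_probs = torch.softmax(orig_logits / softmax_temp, dim=-1)  # sharpened targets
        unsup_mask = (torch.softmax(orig_logits, dim=-1).max(dim=-1).values
                      > confidence_thresh).float()          # confidence-based masking
    aug_log_probs = torch.log_softmax(model(unsup_aug), dim=-1)
    kl = F.kl_div(aug_log_probs, orig_probs, reduction="none").sum(dim=-1)
    unsup_loss = (kl * unsup_mask).sum() / unsup_mask.sum().clamp(min=1.0)

    return sup_loss + uda_coeff * unsup_loss
```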

Quick Start & Requirements

  • Install via pip install -r requirements.txt (dependencies include fire, tqdm, tensorboardX, tensorflow, pytorch, pandas, numpy).
  • Requires pre-trained BERT models (download via download.sh) and the IMDb dataset.
  • Setup involves downloading models and data, then configuring json files for training or evaluation (see the illustrative config sketch after this list).
  • Official UDA paper: https://arxiv.org/abs/1904.12848
  • Pytorchic BERT (Kakao Brain): https://github.com/dhlee347/pytorchic-bert
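
Training and evaluation settings are supplied through json files. The snippet below only illustrates how such a config might be written and loaded from Python; the key names are guesses, so check the json files shipped with the repository for the actual schema.

```python
import json

# Hypothetical config: field names are illustrative, not the repo's actual schema.
example_cfg = {
    "mode": "train",                # "train" or "eval"
    "uda_mode": True,               # enable the unsupervised consistency loss
    "total_steps": 10000,
    "tsa": "linear_schedule",       # Training Signal Annealing schedule
    "uda_coeff": 1.0,               # weight on the unsupervised loss
    "uda_softmax_temp": 0.85,       # sharpening temperature
    "uda_confidence_thresh": 0.45,  # confidence-based masking threshold
}

with open("uda_example.json", "w") as f:
    json.dump(example_cfg, f, indent=2)

# A training script would then read the config back before building the model.
with open("uda_example.json") as f:
    cfg = json.load(f)
print(cfg["tsa"], cfg["uda_coeff"])
```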

Highlighted Details

  • Achieves 88.45% accuracy on IMDb with UDA, compared with roughly 90% reported by the official implementation and about 68% without UDA.
  • Implements UDA with BERT, leveraging back-translation for data augmentation.
  • Incorporates Training Signal Annealing (TSA) and prediction-sharpening techniques (a sketch of the TSA threshold schedules follows this list).
  • Includes utilities for loading TensorFlow BERT checkpoints and custom optimizers.
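
The TSA schedules themselves are simple. The sketch below follows the threshold formulas from the UDA paper (linear, log, and exponential schedules); the function name and schedule strings are illustrative, not the repository's exact identifiers.

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear_schedule"):
    """Labeled examples whose correct-class probability exceeds this value are masked out."""
    progress = step / total_steps
    if schedule == "linear_schedule":
        alpha = progress
    elif schedule == "log_schedule":
        alpha = 1.0 - math.exp(-progress * 5)
    elif schedule == "exp_schedule":
        alpha = math.exp((progress - 1.0) * 5)
    else:                                    # no annealing: keep every labeled example
        return 1.0
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

# Halfway through training on binary IMDb, the linear schedule keeps only
# labeled examples whose correct-class probability is still below ~0.75.
print(tsa_threshold(5000, 10000, num_classes=2))
```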

Maintenance & Community

  • Implemented by SanghunYun.
  • Builds on Google's UDA paper and Kakao Brain's Pytorchic BERT, both credited in the README.
  • TODO section indicates plans to add pre-training code.

Licensing & Compatibility

  • The README does not explicitly state a license. However, it references Google's UDA and Kakao Brain's Pytorchic BERT, which have their own licenses. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

The repository does not include code for further pre-training BERT on a task-specific corpus, which the README notes as a potential source of additional accuracy. Users who want that step must integrate pre-training from another BERT project.

Health Check

  • Last commit: 5 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 2 stars in the last 90 days
