PyTorch re-implementation of Google's UDA paper for semi-supervised learning
This repository provides a PyTorch implementation of Unsupervised Data Augmentation (UDA) for BERT, a semi-supervised learning technique that achieves state-of-the-art results in NLP tasks. It targets researchers and practitioners looking to improve model performance with limited labeled data, offering a significant reduction in error rates compared to fully supervised methods.
How It Works
UDA combines supervised and unsupervised losses. The supervised loss is standard cross-entropy on labeled data. The unsupervised loss uses KL-divergence between predictions on original and back-translated augmented unlabeled data. Training Signal Annealing (TSA) masks out examples with high predicted probabilities to prevent overfitting to labeled data, while confidence-based masking and softmax temperature control sharpen predictions to ensure the unsupervised loss contributes meaningfully.
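A minimal sketch of that combined objective in PyTorch is shown below. It assumes logits of shape (batch, num_classes); the function name, the fixed tsa_threshold (the repository anneals this threshold over training rather than fixing it), and the default threshold/temperature values are illustrative assumptions, not this repository's exact API.

    import torch
    import torch.nn.functional as F

    def uda_loss(sup_logits, labels, orig_logits, aug_logits,
                 tsa_threshold=0.8, conf_threshold=0.45, temperature=0.85):
        # Supervised term: cross-entropy with Training Signal Annealing (TSA).
        # Labeled examples the model already predicts correctly with
        # probability above the (normally annealed) threshold are masked out
        # so the model cannot overfit the small labeled set.
        sup_ce = F.cross_entropy(sup_logits, labels, reduction="none")
        correct_prob = torch.softmax(sup_logits, dim=-1).gather(
            1, labels.unsqueeze(1)).squeeze(1)
        sup_mask = (correct_prob < tsa_threshold).float()
        sup_loss = (sup_ce * sup_mask).sum() / sup_mask.sum().clamp(min=1.0)

        # Unsupervised term: KL divergence between the prediction on the
        # original unlabeled text (a fixed, sharpened target) and the
        # prediction on its back-translated augmentation.
        with torch.no_grad():
            orig_prob = torch.softmax(orig_logits, dim=-1)
            # Confidence-based masking: drop unlabeled examples the model is
            # unsure about, so the consistency signal stays meaningful.
            conf_mask = (orig_prob.max(dim=-1).values > conf_threshold).float()
            # Sharpen the target with a softmax temperature below 1.
            target_prob = torch.softmax(orig_logits / temperature, dim=-1)
        kl = F.kl_div(F.log_softmax(aug_logits, dim=-1), target_prob,
                      reduction="none").sum(dim=-1)
        unsup_loss = (kl * conf_mask).sum() / conf_mask.sum().clamp(min=1.0)

        return sup_loss + unsup_loss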
Quick Start & Requirements
pip install -r requirements.txt
(dependencies include fire, tqdm, tensorboardX, tensorflow, pytorch, pandas, and numpy). Download the pre-trained BERT model and the IMDb dataset with the provided script (download.sh), and supply the .json files for training or evaluation.
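As a rough illustration of that configuration-driven workflow, the snippet below loads such a file; the path and keys are hypothetical, so consult the repository's own .json files for the actual schema.

    import json

    # Hypothetical config path and keys, shown only to illustrate the kind
    # of settings a UDA training configuration typically carries.
    with open("config/uda.json") as f:
        cfg = json.load(f)

    print(cfg.get("total_steps"))       # total training steps
    print(cfg.get("unsup_ratio"))       # unlabeled-to-labeled batch ratio
    print(cfg.get("uda_softmax_temp"))  # sharpening temperature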
Maintenance & Community
The repository is marked inactive; its last update was about five years ago.
Limitations & Caveats
The repository does not include code for further pre-training BERT on domain-specific corpora, which is noted as a potential performance improvement. Users who want that step will need to integrate pre-training code from other BERT projects.
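For users who want that extra step, here is a minimal sketch of further masked-LM pre-training with the Hugging Face transformers library (an external project, not part of this repository); the corpus path, output directory, and hyperparameters are placeholders.

    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # corpus.txt is a placeholder for the domain-specific corpus.
    ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        max_length=128), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-further-pretrained",
                               num_train_epochs=1),
        train_dataset=ds,
        # Randomly masks 15% of tokens for the masked-LM objective.
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                      mlm_probability=0.15),
    )
    trainer.train()
    model.save_pretrained("bert-further-pretrained")

The resulting checkpoint can then stand in for the stock pre-trained BERT weights when fine-tuning with UDA.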