electra by google-research

Text encoder pre-training via GAN-like discriminator

Created 5 years ago
2,362 stars

Top 19.4% on SourcePulse

Project Summary

ELECTRA offers a self-supervised method for pre-training transformer text encoders, designed for efficiency and state-of-the-art performance. It targets researchers and practitioners in NLP who need to pre-train or fine-tune language models for downstream tasks like classification, question answering, and sequence tagging. The core benefit is achieving strong results with significantly less compute compared to generator-based pre-training methods.

How It Works

ELECTRA trains models as discriminators that distinguish between "real" input tokens and "fake" tokens produced by a small auxiliary generator network. This "replaced token detection" objective is more sample-efficient than traditional masked language modeling because the loss is defined over every input token rather than only the small masked subset, allowing faster pre-training and better performance with limited compute. The repository also includes code for "Electric," an energy-based variant for more principled negative sampling.
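
To make the training signal concrete, here is a minimal, self-contained Python sketch of a replaced-token-detection loss. It illustrates the objective described above; it is not code from this repository, and the toy sequence, the pretend generator sample, and the random discriminator scores are all invented for the example.

# Minimal sketch of the replaced-token-detection objective (illustrative only, not the repo's code).
# A small generator fills in masked positions; the discriminator (ELECTRA) then labels every token
# of the corrupted sequence as "original" or "replaced" and is trained with per-token binary
# cross-entropy, so the loss covers all positions rather than only the masked ones.
import numpy as np

def replaced_token_detection_loss(original_ids, generator_sample, disc_logits, mask_positions):
    """original_ids:     (seq_len,) int ids of the real text
    generator_sample: (seq_len,) int ids after the generator fills the masked positions
    disc_logits:      (seq_len,) float discriminator scores (higher = more likely "replaced")
    mask_positions:   (seq_len,) bool, True where a token was masked out and re-sampled"""
    # A position counts as "replaced" only if the generator sampled a different token;
    # if the generator happens to reproduce the original token, the label stays "original".
    is_replaced = (generator_sample != original_ids) & mask_positions
    labels = is_replaced.astype(np.float64)
    # Sigmoid binary cross-entropy averaged over *all* tokens.
    probs = 1.0 / (1.0 + np.exp(-disc_logits))
    eps = 1e-9
    per_token = -(labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps))
    return per_token.mean()

# Toy example: an 8-token sequence with 2 masked positions proposed by a (pretend) generator.
rng = np.random.default_rng(0)
original = np.array([12, 7, 99, 3, 45, 6, 21, 2])
masked = np.zeros(8, dtype=bool)
masked[[2, 5]] = True
gen_sample = original.copy()
gen_sample[2] = 54                      # generator's guess at position 2 is wrong -> "replaced"
logits = rng.normal(size=8)             # scores from an untrained discriminator
print(replaced_token_detection_loss(original, gen_sample, logits, masked))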

Quick Start & Requirements

  • Install: Requires Python 3, TensorFlow 1.15, NumPy, scikit-learn, and SciPy.
  • Pre-training:
    • Prepare data with build_pretraining_dataset.py (requires BERT's WordPiece vocabulary file); see the example commands after this list.
    • Train using run_pretraining.py. A small model trained on OpenWebText takes ~4 days on a V100 GPU.
    • Pre-training data (tfrecords) requires ~30GB disk space.
  • Fine-tuning:
    • Download pre-trained models or train your own.
    • Fine-tune using run_finetuning.py for tasks like GLUE, SQuAD, and sequence tagging.
  • Links: Official Paper, Electric Paper, BERT Vocabulary
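
A minimal command-line sketch of the flow above, with $DATA_DIR, the model name, and the task name as placeholders; the flag names follow the upstream README, but treat them as assumptions to verify against each script's --help rather than an authoritative reference.

# One possible environment setup for the requirements listed above (GPU build of TF 1.15).
pip install "tensorflow-gpu==1.15" numpy scikit-learn scipy

# 1) Build pre-training tfrecords from a directory of raw-text files
#    (vocab.txt is BERT's WordPiece vocabulary).
python3 build_pretraining_dataset.py \
    --corpus-dir $DATA_DIR/corpus \
    --vocab-file $DATA_DIR/vocab.txt \
    --output-dir $DATA_DIR/pretrain_tfrecords \
    --max-seq-length 128 \
    --num-processes 5

# 2) Pre-train a small ELECTRA model on those tfrecords (~4 days on a single V100).
python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt

# 3) Fine-tune the resulting (or a downloaded) checkpoint, e.g. on a GLUE task.
python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt \
    --hparams '{"model_size": "small", "task_names": ["mnli"]}'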

Highlighted Details

  • ELECTRA-Large achieves a GLUE score of 85.2, outperforming ALBERT and XLNet.
  • ELECTRA-Base (82.7 GLUE) outperforms BERT-Large.
  • ELECTRA-Small (77.4 GLUE) offers competitive performance without distillation.
  • Supports fine-tuning on GLUE, SQuAD (1.1 & 2.0), MRQA, and sequence tagging tasks.

Maintenance & Community

  • Developed by Google Research.
  • Contact: Kevin Clark (kevclark@cs.stanford.edu) for personal communication; submit GitHub issues for general support.

Licensing & Compatibility

  • Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Requires TensorFlow 1.15; TensorFlow 2.0 support is planned but not guaranteed.
  • The original pre-training dataset used in the paper is not publicly available, requiring users to source their own data or use alternatives like OpenWebText.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
