electra by google-research

Text encoder pre-training via GAN-like discriminator

created 5 years ago
2,358 stars

Top 19.9% on sourcepulse

Project Summary

ELECTRA offers a self-supervised method for pre-training transformer text encoders, designed for efficiency and state-of-the-art performance. It targets NLP researchers and practitioners who need to pre-train or fine-tune language models for downstream tasks such as classification, question answering, and sequence tagging. The core benefit is strong results with significantly less compute than masked language modeling pre-training methods such as BERT.

How It Works

ELECTRA trains models as discriminators that distinguish between "real" input tokens and "fake" tokens generated by a smaller, auxiliary network. This "replaced token detection" objective is more sample-efficient than traditional masked language modeling, allowing for faster pre-training and better performance with limited compute. The repository also includes code for "Electric," an energy-based variant for more principled negative sampling.
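
To make the objective concrete, below is a minimal NumPy sketch of how a replaced-token-detection training example is formed. It is an illustration, not the repository's code: the random sampler stands in for the small generator network, and the vocabulary size and 15% corruption rate are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB_SIZE = 30522   # assumed: size of a BERT-style WordPiece vocabulary
    CORRUPT_RATE = 0.15  # assumed: fraction of positions the generator corrupts

    def make_rtd_example(token_ids):
        """Build (corrupted_input, labels) for replaced token detection.

        A real generator is a small masked LM; random sampling stands in
        for it here so the sketch stays self-contained.
        """
        token_ids = np.asarray(token_ids)
        corrupted = token_ids.copy()

        # Pick a subset of positions for the generator to corrupt.
        n_corrupt = max(1, int(CORRUPT_RATE * len(token_ids)))
        positions = rng.choice(len(token_ids), size=n_corrupt, replace=False)

        # The "generator" proposes replacement tokens at those positions.
        corrupted[positions] = rng.integers(0, VOCAB_SIZE, size=n_corrupt)

        # Discriminator targets: 1 = replaced, 0 = original. If the generator
        # happens to sample the original token, the position counts as original.
        labels = (corrupted != token_ids).astype(np.int64)
        return corrupted, labels

    tokens = rng.integers(0, VOCAB_SIZE, size=12)
    corrupted, labels = make_rtd_example(tokens)
    # The discriminator (the ELECTRA encoder) is trained to predict `labels`
    # at every position of `corrupted`.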

Quick Start & Requirements

  • Install: Requires Python 3, TensorFlow 1.15, NumPy, scikit-learn, and SciPy.
  • Pre-training:
    • Prepare data using build_pretraining_dataset.py (requires BERT's vocabulary); see the command sketch after this list.
    • Train using run_pretraining.py. A small model trained on OpenWebText takes ~4 days on a V100 GPU.
    • Pre-training data (tfrecords) requires ~30GB disk space.
  • Fine-tuning:
    • Download pre-trained models or train your own.
    • Fine-tune using run_finetuning.py for tasks like GLUE, SQuAD, and sequence tagging; see the sketch after this list.
  • Links: Official Paper, Electric Paper, BERT Vocabulary
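
A hedged sketch of the pre-training workflow. The flag names and $DATA_DIR layout below follow the repository's README for build_pretraining_dataset.py and run_pretraining.py, but exact options may differ between versions, so verify against the repo before running.

    # $DATA_DIR holds BERT's vocab.txt, the raw text corpus, and the output tfrecords.
    # Build pre-training tfrecords from raw text (flags approximate; check the README).
    python3 build_pretraining_dataset.py \
      --corpus-dir $DATA_DIR/corpus \
      --vocab-file $DATA_DIR/vocab.txt \
      --output-dir $DATA_DIR/pretrain_tfrecords \
      --max-seq-length 128 \
      --num-processes 5

    # Pre-train a small ELECTRA model on the resulting tfrecords
    # (~4 days on a single V100 for the small model, per the notes above).
    python3 run_pretraining.py \
      --data-dir $DATA_DIR \
      --model-name electra_small_owt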
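
Similarly, a hedged sketch of fine-tuning with run_finetuning.py: the hparams JSON (model size and task names) follows the README's convention, and "cola" is just an illustrative GLUE task.

    # Fine-tune a pre-trained checkpoint (downloaded or self-trained) on a GLUE task.
    python3 run_finetuning.py \
      --data-dir $DATA_DIR \
      --model-name electra_small \
      --hparams '{"model_size": "small", "task_names": ["cola"]}'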

Highlighted Details

  • ELECTRA-Large achieves an 85.2 GLUE score, outperforming ALBERT and XLNet.
  • ELECTRA-Base (82.7 GLUE) outperforms BERT-Large.
  • ELECTRA-Small (77.4 GLUE) offers competitive performance without distillation.
  • Supports fine-tuning on GLUE, SQuAD (1.1 & 2.0), MRQA, and sequence tagging tasks.

Maintenance & Community

  • Developed by Google Research.
  • Contact: Kevin Clark (kevclark@cs.stanford.edu) for direct correspondence; submit GitHub issues for support.

Licensing & Compatibility

  • Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Requires TensorFlow 1.15; TensorFlow 2.0 support is planned but not guaranteed.
  • The original pre-training dataset used in the paper is not publicly available, requiring users to source their own data or use alternatives like OpenWebText.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Alex Cheema (co-founder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

Top 0.1% on sourcepulse · 806 stars · created 5 months ago · updated 2 weeks ago