This repository provides the TensorFlow code and pre-trained models for BERT (Bidirectional Encoder Representations from Transformers), a deep bidirectional Transformer encoder for language representation. It enables state-of-the-art performance on a wide array of Natural Language Processing tasks by pre-training on a large text corpus and then fine-tuning for specific downstream applications. The target audience includes NLP researchers and engineers looking to leverage powerful pre-trained language models.
How It Works
BERT is pre-trained with two unsupervised tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM masks 15% of the input tokens and trains the model to predict them, forcing it to learn deep bidirectional context. NSP trains the model to predict whether the second of two sentences actually follows the first in the original corpus. Unlike earlier unidirectional language models, BERT conditions on both left and right context at every layer, which yields richer contextual representations.
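As a rough illustration only (the repository's real pre-training data generation lives in `create_pretraining_data.py`), the MLM corruption step can be sketched as below; the function name, token list, and vocabulary are made up, and the 80/10/10 replacement split follows the paper.

```python
import random

def mask_for_mlm(tokens, vocab, mask_rate=0.15, seed=0):
    """Sketch of BERT's MLM corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random token, and keep 10% unchanged."""
    rng = random.Random(seed)
    tokens = list(tokens)
    targets = [None] * len(tokens)  # prediction targets at masked positions
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)
        # else: leave the original token in place
    return tokens, targets

# Toy WordPiece-style sequence; real inputs come from the repository's tokenizer.
masked, targets = mask_for_mlm(
    ["the", "man", "went", "to", "the", "store"],
    vocab=["the", "man", "went", "to", "store", "dog"],
)
```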
Quick Start & Requirements
- Install/Run: Primarily uses TensorFlow. Fine-tuning examples are provided via Python scripts (e.g., `run_classifier.py`, `run_squad.py`); see the sketch after this list.
- Prerequisites: TensorFlow 1.11.0 (the version the code was tested with) and Python 2 or 3. A GPU or Cloud TPU is recommended for fine-tuning, especially for BERT-Large.
- Resources: Fine-tuning BERT-Base on GLUE tasks can take minutes on a GPU. BERT-Large fine-tuning for SQuAD may require a Cloud TPU or careful memory management; the original experiments ran on a Cloud TPU with 64GB of device RAM.
- Links: TensorFlow Hub module and a Colab notebook (both linked from the repository README).
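As a quick orientation (and a strong simplification of what `run_classifier.py` does), the sketch below attaches a toy classification head using the repository's `modeling` API; the config values, input ids, and 2-class head are illustrative, and a real run would instead load `bert_config.json` and restore a released checkpoint.

```python
import tensorflow as tf  # TensorFlow 1.x
import modeling          # modeling.py from this repository

# Toy inputs that have already been converted to WordPiece ids.
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
segment_ids = tf.constant([[0, 0, 0], [0, 0, 0]])

# Illustrative small config; real runs use modeling.BertConfig.from_json_file.
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=segment_ids)

# Fine-tuning adds a task-specific head on the pooled [CLS] representation.
pooled = model.get_pooled_output()         # shape [batch_size, hidden_size]
logits = tf.layers.dense(pooled, units=2)  # e.g., a 2-class head for MRPC
```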
Highlighted Details
- Offers a variety of pre-trained models: BERT-Base/Large (cased/uncased), Whole Word Masking variants, Multilingual, and Chinese models.
- Includes code for fine-tuning on tasks like SQuAD, MultiNLI, and MRPC, as well as for feature extraction (see the sketch after this list).
- Introduces smaller BERT models (BERT-Tiny, Mini, Small, Medium) for resource-constrained environments.
- Achieves state-of-the-art results on numerous NLP benchmarks, including SQuAD 1.1 and 2.0.
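For feature extraction the repository ships `extract_features.py`; the minimal sketch below shows the underlying idea with `tokenization.FullTokenizer` and `BertModel.get_sequence_output()`, which returns one contextual vector per token rather than the pooled [CLS] vector. The checkpoint directory path is a placeholder for a downloaded BERT-Base model.

```python
import tensorflow as tf  # TensorFlow 1.x
import modeling
import tokenization

BERT_DIR = "uncased_L-12_H-768_A-12"  # placeholder: path to a downloaded model

# WordPiece-tokenize a sentence and map it to vocabulary ids.
tokenizer = tokenization.FullTokenizer(
    vocab_file=BERT_DIR + "/vocab.txt", do_lower_case=True)
tokens = ["[CLS]"] + tokenizer.tokenize("BERT extracts contextual features.") + ["[SEP]"]
input_ids = tf.constant([tokenizer.convert_tokens_to_ids(tokens)])

# Build the model from the released config; restore the checkpoint before use.
config = modeling.BertConfig.from_json_file(BERT_DIR + "/bert_config.json")
model = modeling.BertModel(config=config, is_training=False, input_ids=input_ids)

# One contextual vector per WordPiece token: [batch, seq_len, hidden_size].
sequence_output = model.get_sequence_output()
```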
Maintenance & Community
- Developed by Google Research.
- Support is handled primarily through GitHub issues.
- Third-party PyTorch and Chainer implementations are available from HuggingFace and Sosuke Kobayashi, respectively.
Licensing & Compatibility
- Released under the Apache 2.0 license.
- Compatible with commercial use and closed-source linking.
Limitations & Caveats
- BERT-Large fine-tuning is memory-intensive and can exceed the 12-16GB of device RAM on typical consumer GPUs, necessitating gradient accumulation or gradient checkpointing (neither is implemented in this release; see the sketch after this list).
- Pre-training from scratch is computationally expensive and time-consuming.
- The original C++ pre-training code is not included.
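Although the release does not include gradient accumulation, the idea is straightforward to sketch in TensorFlow 1.x graph mode: sum gradients over several small batches, then apply them once. Everything below (the stand-in loss, variable names, accumulation interval of 4) is hypothetical and not part of the repository.

```python
import tensorflow as tf  # TensorFlow 1.x graph mode

# Stand-in loss; in practice this would be the BERT fine-tuning loss.
x = tf.placeholder(tf.float32, [None, 8])
w = tf.get_variable("w", [8, 2])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.AdamOptimizer(learning_rate=2e-5)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# One non-trainable accumulator per variable.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_ops = [a.assign(tf.zeros_like(a)) for a in accum]       # run once per cycle, before accumulating
accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]  # run for each micro-batch
scaled = [a / 4.0 for a in accum]                            # average over 4 micro-batches
apply_op = opt.apply_gradients(zip(scaled, tvars))           # run once per cycle, after accumulating
```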