bert_language_understanding by brightmart

Pre-training toolkit for language understanding tasks

created 6 years ago
964 stars

Top 39.0% on sourcepulse

Project Summary

This repository provides a TensorFlow implementation inspired by BERT and the Transformer architecture, focusing on pre-training and fine-tuning strategies for Natural Language Understanding (NLU) tasks. It aims to simplify the adoption of these powerful techniques, particularly by demonstrating that pre-training with a Masked Language Model (MLM) can significantly boost performance even when using simpler backbone architectures like TextCNN, and on smaller datasets.

How It Works

The core idea is to leverage the pre-train and fine-tune paradigm, which is presented as model- and task-independent. The implementation includes a Masked Language Model (MLM) pre-training task, where words are masked and the model learns to reconstruct them from context. This is followed by a fine-tuning stage for specific downstream tasks. A key differentiator is the successful application of this pre-training strategy to a TextCNN backbone, showing substantial performance gains and faster convergence compared to training from scratch.
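
As an illustration of the masking step, below is a minimal sketch (not the repository's actual code; the mask symbol, masking probability, and function name are assumptions) of how masked-language-model training examples can be built:

    import random

    MASK_TOKEN = "[MASK]"  # assumption: a reserved mask symbol in the vocabulary

    def mask_tokens(tokens, mask_prob=0.15):
        # Hide a random fraction of tokens and record the originals as
        # reconstruction targets; the model learns to predict them from context.
        masked, targets = list(tokens), {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                targets[i] = tok
                masked[i] = MASK_TOKEN
        return masked, targets

    # Example: mask_tokens("the cat sat on the mat".split()) might return
    # (['the', '[MASK]', 'sat', 'on', 'the', 'mat'], {1: 'cat'})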

Quick Start & Requirements

  • Pre-train: python train_bert_lm.py
  • Fine-tune: python train_bert_fine_tuning.py
  • Prerequisites: Python 3+, TensorFlow 1.10.
  • Data: Requires raw text for pre-training and labeled examples for fine-tuning. Sample data and links to datasets are provided. See the data-preparation sketch after this list.
  • Resources: Pre-training on a single GPU for 5 hours with 2 million training examples (derived from 450k documents) is reported. Fine-tuning is significantly faster.
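
For the data bullet above, here is a minimal data-preparation sketch, assuming whitespace-tokenized raw text split into fixed-length windows; the window size, file name, and expected input format are assumptions, not the repository's documented interface:

    def make_pretrain_examples(documents, window=64):
        # Split each raw document into windows of `window` whitespace tokens,
        # yielding one pre-training example per window.
        for doc in documents:
            tokens = doc.split()
            for start in range(0, len(tokens), window):
                yield " ".join(tokens[start:start + window])

    # Hypothetical usage: write one example per line for the pre-training step.
    docs = ["replace this with your raw corpus ..."]
    with open("pretrain_examples.txt", "w", encoding="utf-8") as out:
        for example in make_pretrain_examples(docs):
            out.write(example + "\n")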

Highlighted Details

  • Demonstrates significant F1 score improvements (up to 0.49) and reduced training loss when the TextCNN backbone is pre-trained with the masked language model.
  • Pre-training can accelerate fine-tuning convergence, sometimes requiring only a few epochs.
  • Offers flexibility to replace the backbone network and add custom pre-training tasks; a minimal backbone sketch follows this list.
  • Includes a "toy task" for basic model verification.

Maintenance & Community

  • The project actively seeks contributions, particularly for fixing bugs in the Transformer implementation and adding sentence pair tasks or next sentence prediction pre-training.
  • Contact email provided for suggestions and contributions.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • A bug exists in the Transformer implementation preventing convergence.
  • Position embeddings are not yet shared between the pre-training and fine-tuning stages.
  • The project is actively seeking contributors for several key features and bug fixes.

Health Check

  • Last commit: 6 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of the Machine Learning Engineering Open Book; research engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

xlnet by zihangdai: 6k stars, Top 0.0% on sourcepulse. Language model research paper using generalized autoregressive pretraining. Created 6 years ago, updated 2 years ago.