bert_language_understanding by brightmart

Pre-training toolkit for language understanding tasks

created 6 years ago
964 stars

Top 39.0% on sourcepulse

Project Summary

This repository provides a TensorFlow implementation inspired by BERT and the Transformer architecture, focusing on pre-training and fine-tuning strategies for Natural Language Understanding (NLU) tasks. It aims to simplify the adoption of these powerful techniques, particularly by demonstrating that pre-training with a Masked Language Model (MLM) can significantly boost performance even when using simpler backbone architectures like TextCNN, and on smaller datasets.

How It Works

The core idea is to leverage the pre-train and fine-tune paradigm, which is presented as model- and task-independent. The implementation includes a Masked Language Model (MLM) pre-training task, where words are masked and the model learns to reconstruct them from context. This is followed by a fine-tuning stage for specific downstream tasks. A key differentiator is the successful application of this pre-training strategy to a TextCNN backbone, showing substantial performance gains and faster convergence compared to training from scratch.
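
As an illustration of the masking step, below is a minimal sketch (not the repository's actual code; the mask symbol, masking probability, and function name are assumptions) of how masked-language-model training examples can be built:

    import random

    MASK_TOKEN = "[MASK]"  # assumption: a reserved mask symbol in the vocabulary

    def mask_tokens(tokens, mask_prob=0.15):
        # Hide a random fraction of tokens and record the originals as
        # reconstruction targets; the model learns to predict them from context.
        masked, targets = list(tokens), {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                targets[i] = tok
                masked[i] = MASK_TOKEN
        return masked, targets

    # Example: mask_tokens("the cat sat on the mat".split()) might return
    # (['the', '[MASK]', 'sat', 'on', 'the', 'mat'], {1: 'cat'})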

Quick Start & Requirements

  • Pre-train: python train_bert_lm.py
  • Fine-tune: python train_bert_fine_tuning.py
  • Prerequisites: Python 3+, TensorFlow 1.10.
  • Data: Requires raw text for pre-training and labeled examples for fine-tuning. Sample data and links to datasets are provided. See the data-preparation sketch after this list.
  • Resources: Pre-training on a single GPU for 5 hours with 2 million training examples (derived from 450k documents) is reported. Fine-tuning is significantly faster.
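
For the data bullet above, here is a minimal data-preparation sketch, assuming whitespace-tokenized raw text split into fixed-length windows; the window size, file name, and expected input format are assumptions, not the repository's documented interface:

    def make_pretrain_examples(documents, window=64):
        # Split each raw document into windows of `window` whitespace tokens,
        # yielding one pre-training example per window.
        for doc in documents:
            tokens = doc.split()
            for start in range(0, len(tokens), window):
                yield " ".join(tokens[start:start + window])

    # Hypothetical usage: write one example per line for the pre-training step.
    docs = ["replace this with your raw corpus ..."]
    with open("pretrain_examples.txt", "w", encoding="utf-8") as out:
        for example in make_pretrain_examples(docs):
            out.write(example + "\n")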

Highlighted Details

  • Demonstrates significant F1 score improvements (up to 0.49) and reduced training loss when the TextCNN backbone is pre-trained with the masked language model.
  • Pre-training can accelerate fine-tuning convergence, sometimes requiring only a few epochs.
  • Offers flexibility to replace the backbone network and add custom pre-training tasks; a minimal backbone sketch follows this list.
  • Includes a "toy task" for basic model verification.

Maintenance & Community

  • The project actively seeks contributions, particularly for fixing bugs in the Transformer implementation and adding sentence pair tasks or next sentence prediction pre-training.
  • Contact email provided for suggestions and contributions.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • A bug exists in the Transformer implementation preventing convergence.
  • Position embeddings are not yet shared between the pre-training and fine-tuning stages.
  • The project is actively seeking contributors for several key features and bug fixes.

Health Check

  • Last commit: 6 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of the Machine Learning Engineering Open Book; research engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

xlnet by zihangdai: 6k stars, Top 0.0% on sourcepulse. Language model research paper using generalized autoregressive pretraining. Created 6 years ago, updated 2 years ago.