BERT4doc-Classification by xuyige

Code for the research paper on BERT fine-tuning for text classification

created 5 years ago · 637 stars · Top 53.0% on sourcepulse

Project Summary

This repository provides code and resources for fine-tuning BERT models for text classification, based on the paper "How to Fine-Tune BERT for Text Classification?". It offers a comprehensive guide for researchers and practitioners looking to optimize BERT performance on various text classification tasks.

How It Works

The project details two primary fine-tuning approaches: further pre-training BERT on domain-specific data, and fine-tuning on the downstream task. It supports both TensorFlow and PyTorch, with utilities for converting checkpoints between the two frameworks. During fine-tuning, features can be extracted flexibly by selecting specific BERT layers or concatenating their outputs, and the code includes strategies for handling long texts and several pooling methods, as sketched below.
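As a concrete illustration, the sketch below extracts features from selected BERT layers using the Hugging Face transformers API, a modern stand-in for the repo's pytorch-pretrained-bert code; the model name and the choice of the last four layers are illustrative assumptions, not the repo's defaults.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("An example sentence to classify.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds 13 tensors: the embedding output plus one per
# transformer layer, each of shape [batch, seq_len, 768].
hidden_states = outputs.hidden_states

# Concatenate the [CLS] vector from the last four layers -- one of the
# feature-combination strategies the paper compares.
cls_per_layer = [layer[:, 0, :] for layer in hidden_states[-4:]]
features = torch.cat(cls_per_layer, dim=-1)  # shape [batch, 4 * 768]
```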

Quick Start & Requirements

  • Installation: Requires TensorFlow 1.x or PyTorch (0.4.1 to 1.2.0). Python 3.7 or earlier is recommended for TensorFlow 1.x compatibility.
  • Data Preparation: Includes scripts for preparing datasets like Sogou News and AG's News.
  • BERT Models: Requires downloading BERT-Base (Uncased or Chinese) checkpoints.
  • Further Pre-training: Scripts generate_corpus_agnews.py, create_pretraining_data.py, and run_pretraining.py are provided.
  • Fine-tuning: Uses run_classifier_single_layer.py and run_classifier_discriminative.py; a minimal modern skeleton follows this list.
  • Checkpoint Conversion: convert_tf_checkpoint_to_pytorch.py is available.
  • Resources: Requires significant computational resources for pre-training and fine-tuning, including GPUs.
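For orientation, here is a minimal fine-tuning skeleton written against the current transformers API rather than the repo's run_classifier_single_layer.py; the texts, label ids, and hyperparameters are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4  # AG's News has four classes
)

texts = ["Stocks rally on strong earnings.", "Team clinches the title."]
labels = torch.tensor([2, 1])  # hypothetical label ids

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

model.train()
optimizer = AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # loss is computed when labels are given
outputs.loss.backward()
optimizer.step()
```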

Highlighted Details

  • Investigates various fine-tuning methods for BERT on text classification.
  • Provides a general solution for BERT fine-tuning.
  • Supports layer-wise decreasing learning rates for fine-tuning (see the first sketch after this list).
  • Offers strategies for handling long texts (trunc_medium) and feature selection (layers); a head+tail truncation sketch follows.
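The layer-wise decreasing learning rate can be implemented with per-layer parameter groups. The sketch below assumes a transformers BertForSequenceClassification model; the 0.95 decay factor follows the paper's best-reported setting, while the grouping helper is our own illustration.

```python
from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.95):
    num_layers = model.config.num_hidden_layers  # 12 for BERT-Base
    groups = []
    for i in range(num_layers):
        # The topmost layer (i = num_layers - 1) keeps base_lr; each
        # lower layer's rate shrinks by a further factor of `decay`.
        groups.append({
            "params": model.bert.encoder.layer[i].parameters(),
            "lr": base_lr * decay ** (num_layers - 1 - i),
        })
    # Embeddings sit below the lowest layer, so they decay one step more.
    groups.append({"params": model.bert.embeddings.parameters(),
                   "lr": base_lr * decay ** num_layers})
    # The freshly initialized classification head trains at the base rate.
    groups.append({"params": model.classifier.parameters(), "lr": base_lr})
    return groups

optimizer = AdamW(layerwise_lr_groups(model), lr=2e-5)
```

For long inputs, the paper finds head+tail truncation works best: keep the first 128 and last 382 word-piece tokens so the sequence fits BERT's 512-token limit once [CLS] and [SEP] are added. A minimal sketch follows; the function name is ours, not the repo's trunc_medium implementation.

```python
def head_tail_truncate(tokens, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens of an over-long input."""
    if len(tokens) <= head + tail:
        return tokens
    return tokens[:head] + tokens[-tail:]

# Usage with a word-piece token list:
# tokens = tokenizer.tokenize(long_document)
# tokens = head_tail_truncate(tokens)
```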

Maintenance & Community

The project's last update was March 14, 2020. The code accompanies the paper "How to Fine-Tune BERT for Text Classification?" (2019). Checkpoints from further pre-training are available on request by email.

Licensing & Compatibility

The README does not explicitly state a license. It notes that code is borrowed from Google BERT and pytorch-pretrained-bert (now transformers), so the licenses of those projects may apply. Suitability for commercial use is not specified.

Limitations & Caveats

The project relies on older versions of TensorFlow (1.x) and PyTorch (<=1.2.0), which may limit compatibility with current environments and libraries. The last update was in 2020, suggesting potential maintenance gaps or unaddressed compatibility issues with newer BERT variants or frameworks.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
