BERT4doc-Classification by xuyige

Code for the research paper on BERT fine-tuning for text classification

created 5 years ago · 637 stars · Top 53.0% on sourcepulse

Project Summary

This repository provides code and resources for fine-tuning BERT models for text classification, based on the paper "How to Fine-Tune BERT for Text Classification?". It offers a comprehensive guide for researchers and practitioners looking to optimize BERT performance on various text classification tasks.

How It Works

The project details two primary fine-tuning approaches: further pre-training BERT on domain-specific data, and fine-tuning on the downstream task. It supports both TensorFlow and PyTorch, with utilities for converting checkpoints between the two frameworks. During fine-tuning, features can be extracted flexibly by selecting specific BERT layers or concatenating their outputs, and the code includes strategies for handling long texts and several pooling methods, as sketched below.
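As a concrete illustration, the sketch below extracts features from selected BERT layers using the Hugging Face transformers API, a modern stand-in for the repo's pytorch-pretrained-bert code; the model name and the choice of the last four layers are illustrative assumptions, not the repo's defaults.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("An example sentence to classify.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds 13 tensors: the embedding output plus one per
# transformer layer, each of shape [batch, seq_len, 768].
hidden_states = outputs.hidden_states

# Concatenate the [CLS] vector from the last four layers -- one of the
# feature-combination strategies the paper compares.
cls_per_layer = [layer[:, 0, :] for layer in hidden_states[-4:]]
features = torch.cat(cls_per_layer, dim=-1)  # shape [batch, 4 * 768]
```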

Quick Start & Requirements

  • Installation: Requires TensorFlow 1.x or PyTorch (0.4.1 to 1.2.0). Python 3.7 or earlier is recommended for TensorFlow 1.x compatibility.
  • Data Preparation: Includes scripts for preparing datasets like Sogou News and AG's News.
  • BERT Models: Requires downloading BERT-Base (Uncased or Chinese) checkpoints.
  • Further Pre-training: Scripts generate_corpus_agnews.py, create_pretraining_data.py, and run_pretraining.py are provided.
  • Fine-tuning: Uses run_classifier_single_layer.py and run_classifier_discriminative.py; a minimal modern skeleton follows this list.
  • Checkpoint Conversion: convert_tf_checkpoint_to_pytorch.py is available.
  • Resources: Requires significant computational resources for pre-training and fine-tuning, including GPUs.
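For orientation, here is a minimal fine-tuning skeleton written against the current transformers API rather than the repo's run_classifier_single_layer.py; the texts, label ids, and hyperparameters are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4  # AG's News has four classes
)

texts = ["Stocks rally on strong earnings.", "Team clinches the title."]
labels = torch.tensor([2, 1])  # hypothetical label ids

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

model.train()
optimizer = AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # loss is computed when labels are given
outputs.loss.backward()
optimizer.step()
```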

Highlighted Details

  • Investigates various fine-tuning methods for BERT on text classification.
  • Provides a general solution for BERT fine-tuning.
  • Supports layer-wise decreasing learning rates for fine-tuning (see the first sketch after this list).
  • Offers strategies for handling long texts (trunc_medium) and feature selection (layers); a head+tail truncation sketch follows.
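The layer-wise decreasing learning rate can be implemented with per-layer parameter groups. The sketch below assumes a transformers BertForSequenceClassification model; the 0.95 decay factor follows the paper's best-reported setting, while the grouping helper is our own illustration.

```python
from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.95):
    num_layers = model.config.num_hidden_layers  # 12 for BERT-Base
    groups = []
    for i in range(num_layers):
        # The topmost layer (i = num_layers - 1) keeps base_lr; each
        # lower layer's rate shrinks by a further factor of `decay`.
        groups.append({
            "params": model.bert.encoder.layer[i].parameters(),
            "lr": base_lr * decay ** (num_layers - 1 - i),
        })
    # Embeddings sit below the lowest layer, so they decay one step more.
    groups.append({"params": model.bert.embeddings.parameters(),
                   "lr": base_lr * decay ** num_layers})
    # The freshly initialized classification head trains at the base rate.
    groups.append({"params": model.classifier.parameters(), "lr": base_lr})
    return groups

optimizer = AdamW(layerwise_lr_groups(model), lr=2e-5)
```

For long inputs, the paper finds head+tail truncation works best: keep the first 128 and last 382 word-piece tokens so the sequence fits BERT's 512-token limit once [CLS] and [SEP] are added. A minimal sketch follows; the function name is ours, not the repo's trunc_medium implementation.

```python
def head_tail_truncate(tokens, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens of an over-long input."""
    if len(tokens) <= head + tail:
        return tokens
    return tokens[:head] + tokens[-tail:]

# Usage with a word-piece token list:
# tokens = tokenizer.tokenize(long_document)
# tokens = head_tail_truncate(tokens)
```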

Maintenance & Community

The project's last update was March 14, 2020. The code accompanies the paper "How to Fine-Tune BERT for Text Classification?" (2019). Checkpoints from further pre-training are available on request by email.

Licensing & Compatibility

The README does not explicitly state a license. It notes that code is borrowed from Google BERT and pytorch-pretrained-bert (now transformers), so the licenses of those projects may apply. Suitability for commercial use is not specified.

Limitations & Caveats

The project relies on older versions of TensorFlow (1.x) and PyTorch (<=1.2.0), which may limit compatibility with current environments and libraries. The last update was in 2020, suggesting potential maintenance gaps or unaddressed compatibility issues with newer BERT variants or frameworks.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
