BertSum by nlpyang

Code for extractive summarization via fine-tuned BERT

created 6 years ago
1,489 stars

Top 28.2% on sourcepulse

View on GitHub
Project Summary

This repository provides the code for fine-tuning BERT for extractive summarization, targeting researchers and practitioners in Natural Language Processing. It improves ROUGE scores over baseline models on the CNN/DailyMail dataset by stacking different summarization layers on top of a BERT encoder.

How It Works

BertSum leverages BERT's contextual embeddings to identify salient sentences for extractive summarization. It explores three summarization layers stacked on the BERT sentence representations: a simple classifier, an inter-sentence Transformer, and an RNN. The BERT+Transformer variant performs best, reaching state-of-the-art results at the time of publication: BERT encodes each sentence (via a per-sentence [CLS] token), and Transformer layers over these sentence vectors score them for inclusion, capturing document-level dependencies between sentences.
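A minimal conceptual sketch of this scoring step in PyTorch may help; it is not the repository's code. The class name SentenceClassifier and the random stand-in tensors are illustrative assumptions: in BertSum the sentence vectors come from BERT's output at the per-sentence [CLS] positions, and the Transformer/RNN variants insert inter-sentence layers before the final sigmoid shown here.

```python
# Conceptual sketch only (not BertSum's implementation): score sentences with
# the simplest summarization layer, a per-sentence linear classifier + sigmoid.
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):  # hypothetical name, for illustration
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, cls_vectors: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # cls_vectors: (batch, n_sents, hidden) -- one vector per sentence,
        # in BertSum taken at each sentence's [CLS] position in BERT's output.
        scores = torch.sigmoid(self.linear(cls_vectors)).squeeze(-1)
        return scores * mask  # zero out padded sentences

# Random stand-ins for BERT sentence vectors: 2 documents x 5 sentences each.
cls_vectors = torch.randn(2, 5, 768)
mask = torch.ones(2, 5)

scores = SentenceClassifier()(cls_vectors, mask)   # (2, 5) per-sentence salience scores
top_sents = scores.topk(3, dim=1).indices          # indices of the 3 top-scored sentences
print(top_sents)
```

The extractive summary is then formed by selecting the top-scored sentences from the source document; the Transformer and RNN variants differ only in how they transform the sentence vectors before scoring.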

Quick Start & Requirements

  • Install: Requires Python 3.6.
  • Dependencies: PyTorch, pytorch_pretrained_bert, tensorboardX, multiprocess, pyrouge.
  • Data Prep: Involves downloading pre-processed data or tokenizing raw stories using Stanford CoreNLP, followed by formatting into PyTorch-compatible binary files. This process can be time-consuming.
  • Training: Commands are provided for training BERT+Classifier, BERT+Transformer, and BERT+RNN models, with options for single or multi-GPU training.
  • Evaluation: Scripts are available for model evaluation using ROUGE scores (see the pyrouge sketch after this list).
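For the ROUGE step, here is a hedged sketch of typical pyrouge usage (pyrouge wraps the Perl ROUGE-1.5.5 toolkit, which must be installed and configured separately); the directory names and filename patterns are placeholders, not paths from this repository.

```python
# Illustrative pyrouge usage; directory names and filename patterns are placeholders.
from pyrouge import Rouge155

r = Rouge155()  # assumes a local ROUGE-1.5.5 installation configured for pyrouge
r.system_dir = "results/candidate"      # system (extracted) summaries, one file per document
r.model_dir = "results/reference"       # gold reference summaries
r.system_filename_pattern = r"cand.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"

output = r.convert_and_evaluate()       # runs ROUGE and returns its raw text report
scores = r.output_to_dict(output)
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```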

Highlighted Details

  • Achieves ROUGE-1 43.25, ROUGE-2 20.24, and ROUGE-L 39.63 on CNN/DailyMail with the BERT+Transformer model.
  • Supports multiple summarization layers (Classifier, Transformer, RNN) on top of the shared BERT encoder.
  • Includes detailed data preparation steps and training scripts.

Maintenance & Community

The project is associated with a research paper, indicating a focus on academic contributions. No specific community channels or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

The data preparation process is complex and requires external tools like Stanford CoreNLP. The README does not specify the exact BERT model used (e.g., base, large) or provide pre-trained models for direct use, necessitating custom training.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 90 days
