BertSum by nlpyang

Code for extractive summarization via fine-tuned BERT

created 6 years ago
1,489 stars

Top 28.2% on sourcepulse

View on GitHub
Project Summary

This repository provides the code for fine-tuning BERT for extractive summarization, targeting researchers and practitioners in Natural Language Processing. It improves ROUGE scores over baseline models on the CNN/DailyMail dataset by stacking different summarization layers on top of a BERT encoder.

How It Works

BertSum leverages BERT's contextual embeddings to identify salient sentences for extractive summarization. It explores three summarization layers stacked on the BERT sentence representations: a simple classifier, an inter-sentence Transformer, and an RNN. The BERT+Transformer variant performs best, reaching state-of-the-art results at the time of publication: BERT encodes each sentence (via a per-sentence [CLS] token), and Transformer layers over these sentence vectors score them for inclusion, capturing document-level dependencies between sentences.
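A minimal conceptual sketch of this scoring step in PyTorch may help; it is not the repository's code. The class name SentenceClassifier and the random stand-in tensors are illustrative assumptions: in BertSum the sentence vectors come from BERT's output at the per-sentence [CLS] positions, and the Transformer/RNN variants insert inter-sentence layers before the final sigmoid shown here.

```python
# Conceptual sketch only (not BertSum's implementation): score sentences with
# the simplest summarization layer, a per-sentence linear classifier + sigmoid.
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):  # hypothetical name, for illustration
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, cls_vectors: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # cls_vectors: (batch, n_sents, hidden) -- one vector per sentence,
        # in BertSum taken at each sentence's [CLS] position in BERT's output.
        scores = torch.sigmoid(self.linear(cls_vectors)).squeeze(-1)
        return scores * mask  # zero out padded sentences

# Random stand-ins for BERT sentence vectors: 2 documents x 5 sentences each.
cls_vectors = torch.randn(2, 5, 768)
mask = torch.ones(2, 5)

scores = SentenceClassifier()(cls_vectors, mask)   # (2, 5) per-sentence salience scores
top_sents = scores.topk(3, dim=1).indices          # indices of the 3 top-scored sentences
print(top_sents)
```

The extractive summary is then formed by selecting the top-scored sentences from the source document; the Transformer and RNN variants differ only in how they transform the sentence vectors before scoring.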

Quick Start & Requirements

  • Install: Requires Python 3.6.
  • Dependencies: PyTorch, pytorch_pretrained_bert, tensorboardX, multiprocess, pyrouge.
  • Data Prep: Involves downloading pre-processed data or tokenizing raw stories using Stanford CoreNLP, followed by formatting into PyTorch-compatible binary files. This process can be time-consuming.
  • Training: Commands are provided for training BERT+Classifier, BERT+Transformer, and BERT+RNN models, with options for single or multi-GPU training.
  • Evaluation: Scripts are available for model evaluation using ROUGE scores (see the pyrouge sketch after this list).
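For the ROUGE step, here is a hedged sketch of typical pyrouge usage (pyrouge wraps the Perl ROUGE-1.5.5 toolkit, which must be installed and configured separately); the directory names and filename patterns are placeholders, not paths from this repository.

```python
# Illustrative pyrouge usage; directory names and filename patterns are placeholders.
from pyrouge import Rouge155

r = Rouge155()  # assumes a local ROUGE-1.5.5 installation configured for pyrouge
r.system_dir = "results/candidate"      # system (extracted) summaries, one file per document
r.model_dir = "results/reference"       # gold reference summaries
r.system_filename_pattern = r"cand.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"

output = r.convert_and_evaluate()       # runs ROUGE and returns its raw text report
scores = r.output_to_dict(output)
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```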

Highlighted Details

  • Achieves ROUGE-1 43.25, ROUGE-2 20.24, and ROUGE-L 39.63 on CNN/DailyMail with the BERT+Transformer model.
  • Supports multiple summarization layers (Classifier, Transformer, RNN) on top of the shared BERT encoder.
  • Includes detailed data preparation steps and training scripts.

Maintenance & Community

The project is associated with a research paper, indicating a focus on academic contributions. No specific community channels or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

The data preparation process is complex and requires external tools like Stanford CoreNLP. The README does not specify the exact BERT model used (e.g., base, large) or provide pre-trained models for direct use, necessitating custom training.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 90 days
