BERT-Relation-Extraction by plkmo

PyTorch scripts for relation extraction, based on BERT

created 5 years ago
598 stars

Top 55.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of relation extraction with BERT and its variants (ALBERT, BioBERT), based on the "Matching the Blanks" (MTB) methodology. It targets NLP researchers and practitioners who want to leverage pre-trained language models to identify relationships between entities in text. The main benefit is improved relation extraction through distributional similarity learned during the MTB pre-training task.

How It Works

The core approach is a two-stage process: pre-training and fine-tuning. During pre-training, spaCy is used to detect entities in continuous text and construct relation statements around entity pairs. Entity mentions are randomly replaced with [BLANK] tokens, and the model learns to judge whether two relation statements refer to the same entity pair, so it cannot rely on surface forms alone; this is what captures distributional similarity between entity pairs. For fine-tuning, the pre-trained models are adapted to specific relation extraction datasets such as SemEval2010 Task 8 and FewRel.
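
To make the relation-statement construction concrete, below is a minimal Python sketch assuming spaCy's en_core_web_lg model. It is illustrative only, not the repository's preprocessing code; the 0.7 blanking probability mirrors the value reported in the MTB paper, and pairing only the first two entities per sentence is a simplification.

    import random
    import spacy

    nlp = spacy.load("en_core_web_lg")

    def relation_statements(text, blank_prob=0.7):
        """Build MTB-style relation statements: mark an entity pair with
        [E1]/[E2] spans and randomly replace mentions with [BLANK]."""
        statements = []
        for sent in nlp(text).sents:
            ents = list(sent.ents)
            if len(ents) < 2:
                continue
            e1, e2 = ents[0], ents[1]  # simplistic pairing for illustration
            e1_text = "[BLANK]" if random.random() < blank_prob else e1.text
            e2_text = "[BLANK]" if random.random() < blank_prob else e2.text
            offset = sent.start_char  # convert doc offsets to sentence offsets
            marked = (
                sent.text[: e1.start_char - offset]
                + "[E1]" + e1_text + "[/E1]"
                + sent.text[e1.end_char - offset : e2.start_char - offset]
                + "[E2]" + e2_text + "[/E2]"
                + sent.text[e2.end_char - offset :]
            )
            # Key each statement by its entity pair.
            statements.append(((e1.text, e2.text), marked))
        return statements

    print(relation_statements("Barack Obama was born in Hawaii in 1961."))

During MTB training, statements that share the same entity-pair key serve as positive pairs and the rest as negatives, which is what drives the distributional similarity described above.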

Quick Start & Requirements

  • Install: python3 -m pip install -r requirements.txt
  • Prerequisites: Python 3.8+, bash, spaCy (python3 -m spacy download en_core_web_lg), pre-trained HuggingFace BERT/ALBERT models, and optionally BioBERT models downloaded to ./additional_models. A quick import check is sketched after this list.
  • Datasets: SemEval2010 Task 8 and FewRel 1.0 datasets need to be downloaded and placed in the ./data/ directory.
  • Pre-training data: A .txt file (e.g., cnn.txt) is required for pre-training.
  • Documentation: Towards Data Science article
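
As a quick sanity check that the prerequisites are in place, the following snippet (illustrative, not part of the repository) verifies that the spaCy model and a HuggingFace tokenizer load:

    # Environment sanity check (illustrative; not from the repository).
    import spacy
    import torch
    from transformers import BertTokenizer

    nlp = spacy.load("en_core_web_lg")  # fails if the spaCy model was not downloaded
    tok = BertTokenizer.from_pretrained("bert-base-uncased")  # fetches HuggingFace weights
    print("spaCy pipeline:", nlp.pipe_names)
    print("tokenizer vocab size:", tok.vocab_size)
    print("CUDA available:", torch.cuda.is_available())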

Highlighted Details

  • Supports BERT, ALBERT, and BioBERT architectures.
  • Implements the "Matching the Blanks" (MTB) pre-training strategy.
  • Achieves 72.766% accuracy on FewRel (5-way 1-shot) with BERT-large.
  • Provides inference for both manually annotated and automatically detected entities (see the annotated-input sketch below).
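
For the annotated-entity path, inputs mark the two entities with [E1]/[E2] spans. The sketch below shows how such an input would typically be fed to a fine-tuned BERT classifier; it is a hedged illustration, not the repository's API, and the ./trained_model checkpoint path is hypothetical.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Hypothetical fine-tuned checkpoint whose tokenizer already includes
    # the [E1]/[/E1]/[E2]/[/E2] marker tokens.
    tokenizer = BertTokenizer.from_pretrained("./trained_model")
    model = BertForSequenceClassification.from_pretrained("./trained_model").eval()

    sentence = "The [E1]company[/E1] fabricates plastic [E2]chairs[/E2]."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print("predicted relation id:", logits.argmax(dim=-1).item())

For the automatically detected path, entities are presumably found with spaCy (which the project already uses for pre-training data construction) and the markers inserted before classification.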

Maintenance & Community

This is an unofficial implementation. The author solicits sponsorships. No community links (Discord, Slack) or roadmap are provided.

Licensing & Compatibility

The repository does not declare a license. It relies on pre-trained models from HuggingFace and BioBERT, each distributed under its own license. Suitability for commercial use is not specified.

Limitations & Caveats

The README notes that the pre-training corpus used (CNN, via cnn.txt) is smaller than the Wikipedia dumps used in the original paper, which may limit performance. The BioBERT model requires manual download and placement in ./additional_models. The repository is unofficial.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
