PyTorch scripts for relation extraction, based on BERT
This repository provides a PyTorch implementation of relation extraction using BERT and its variants (ALBERT, BioBERT), based on the "Matching the Blanks" (MTB) methodology. It targets NLP researchers and practitioners who want to leverage pre-trained language models for identifying relationships between entities in text. The primary benefit is improved relation extraction through distributional similarity learned via the MTB pre-training task.
How It Works
The core approach is a two-stage process: pre-training followed by fine-tuning. During pre-training, spaCy identifies entities in continuous raw text and constructs relation statements from them; entity mentions are then randomly replaced with [BLANK] tokens, and the model is trained so that relation statements sharing the same entity pair produce similar representations, capturing distributional similarity between entity pairs. For fine-tuning, the pre-trained models are adapted to specific relation extraction datasets such as SemEval 2010 Task 8 and FewRel.
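The following is a minimal sketch of the statement-construction and blanking step, assuming the en_core_web_lg spaCy pipeline and the MTB paper's blanking probability of 0.7; the function names and details here are illustrative, not the repository's actual code.

# Illustrative sketch of MTB-style relation statement construction.
import random
import spacy

nlp = spacy.load("en_core_web_lg")  # installed in Quick Start below
BLANK_PROB = 0.7  # assumed; the MTB paper uses alpha = 0.7

def relation_statements(text):
    # Yield (sentence, entity1, entity2) for every pair of named
    # entities co-occurring in a sentence.
    doc = nlp(text)
    for sent in doc.sents:
        ents = list(sent.ents)
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):
                yield sent.text, ents[i].text, ents[j].text

def blank(sent, e1, e2):
    # Replace each entity mention with [BLANK] independently,
    # so the model cannot rely on entity surface forms.
    if random.random() < BLANK_PROB:
        sent = sent.replace(e1, "[BLANK]")
    if random.random() < BLANK_PROB:
        sent = sent.replace(e2, "[BLANK]")
    return sent

for s, e1, e2 in relation_statements("Barack Obama was born in Hawaii."):
    print(blank(s, e1, e2))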
Quick Start & Requirements
python3 -m pip install -r requirements.txt
python3 -m spacy download en_core_web_lg
Requirements: Python 3 with the packages in requirements.txt, the spaCy en_core_web_lg model (installed by the commands above), HuggingFace BERT/ALBERT models, and optionally BioBERT models downloaded to ./additional_models. A .txt file of continuous raw text (e.g., cnn.txt) placed in the ../data/ directory is required for pre-training.
Maintenance & Community
This is an unofficial repository. The author solicits sponsorships, but no community links (Discord, Slack) or roadmap are provided. The last update was about a year ago, and the project is marked inactive.
Licensing & Compatibility
The repository does not explicitly state a license. It relies on pre-trained models from HuggingFace and BioBERT, each of which carries its own license; suitability for commercial use is therefore unspecified.
Limitations & Caveats
The README notes that the pre-training data used (CNN) is smaller than the Wikipedia dumps used in the original paper, which may reduce performance. The BioBERT model requires manual download and placement in ./additional_models. The repository is unofficial.